Best LLM For Coding As Of May 2026: Opus 4.7 And GPT-5.5, The Copilot And The Cracked Dev

For months I brute-forced complex problems with a single agent. Lovable experiments, an experimental RAG pipeline in Claude Code, attempts to chain dynamic workflow agents with API access. I kept telling myself the models weren’t good enough yet. The truth was simpler. I hadn’t built a system that let them produce production-grade work.

The system that finally worked is not complicated. There is a loop with seven stages: brainstorm, initiative, plan, plan review, execution, code review, fixer. I run more than one frontier model inside that chain so they keep each other honest. Once I started working that way, the difference showed up fast.

That loop now runs many of my daily tasks at the company I work for, organizes and documents my work cycles, and handles every side project I touch. It is not just a coding tool. It runs my newsletters, video generation, SOPs, and documentation. Most of my work can be solved programmatically, and the system works for just about all of it. From inside it, “which LLM is best for coding” is the wrong question. The real question is which models fit where in a structured agent-loop hierarchy. That is where Opus 4.7 and GPT-5.5 Codex split into clearly different jobs.

I use Opus as the architect. It is the model I want on bigger-picture work, brainstorming, and high-level planning. I use GPT-5.5 Codex as the cracked dev who lives in his mom’s basement and gets shit done. Both are extremely capable. The differences start to show the moment you throw a complex initiative at them. If “best llm for coding” means “wins a one-prompt benchmark,” the question is uninteresting. If it means “still pulling weight on cycle four of a hard problem,” the answer depends on which job you are asking about. The rest of this post is how I split those jobs.

This split is fresh. The jump from Opus 4.5 and Codex 5.3 to Opus 4.7 and GPT-5.5 Codex was a real gear change, not a minor version bump. The roles I describe below would not have been this clean six months ago.

The real fork is not Claude Code vs Codex or Claude vs GPT. It is which seat each one is sitting in when the work gets hard. A single-model loop drifts. Two models with clear ownership at each stage do not.

The Loop

Each stage has one owner. The first three are Opus. The last four are Codex. The handoff happens at the plan boundary, where the architect hands the spec to the builder and gets out of the way.


+--------------------------------------+
| 1. BRAINSTORM            Opus 4.7    |
| shape the problem                    |
+--------------------------------------+
                  |
                  v
+--------------------------------------+
| 2. INITIATIVE            Opus 4.7    |
| frame scope and goals                |
+--------------------------------------+
                  |
                  v
+--------------------------------------+
| 3. PLAN                  Opus 4.7    |
| contracts, waves, ACs                |
+--------------------------------------+
                  |
                  v
+--------------------------------------+
| 4. PLAN REVIEW       GPT-5.5 Codex   |
| stress test the plan                 |
+--------------------------------------+
                  |
                  v
+--------------------------------------+
| 5. EXECUTION         GPT-5.5 Codex   |
| implement + tests                    |
+--------------------------------------+
                  |
                  v
+--------------------------------------+
| 6. CODE REVIEW       GPT-5.5 Codex   |
| find regressions                     |
+--------------------------------------+
                  |
                  v
+--------------------------------------+
| 7. FIXER             GPT-5.5 Codex   |
| repair, rerun checks                 |
+--------------------------------------+
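
As a rough sketch, the ownership table above can be written down as plain data. This is illustrative only, not the actual tooling behind the loop: `Stage`, `handoff_index`, `run_loop`, and `fake_model` are hypothetical names, the owners are just labels, and no real model API is called.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    owner: str  # a plain label; this sketch never calls a real model API
    job: str


OPUS = "Opus 4.7"
CODEX = "GPT-5.5 Codex"

# The seven stages and owners, copied from the diagram.
LOOP: List[Stage] = [
    Stage("brainstorm", OPUS, "shape the problem"),
    Stage("initiative", OPUS, "frame scope and goals"),
    Stage("plan", OPUS, "contracts, waves, ACs"),
    Stage("plan review", CODEX, "stress test the plan"),
    Stage("execution", CODEX, "implement + tests"),
    Stage("code review", CODEX, "find regressions"),
    Stage("fixer", CODEX, "repair, rerun checks"),
]


def handoff_index(stages: List[Stage]) -> Optional[int]:
    """Return the index of the first stage whose owner differs from the one before it."""
    for i in range(1, len(stages)):
        if stages[i].owner != stages[i - 1].owner:
            return i
    return None


def run_loop(
    stages: List[Stage],
    task: str,
    call_model: Callable[[str, str, str], str],
) -> str:
    """Pass one artifact through every stage in order; each stage's output feeds the next."""
    artifact = task
    for stage in stages:
        artifact = call_model(stage.owner, stage.name, artifact)
    return artifact


def fake_model(owner: str, stage_name: str, artifact: str) -> str:
    """Hypothetical stand-in for a model call: it only records who touched the artifact."""
    return f"{artifact} -> {stage_name}[{owner}]"


if __name__ == "__main__":
    i = handoff_index(LOOP)
    print("handoff before stage:", LOOP[i].name)
    print(run_loop(LOOP, "idea", fake_model))
```

Keeping the owner as a field on the stage, rather than hard-coding it in the runner, means swapping a seat is a one-line change to the table.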

Where Opus wins

Opus is the model I want when the work is shape-finding. Brainstorm sessions where I am not yet sure what the right system looks like. Initiatives where the question is “what are we building, and what are we explicitly choosing not to build.” ADR sessions where two reasonable options exist and someone has to commit to one.

I have watched Codex botch design jobs that Opus then walked through cleanly. Codex is not dumb. Codex just wants to start typing. Opus is comfortable sitting with the question longer, and that is the trait that matters at the front of the loop.

The same instinct shows up on exploratory ADRs. Opus pushes back. It asks the second question. It writes the version of the doc that survives someone else reading it next quarter. That is the kind of work I do not want to lose to “ship faster.”

Where Codex wins

Codex wins when the work needs miles. Implementation, tests, verification, fix loops, all of it. Hand it a clear contract and it goes from spec to merged change without complaint.

The place I notice it most is debugging. When I am stuck on a complex race condition, a weird state bug, or a regression that hides on the third reproduction, Codex is the one who finds it. It will read more code than Opus, run more probes than Opus, and re-attempt more times before throwing in the towel.

It also does not sulk. If fix one fails, it tries fix two with the same energy. That sounds small. In a long debugging session it is the difference between landing the fix and walking away tired.
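
The "try fix two with the same energy" behavior maps onto a bounded retry loop: run the checks, hand the failures to a fixer, rerun. Below is a minimal sketch under stated assumptions. `fix_loop`, `run_checks`, and `propose_fix` are hypothetical names, and the toy checker and fixer at the bottom stand in for a real test suite and a real model call.

```python
from typing import Callable, List, Tuple


def fix_loop(
    code: str,
    run_checks: Callable[[str], List[str]],
    propose_fix: Callable[[str, List[str]], str],
    max_attempts: int = 3,
) -> Tuple[str, int, bool]:
    """Rerun checks and apply fixes until the checks pass or attempts run out.

    Returns (final_code, fixes_applied, passed).
    """
    fixes_applied = 0
    failures = run_checks(code)
    while failures and fixes_applied < max_attempts:
        code = propose_fix(code, failures)
        fixes_applied += 1
        failures = run_checks(code)  # always rerun checks after a fix
    return code, fixes_applied, not failures


# Toy stand-ins: a "check" fails while the word "bug" is present,
# and each "fix" removes exactly one occurrence.
def toy_checks(code: str) -> List[str]:
    return ["bug found"] if "bug" in code else []


def toy_fix(code: str, failures: List[str]) -> str:
    return code.replace("bug", "ok", 1)


if __name__ == "__main__":
    print(fix_loop("bug bug", toy_checks, toy_fix))
```

The attempt cap is the design choice that matters here: persistence is useful, but the loop still has to stop and report failure instead of spinning forever.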

If “which llm is best for coding” means raw shipping velocity, Codex is the obvious answer. It turns minutes into merged code, and it does it cycle after cycle.

Where each one fails

Opus is great at almost every task. Its weak spot is big codebases, where it sometimes gets lost and produces solutions that need a refinement pass afterward. The architect impulse is what makes Opus strong at the front of the loop and clumsy at the back. Opus can handle a five-line change just fine, but it needs to be contained, because it will sometimes take the initiative to rework code you did not ask it to touch.

Codex is also great, just less of a free thinker. Big-picture work, exploratory design, ambiguous prompts: none of that is its lane. It is the ultimate workhorse for well-defined projects. Hand it a clear task inside a clean structure and it rarely misses the mark or needs a second pass.

This is why I do not believe in single-model evangelism. Both models are excellent. Both are wrong in predictable ways, and the loop exists to catch each one with the other.

The verdict

If you make me answer “best llm for coding” or “best ai model for coding” in one line, here it is. Use GPT-5.5 Codex when you need raw implementation velocity. Use Opus 4.7 when you need bigger-picture thinking and real design work. Stop pretending those are the same job.

If you cannot support role splitting yet, start where your pain is loudest.

If the pain is design quality, ambiguous scope, or unclear direction, start with Opus.
If the pain is throughput, debugging, or fix-loop drag, start with Codex.
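
Written as a lookup, the rule is small enough to fit in a few lines. This is only a sketch of the two cases above; `starting_model` and the two sets are hypothetical names, and the pain labels are taken directly from the two lines above.

```python
from typing import Optional

OPUS_PAINS = frozenset({"design quality", "ambiguous scope", "unclear direction"})
CODEX_PAINS = frozenset({"throughput", "debugging", "fix-loop drag"})


def starting_model(pain: str) -> Optional[str]:
    """Map the loudest pain to a starting model; None means the rule does not cover it."""
    key = pain.strip().lower()
    if key in OPUS_PAINS:
        return "Opus 4.7"
    if key in CODEX_PAINS:
        return "GPT-5.5 Codex"
    return None


if __name__ == "__main__":
    print(starting_model("debugging"))
    print(starting_model("ambiguous scope"))
```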

Then graduate to a two-model loop. That is where the real lift shows up, because each model gets to do the work it is best at and the other one catches its blind spots.

A note on the search side. People still type “opus 4.6 vs gpt 5.4” even though the conversation has moved to 4.7 and 5.5, and I still watch OpenRouter rankings because market behavior matters. That search behavior is not noise. It tells me what people actually want: not ideology, just role reliability.

So my committed take as of May 2026: Opus is the architect. GPT-5.5 Codex is the cracked dev who lives in his mom’s basement and gets shit done. Put them in a disciplined loop. Let each do the job it is best at. Judge outcomes by what ships without waking you up tomorrow.

Andrew Bembridge