I Let GPT-5 Build It. Codex Found the Bugs.
What That Taught Me About Choosing the Right LLM
I let GPT-5 build it. Codex found the bugs.
That’s the moment I stopped believing one model can do it all.
It wasn’t about prompts. It was about fit. GPT-5 could architect a full app, but Codex — tuned specifically for software reasoning — caught the subtle design flaws GPT-5 missed. That contrast became the clearest example I’ve seen of why picking the right model for the right task is everything.
Artificial intelligence isn’t one giant brain. It’s a collection of specialized systems, each built with a different purpose.
But somewhere between the hype cycles and the endless “AI will change everything” headlines, people started treating large language models as if they’re interchangeable. They drop GPT, Claude, Gemini, or LLaMA into their stack and expect them to perform the same. They don’t.
Choosing the right model for the right task is the new differentiator — the line between reliable results and avoidable failure. The right model delivers accuracy, speed, and context. The wrong one burns time, budget, and trust.
Model Fit Defines Reliability
Every model is trained with a goal in mind. The training data, objectives, and reinforcement methods shape how it performs in real-world use.
GPT-5 Codex (OpenAI)
Deep reasoning in code
Best for: software development, automation
Claude 2.1 / 3.5 / 4 family (Anthropic)
Long-context comprehension and reliability
Best for: research, contract review, communication
Gemini 2.5 Pro (Google)
Multimodal reasoning
Best for: data and document analysis
BloombergGPT
Domain-specific precision
Best for: finance and regulatory work
LLaMA 3 / Code LLaMA (Meta)
Fine-tuned control
Best for: cost-sensitive or on-prem AI systems
Reliability comes from alignment: a model’s training must match the task. GPT-5 Codex is trained on codebases and understands dependencies. Ask it to draft ad copy and you’ll get grammatically sound but flat sentences; ask a general chat model to refactor a service and you’ll get plausible-looking but less functional code.
Claude models, meanwhile, are tuned to minimize hallucinations. When uncertain, they’ll say “I’m not sure” — a feature, not a flaw, for work where accuracy matters more than personality.
When the Right Model Delivers
Coding and Automation
OpenAI’s GPT-5 Codex remains a standout for tooling, debugging, and multi-file refactoring. It finds logic issues a general model misses.
While testing apps I had vibe-coded with GPT-5 and others, Codex surfaced missing dependency injections, redundant API calls, and malformed JSON payloads — design flaws, not syntax. Codex didn’t just autocomplete code; it understood it.
Anthropic’s Claude Opus 4.x and Sonnet 4.x score well on SWE-Bench, but in real-world debugging Codex delivers stronger implementation-level fixes, while Claude excels at explaining intent.
Beyond Code: The Rise of Non-Coding LLMs
Non-coding models like Claude Sonnet 4, GPT-5 (base), and Gemini Pro outperform code models when tasks require context, judgment, or communication.
They summarize research, interpret nuance, and adapt tone better than code-trained systems. On reasoning benchmarks (MMLU, BBH, ARC) they score in the 90th percentile or higher — graduate-level comprehension.
Code models think in functions.
Reasoning models think in context.
The best results come from using both — one to build, one to explain.
Summarizing Long Documents
Claude’s higher-tier models handle about 200,000 tokens (≈ 500 pages) in a single pass. GPT-4 and GPT-5 can process long texts too, but usually by chunking, which risks losing continuity.
Chunking vs Memory
GPT-4 and GPT-5 often split inputs into smaller segments to fit context limits, which can fragment meaning.
Claude 2.1 and 4 Opus can process entire documents in a single pass.
Large context = higher cost + latency but fewer lost connections.
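The chunking trade-off above can be sketched in a few lines of Python. This is a hand-rolled illustration, not any provider’s SDK; character counts stand in for real token limits.

```python
def chunk(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows so each fits a context limit.

    Overlap preserves some continuity across boundaries, but a fact that
    spans two chunks can still be fragmented -- the failure mode a single
    large-context pass avoids.
    """
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so neighbouring chunks share context
    return chunks

# A 5,000-character document with a 2,000-character "window" needs 3 chunks.
parts = chunk("x" * 5000)
```

In a real pipeline each chunk is summarized separately and the partial summaries merged, which is exactly where continuity gets lost; a 200K-token model skips the merge step entirely.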
Legal, Financial, or Regulated Work
Recent real-world cases show why factual grounding matters.
May 2025 — Utah attorney sanctioned for citing fabricated cases (The Guardian)
July 2025 — Alabama lawyers disciplined for “completely made-up” citations (AP News)
September 2025 — California lawyer fined $10,000 after AI invented 21 of 23 citations (The Daily Record)
These cases show why legal and financial work calls for domain-tuned models like BloombergGPT or highly aligned Claude Opus / Sonnet, ideally paired with retrieval so citations can be checked against real sources.
Coding Model Benchmarks (2025)
Claude Opus 4.1 (Anthropic) – ~74.5% SWE-Bench Verified
Source: Anthropic News
Claude Sonnet 4 (Anthropic) – ~72.7% (base) / ~80% parallel compute
Source: Entelligence AI
GPT-5 Codex (OpenAI) – ~74–77% SWE-Bench / 51% refactoring vs 33% base GPT-5
Source: CodeGPT Blog
Gemini 2.5 Pro (Google) – ~63.8% SWE-Bench Verified
Source: DeepMind Blog
Code LLaMA 34B (Meta) – ~53–56% HumanEval / MBPP
Source: Open Laboratory AI
A Practical Guide for Non-Technical Users
You don’t need to understand model architectures to use AI well.
You just need to know what kind of model fits your workflow — and why the wrong one quietly wastes time or money.
Summarizing reports / research
Look for: large context windows (≥ 100 K tokens)
Models: Claude 4 Opus (200 K), GPT-5 (128 K), Gemini Pro (1 M)
Why: reduces chunking errors — the model remembers everything in one pass.
Writing blogs / marketing copy
Look for: balanced tone and reasoning
Models: Claude Sonnet 4, GPT-5 (base), Gemini Pro
Why: blends creativity with structure for mixed audiences.
Customer support / HR tasks
Look for: speed and low cost
Models: GPT-3.5 Turbo, Claude Haiku, Mistral 7B
Why: small models are cheap and effective for routine text.
Contracts / compliance reviews
Look for: source verification and low hallucination
Models: BloombergGPT, Claude Opus 4, GPT-5 + Retrieval
Why: domain LLMs handle legal and financial language safely.
Data analysis / spreadsheets
Look for: code execution and reasoning
Models: GPT-5 (Advanced Data Analysis), Gemini Flash
Why: they can calculate, chart, and summarize data directly.
Brainstorming / tone testing
Look for: creativity and empathy
Models: Claude Sonnet 4, Gemini Pro, GPT-5
Why: these models explore voice and style beyond templates.
How to Decide Without Getting Technical
Start with the problem, not the model. Ask what matters most: judgment, accuracy, or speed.
Use general models (GPT-5, Claude Sonnet 4) for mixed tasks.
Use specialized models for high-stakes domains (law, finance, code).
Don’t overpay for power you won’t use. Small models are fine for everyday work.
Remember: context window = attention span — bigger matters only for long docs.
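A rough way to apply that rule without any tooling: English prose averages about four characters per token, so you can estimate whether a document fits a given context window. This is a back-of-the-envelope heuristic; real tokenizers vary by model.

```python
def estimated_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Back-of-the-envelope token estimate (~4 characters/token for English)."""
    return int(len(text) / chars_per_token)

def fits_context(text: str, context_tokens: int) -> bool:
    # Leave ~10% headroom for the prompt and the model's reply.
    return estimated_tokens(text) <= context_tokens * 0.9

report = "word " * 180_000  # ~900,000 characters, roughly a 500-page report
print(fits_context(report, 128_000))    # → False: a 128K window is too small
print(fits_context(report, 1_000_000))  # → True: fits a 1M window in one pass
```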
In plain terms:
General models (GPT-5, Claude Sonnet, Gemini Pro) = your analyst
Domain models (Codex, BloombergGPT, Code LLaMA) = your expert consultant
Small models (Haiku, GPT-3.5, Mistral) = your efficient assistant
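That analyst / consultant / assistant split can be made concrete with a tiny routing table. Everything here is illustrative: the task names and model labels are placeholders for whatever catalog you actually run, not a real API.

```python
# Hypothetical task router mirroring the three tiers above.
ROUTES = {
    "summarize_long_report": "large-context generalist",  # e.g. Claude Opus, Gemini Pro
    "debug_codebase":        "code specialist",           # e.g. GPT-5 Codex
    "contract_review":       "domain specialist",         # e.g. BloombergGPT
    "support_ticket":        "small fast model",          # e.g. Haiku, Mistral 7B
}

def pick_model(task: str) -> str:
    # Unknown task types fall back to a generalist -- the safe default.
    return ROUTES.get(task, "general-purpose model")
```

Even this toy version encodes the article’s main point: the decision happens before the prompt is ever written.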
The Takeaway
Choosing the correct LLM isn’t a technical detail — it’s a reliability strategy.
The gap between good enough and production-ready comes down to model fit. When I watched Codex find bugs in code I’d generated with GPT-5, it reminded me that specialization matters.
Claude can reason beautifully, Gemini can plan across modalities, and open models can be tuned for privacy — but if the task is surgical debugging or logic validation, I want a model trained to think like a developer.
Used well, LLMs amplify human capability. Used poorly, they amplify human error.
The difference isn’t your prompt — it’s which brain you choose to trust.

