We rebuilt the structured output problem one layer up

In late 2023, every JSON-extracting system prompt I shipped for production tooling looked roughly like this:

```
You are a JSON API. Respond ONLY with valid JSON.
Do not include explanations. Do not wrap in markdown.
Do not say "Here is the JSON".
The output is parsed by a strict parser.
If the JSON is invalid, downstream systems break.
Return only the JSON object.
```

I am embarrassed by every line of it. I am also, in early 2026, still using a near-identical version in three production systems, because it works and because nothing else available at the time worked better.

That prompt is an artifact of an era, and the era has a story worth telling because it's already repeating itself.

A field guide to the era

The structured-output era runs roughly from late 2022 to August 2024. Here is what shipped, with verified dates:

| Date | Artifact | What it solved | What it cost |
|------|----------|----------------|--------------|
| Nov 10, 2022 | Microsoft Guidance | First control-flow language for LLM generation | Tightly coupled to the inference loop |
| Mar 17, 2023 | Outlines (dottxt-ai) | Grammar/regex-constrained sampling | Required control over the sampler |
| May 6, 2023 | Glazkov's "Schemish" post | "Use JSON Schema as reasoning rails" pattern | Still relied on the model cooperating |
| Jun 13, 2023 | OpenAI function calling | First major API-level structured-output mechanism | Schema conformance was best-effort |
| Jun 14, 2023 | jxnl/Instructor | Pydantic models as the API contract; auto-retry on validation | A wrapper around prompts, fundamentally still coercion |
| Jul 2023 | llama.cpp grammar sampling | GBNF-constrained decoding for open-source models | Local-only, complex grammar files |
| Nov 6, 2023 | OpenAI JSON mode (DevDay) | response_format: json_object; model trained to emit valid JSON | Schema-aware? No. Just "valid JSON." |
| Late Dec 2023 | Eric Hartford's Dolphin "kitten" prompt; Theia Vogel's tipping experiment | Demonstrated the absurd lengths people went to for instruction-following | Made the era look ridiculous in retrospect |
| Feb 2024 | Hamel Husain's "Fuck You, Show Me The Prompt" | Skepticism: most "structured output" libraries are prompt manipulation | Became the canonical counter-narrative |
| May 2024 | Anthropic Claude tool use (GA) | Tools as first-class citizens, parallel to function calling | Two incompatible tool-calling shapes now |
| Aug 6, 2024 | OpenAI Structured Outputs (strict schemas) | "100% reliability" for JSON Schema conformance on gpt-4o-2024-08-06 | The end of phase one |
| Nov 2024 | Anthropic Model Context Protocol | Tool-and-resource standardization across providers | Phase two begins; new shape, same coercion |
| Dec 4, 2024 | Pydantic AI | Pydantic-validated agent loops | Wrapper layer around tool calling |
| Nov 2025 | Anthropic "Code execution with MCP" | Tool calls don't scale; agents should write code instead | A retreat to a more natural abstraction |

If you were paying attention in 2023, every row in that table felt like progress. We were inventing the abstractions at the same time we were using them in production. The Instructor library was created on June 14, 2023, the day after OpenAI launched function calling on June 13. That is the pace.

Looking back from 2026, the table reads as a single arc with a beginning, a middle, and an apparent end. The beginning is "ask the model nicely, then more nicely, then with threats." The middle is a Cambrian explosion of libraries trying to put structural rails on prompt engineering. The apparent end is OpenAI's August 2024 launch of strict-schema Structured Outputs, which claimed 100% schema conformance on gpt-4o-2024-08-06.

Phase one closed with confetti. We had won the war.

Then phase two started, and we noticed it was the same war.

The libraries that mattered

The libraries weren't the headline at the time. The model launches were. But the libraries were doing the actual work of teaching us what structured output is, and they're worth naming.

Microsoft Guidance was the earliest of them, its repo created in November 2022, weeks before ChatGPT even launched. Guidance pioneered the idea of treating LLM output as something you compose with control flow, regex constraints, and grammars. Most of the patterns we now take for granted (JSON-from-grammar, structured generation as opposed to structured prompting) trace back here.

Outlines, launched in March 2023 and now maintained by .txt, was the cleanest expression of grammar-constrained decoding. The thesis was straightforward: if you can write a grammar for the output, the sampler should refuse to emit anything that violates it. This is a profoundly correct idea that, three years later, is finally becoming the default in open-source inference engines like vLLM, via grammar backends such as XGrammar.
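
If the thesis feels abstract, here it is as a from-scratch toy: a random-score "model," a hand-rolled DFA standing in for a compiled grammar, and a mask applied to the scores before sampling. This is a sketch of the mechanism, not the Outlines API.

```python
# Toy of the constrained-decoding thesis: the *sampler* masks every
# token that would violate the grammar, so invalid output is
# unrepresentable. A hand-rolled DFA and fake logits stand in for
# a real grammar compiler and a real model.
import random

VOCAB = list('abcdefghijklmnopqrstuvwxyz"{}: 0123456789')

# DFA for the pattern  "<one or more lowercase letters>"
# state 0: expect opening quote; 1: expect first letter;
# 2: letter or closing quote; 3: accept (stop).
def step(state: int, ch: str) -> int | None:
    if state == 0 and ch == '"':
        return 1
    if state in (1, 2) and ch.isalpha():
        return 2
    if state == 2 and ch == '"':
        return 3
    return None  # transition forbidden by the grammar

def constrained_sample() -> str:
    state, out = 0, []
    while state != 3:
        scores = {ch: random.random() for ch in VOCAB}  # fake logits
        legal = {ch: s for ch, s in scores.items()
                 if step(state, ch) is not None}        # the mask
        ch = max(legal, key=legal.get)                  # greedy pick
        out.append(ch)
        state = step(state, ch)
    return "".join(out)

print(constrained_sample())  # always a quoted word, never invalid
```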

Instructor, shipped by jxnl on June 14, 2023, took a different path. It wrapped OpenAI's brand-new function calling and pretended the result was structured output. Define a Pydantic model, get a Pydantic model back. Retries handled. Validation handled. It was the right abstraction at the right moment, and it's part of why the term "structured outputs" stuck as the framing for the whole field.
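
The whole contract fits in a dozen lines. A minimal sketch, assuming instructor 1.x and an OpenAI key in the environment (the 2023-era entry point was instructor.patch rather than instructor.from_openai):

```python
# The Instructor contract: define a Pydantic model, get a validated
# instance back; retries on validation failure are handled for you.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=User,   # the Pydantic model *is* the API contract
    max_retries=2,         # re-ask with the validation error on failure
    messages=[{"role": "user", "content": "Extract: Jason is 25."}],
)
print(user)  # User(name='Jason', age=25)
```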

llama.cpp added grammar-based sampling in July 2023, and that's when the conversation about constrained decoding got serious in the open-source world. The fact that you could ship a four-line grammar file and force a 7B model to emit perfect JSON every time was, at the time, witchcraft.
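
The Python-side view of the same trick, via the llama-cpp-python bindings. A sketch assuming a local GGUF model at a placeholder path: a few lines of GBNF and the model physically cannot emit anything outside this shape.

```python
# GBNF grammar-constrained decoding with llama-cpp-python.
from llama_cpp import Llama, LlamaGrammar

GBNF = r'''
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z ]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="./models/7b.gguf")  # any local GGUF model
grammar = LlamaGrammar.from_string(GBNF)

out = llm('JSON for a user named Ada: ', grammar=grammar, max_tokens=64)
print(out["choices"][0]["text"])  # e.g. {"name": "Ada"}
```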

LangChain's output parsers, especially OutputFixingParser, standardized the retry-and-repair loop: parse, fail, send the validation error back to the model, ask it to fix the JSON, parse again. That pattern is everywhere now. It's also a lot of what people complain about when they complain about LangChain.
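
Stripped of the framework, the pattern is small enough to hand-roll. A sketch with a placeholder call_model function standing in for any chat-completion client; the key move is feeding the parser's error back to the model as the next message.

```python
# The retry-and-repair loop that OutputFixingParser standardized,
# reduced to its skeleton.
import json

def parse_with_repair(call_model, prompt: str, attempts: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(attempts):
        reply = call_model(messages)
        try:
            return json.loads(reply)          # parse
        except json.JSONDecodeError as err:   # fail
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user",      # send the error back, re-ask
                 "content": f"Invalid JSON ({err}). Return corrected JSON only."},
            ]
    raise ValueError("model never produced parseable JSON")
```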

These libraries split into three approaches. Constrained decoding (Outlines, Guidance, llama.cpp grammars) intervenes at sampling time. Schema-driven wrappers (Instructor, later Pydantic AI) lean on the underlying API and add validation. Retry-and-repair loops (LangChain, Guardrails) treat the LLM as an unreliable black box and validate after the fact. All three still ship.

The fork in the road

By mid-2024 the field had split into three camps that didn't always realize they were having the same argument.

The constrained-decoding camp said: structure should be enforced at sampling time, by the inference engine, before the model can emit anything wrong. Outlines and llama.cpp grammars are this. So is OpenAI's Structured Outputs feature, which almost certainly uses grammar-constrained decoding under the hood; the same family of techniques powers llguidance and XGrammar.

The instruction-tuned camp said: train the model to emit schema-conformant output. OpenAI's function calling was the first major implementation. Anthropic's tool use, when it went GA in May 2024, was the second. The argument was that for closed-source APIs you don't control sampling, so you have to bake schema awareness into the model itself.
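
On the wire, that camp's interface looks like this: you hand the API a JSON Schema in the tools array, and the model fills it by training rather than by sampler constraint. A sketch against OpenAI's current chat completions shape, assuming an API key in the environment:

```python
# Function calling: schema in, schema-shaped arguments out, but
# (pre-strict-mode) conformance was best-effort, not guaranteed.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Weather in Nairobi?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # force the call so the demo is deterministic
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(resp.choices[0].message.tool_calls[0].function.arguments)
# '{"city": "Nairobi"}'
```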

The wrapper camp said: it doesn't matter how the model emits the JSON, you still need validation, retries, and provider-agnosticism. Instructor and Pydantic AI (publicly launched December 4, 2024) are the cleanest expressions of this.

What actually won? All three did, layered on top of each other. For most app developers using closed APIs, OpenAI Structured Outputs and Anthropic tool use are the default. Underneath them, constrained decoding is the implementation. On top of them, validation wrappers like Instructor handle retries and content checks the schema can't enforce.
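
The top of that stack, strict-schema Structured Outputs, looks like this. A sketch assuming gpt-4o-2024-08-06 and an API key in the environment; note that strict mode requires additionalProperties: false and every property listed in required.

```python
# Phase one's endpoint: the same JSON Schema, but with "strict": true
# the API enforces it via constrained decoding rather than asking nicely.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract: Jason is 25."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {"name": {"type": "string"},
                               "age": {"type": "integer"}},
                "required": ["name", "age"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)  # guaranteed schema-valid JSON
```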

Hamel Husain wrote the canonical skeptical essay about wrappers in February 2024 (Fuck You, Show Me The Prompt), and he was right that many of these libraries are mostly prompt manipulation. He was also wrong that this means they're not useful. Prompt manipulation, well-engineered and validated, was the bridge that got us from "respond ONLY in valid JSON" to schema-strict APIs. Bridges are useful. They're also temporary.

Three lessons we learned

The era taught us things, and the things are worth naming clearly because they're falling out of working memory already.

The model wasn't broken; the interface was

In early 2023 the consensus was that LLMs "couldn't follow instructions reliably." By late 2024 the consensus was that LLMs "follow JSON schemas with 100% reliability." The models hadn't fundamentally changed in that window. The interface had. When a model seems unreliable at a task, the most productive question to ask is "what does the API surface look like?" before "is the model good enough?"

Every workaround eventually becomes infrastructure

The retry-and-repair loop was a hack. It became LangChain's OutputFixingParser, then Instructor's tenacity-backed retries, then a built-in part of Pydantic AI. The "respond ONLY in valid JSON" prompt was a hack. It became, near-verbatim, the default system-prompt example in OpenAI's own documentation for years. The lesson is to take your hacks seriously, because the half-life of a "temporary" workaround in this field is approximately five years.

The constraint doesn't disappear, it moves up the stack

The 2022 problem was: how do I get this model to emit a parseable JSON object? By 2024 we had solved that. The 2025 problem became: how do I get this model to pick the right tool from a list of 250, with each tool's schema preloaded into the system prompt, without blowing my context window before the model has emitted a single token? The shape of the problem didn't change. The layer changed. We're in the middle of the same pattern again.

Why we're forgetting the lessons

If the lessons of the structured-output era were propagating, the tool-calling and MCP era would look different. It doesn't.

The clearest public example: in late 2025, Anthropic published an engineering post titled Code execution with MCP: building more efficient AI agents. The argument, in their own framing, is that direct tool calls don't scale because each tool definition consumes context, and a five-server MCP setup with 58 tools can burn ~50,000 tokens before the user has typed anything. Their proposed fix is to let the model write code that calls tools, instead of expecting the model to pick from a flat list of pre-loaded tool schemas. They report that lazy-loaded tool discovery improves Claude Opus 4.5 task accuracy from 79.5% to 88.1% while cutting tokens by roughly 85%.
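
To make the contrast concrete, here is a hypothetical sketch of the lazy-discovery shape. The names (TOOL_REGISTRY, search_tools, invoke) are invented for illustration and stand in for MCP's actual machinery:

```python
# Instead of preloading 250 tool schemas into context, expose one
# search entry point and let the model write code against what it
# finds. Hypothetical sketch, not MCP's real API.
TOOL_REGISTRY = {  # stands in for the MCP servers' full catalogs
    "crm.find_customer": "Look up a customer record by email",
    "billing.refund": "Issue a refund for an invoice",
    # ... 248 more definitions that never enter the context window
}

def search_tools(query: str) -> dict[str, str]:
    """The only tool loaded upfront: returns matching definitions."""
    return {name: desc for name, desc in TOOL_REGISTRY.items()
            if query.lower() in desc.lower()}

def invoke(name: str, **kwargs):
    """Dispatch stub; a real agent runtime would call the MCP server."""
    print(f"calling {name} with {kwargs}")

# Code the *model* would write after discovering what it needs:
hits = search_tools("refund")
invoke(next(iter(hits)), invoice_id="inv_123", reason="duplicate charge")
```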

That paragraph deserves to be re-read. Anthropic, the company that designed MCP, is publicly arguing that the way MCP currently works has a fundamental scaling problem, and the fix is to abandon flat tool lists in favor of letting the model write code. That maps cleanly onto the lessons above. The model isn't bad at tool calling; the interface is. The "load all your tools upfront" pattern is becoming infrastructure even though it was always a hack. The constraint moved up: from "make the model emit valid JSON" to "make the model pick the right tool from a flat list of 250."

A second example, with a useful counterpoint: Waleed K's piece The Evolution of AI Tool Use: MCP Went Sideways makes a related observation about MCP's context-bloat problem with concrete numbers and a concrete war story. He argues, drawing on Cloudflare's framing, that LLMs are "bad at tool calling" because tool-call traces are "out-of-distribution" for the base models. I think he's half right. The base distributions are absolutely thin on canonical tool-call traces, which is exactly why model providers fine-tune for tool use and why Anthropic's lazy-loaded tool search lifts Opus 4.5 from 79.5% to 88.1%.

But the framing "models can't do tool calling" misses the same point we missed in 2022 about JSON. The model isn't bad at the task. We've handed it a clumsy interface to the task. Code execution feels like a fix because it routes around the clumsy interface and lets the model do something it has seen trillions of tokens of: code. That's the same insight as "use a JSON Schema grammar." It's "the interface was wrong, again." The lesson generalizes, and it's the lesson worth carrying forward.

A third example, drawn from less rigorous evidence: production systems I've worked on in 2026 still ship system prompts that say things like "you are a backend tool router. Output ONLY a single tool call. Do not explain. Do not apologize." Strip the word "tool" and replace with "JSON object" and you have my 2023 prompt. Same shape, same hack, new layer.

What the next "JSON repair library" looks like

Three predictions, with low hedging.

First, tool-search-as-routing becomes the default agent design pattern within twelve months. Flat tool lists in the system prompt will look as embarrassing in 2027 as "respond ONLY in valid JSON" looks now. The tool-search tool is the new JSON-Schema strict mode.

Second, agent-as-code-author beats agent-as-tool-picker for any non-trivial workflow. Anthropic's MCP-code-execution post is the most prominent signal, but the same pattern shows up in the way Claude Code agents get work done internally and in OpenAI's evolving Responses API semantics. We'll look back at "load 250 tools into the system prompt" the same way we look back at the kitten prompt.

Third, whatever standard replaces or absorbs MCP will be a protocol-of-protocols: a thin layer that brokers between code-executing agents and the underlying tool servers, rather than a flat schema dump. MCP solved the "describe your tools" problem. The next standard has to solve "let the model discover tools as needed without paying upfront context."

I have been writing system prompts that look like 2023 prompts again, this time around tool calling, and so has everyone else. The fix is the same one we figured out the first time: stop coercing, change the interface.

If you're shipping AI features that hit this same wall, SimbaStack builds the systems that route around it.

NJ

NJ runs SimbaStack, an AI consulting and development studio shipping agent-native systems for SMBs. He also builds KaribuKit, an AI-native hotel platform that started in safari lodges; Mara Hilltop, the safari lodge that proves the platform out; and SlopIt, the agent-first publishing API. Based in Kenya.

#frontier #mcp #tool-calling #structured-outputs #llm-history