
Structured output: getting reliable JSON from any LLM (2026)

Why structured outputs matter, how JSON mode and schema enforcement differ, and practical patterns for getting reliable JSON from LLMs in production.

By Knovo Team · 2026-04-12 · 11 min read · Last verified 2026-04-12

Structured output is where a lot of LLM prototypes stop being demos and start becoming software.

It is easy to build a chatbot that returns fluent text. It is much harder to build a production system that emits JSON your application can trust. Pipelines break on missing keys, extra prose, malformed arrays, trailing commas, wrong enum values, and schema drift across retries. The model may look "mostly correct" to a human and still be unusable to code.

That is why structured output is not a prompt trick. It is an interface design problem. If an LLM sits inside a workflow that triggers actions, writes to a database, populates UI components, or calls downstream services, free-form prose is usually the wrong contract. What you need is validated structure.

Why unstructured output breaks pipelines

Humans are tolerant of ambiguity. Software is not.

If a support summarizer returns:

Customer is upset about delayed refund.
Priority: high
Recommended next step: escalate to billing

that may be perfectly readable to a person. But it is not enough for a system that expects:

{
  "issue_type": "refund_delay",
  "priority": "high",
  "next_action": "escalate_billing"
}

The gap matters because text that is "close enough" for a human is still broken for code.

Typical failure modes from unstructured output:

  1. The model wraps JSON in Markdown fences
  2. The model adds an explanation before or after the JSON
  3. A required field is missing
  4. A field uses the wrong type
  5. An enum value is semantically correct but not one of the allowed values
  6. Nested objects drift from the schema over time

That is why production systems should stop asking, "Did the model answer well?" and start asking, "Did the model satisfy the contract?"

[Diagram: Structured output production pipeline — prompt + schema → LLM → JSON output → validate against schema → typed data, with a retry-on-validation-error loop]
Schema-constrained generation, application-side validation, and retry loop on failure

JSON mode vs constrained decoding vs schema validation

These three ideas are related, but they are not the same thing.

JSON mode

JSON mode tells the model to output valid JSON syntax. That is useful, but limited.

If you only use JSON mode, you may get:

  1. Valid JSON with the wrong keys
  2. Valid JSON with the wrong types
  3. Valid JSON with missing required fields
  4. Valid JSON that does not match your application schema

JSON mode solves syntax reliability, not semantic reliability. It is helpful, but it is not full schema enforcement.
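The gap is easy to demonstrate: the payload below parses as valid JSON, yet fails an application schema check. The `TicketRoute` model here is hypothetical, standing in for your real contract.

```python
import json
from typing import Literal

from pydantic import BaseModel, ValidationError

class TicketRoute(BaseModel):
    priority: Literal["low", "medium", "high"]
    route: str

raw = '{"priority": "urgent", "destination": "billing"}'

data = json.loads(raw)  # succeeds: the syntax is valid JSON
print("parsed:", data)

try:
    TicketRoute.model_validate(data)
except ValidationError as exc:
    # fails: "urgent" is not an allowed priority, and "route" is missing
    print("schema errors:", exc.error_count())
```

JSON mode alone would call this response a success; schema validation correctly rejects it.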

Constrained decoding

Constrained decoding restricts what the model is allowed to generate so the output conforms more tightly to a schema or grammar.

This is much stronger than "please return JSON." It reduces the space of possible outputs during generation itself. In practice, that means fewer malformed responses and less post-processing work.

When people say structured outputs are now much more reliable, this is often what they mean. The model is not merely encouraged to follow a schema. The decoding path is constrained around it.
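A toy illustration of the idea (not a real decoder): given a set of allowed enum values, a constrained generator only considers continuations that keep the output inside the allowed set, masking everything else.

```python
def allowed_next_chars(prefix: str, allowed: list[str]) -> list[str]:
    """Characters that keep `prefix` a valid start of some allowed value."""
    return sorted({
        value[len(prefix)]
        for value in allowed
        if value.startswith(prefix) and len(value) > len(prefix)
    })

priorities = ["low", "medium", "high"]

# At the start, only the first characters of allowed values are legal.
print(allowed_next_chars("", priorities))   # ['h', 'l', 'm']

# After emitting "h", the only legal continuation is "i" (toward "high"),
# so a constrained decoder masks every other token.
print(allowed_next_chars("h", priorities))  # ['i']
```

Real implementations work over tokens and full grammars rather than characters, but the principle is the same: invalid outputs are unreachable by construction.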

Schema validation

Schema validation happens after generation. You parse the model output and check it against a schema definition, often JSON Schema or a typed model such as Pydantic.

Validation is essential even if you use vendor-native structured output features, because:

  1. Your application still needs a typed object
  2. You may support multiple model vendors
  3. Some workflows involve fallback or repair steps
  4. Validation lets you detect drift explicitly instead of guessing

In production, the safest pattern is usually not one of these in isolation. It is a layered approach:

  1. Ask for structure using the strongest vendor-native feature available
  2. Validate the result against your application schema
  3. Retry or repair if validation fails

OpenAI Structured Outputs API

OpenAI's structured output support is the cleanest version of schema-first generation in a hosted API today.

The official OpenAI docs describe Structured Outputs as a way to supply a JSON Schema so the model returns data matching that schema, rather than simply "some JSON." This is stronger than old-style JSON mode because the schema becomes part of the generation contract.

Why this matters in practice

This changes the development workflow:

  1. Define the schema first
  2. Let the model generate into that contract
  3. Parse the result as typed data

That is much better than generating free-form text and trying to recover structure with regexes or fragile parsers afterward.

What OpenAI's approach is good at

OpenAI's current structured-output approach is especially strong when:

  1. Your schema is known up front
  2. You need deterministic application fields
  3. The output will flow directly into code
  4. You want less repair logic in the happy path

It is a good fit for extraction, classification, route planning, action arguments, compliance flags, UI configs, and other machine-consumed outputs.

Where it still needs engineering

Even with Structured Outputs, you still need:

  1. Sensible schemas
  2. Validation on the application side
  3. Retry logic for provider or transport failures
  4. Monitoring around schema success rate

Strong vendor support reduces errors. It does not remove the need for production discipline.

Anthropic tool use for schema enforcement

Anthropic's strongest structured-output pattern is tool use.

In the official tool-use docs, Anthropic defines tools with an input_schema field using JSON Schema. In practice, this means you can describe the shape of the tool arguments you want, and the model responds by producing tool input that matches that schema instead of improvising prose.

This is not exactly the same product surface as OpenAI's Structured Outputs, but the practical result is similar: you get structured, machine-usable data with a schema contract.
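For example, a ticket-summary extraction can be exposed as a tool whose `input_schema` carries the contract. The tool name and fields below are illustrative, but the shape follows Anthropic's documented tool-definition format.

```python
# An Anthropic-style tool definition: the JSON Schema in `input_schema`
# constrains the tool arguments the model produces.
ticket_tool = {
    "name": "record_ticket_summary",
    "description": "Record a structured summary of a support message.",
    "input_schema": {
        "type": "object",
        "properties": {
            "issue_type": {
                "type": "string",
                "enum": ["refund_delay", "bug_report", "account_access", "other"],
            },
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "next_action": {"type": "string"},
        },
        "required": ["issue_type", "priority", "next_action"],
    },
}
```

Passed in the `tools` parameter of a messages request, the model's tool-use response then carries input matching this schema, which your application validates like any other payload.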

Why tool use is a strong structured-output pattern

Anthropic's tool pattern is especially good because it aligns structure with action.

If the model is going to:

  1. create a ticket
  2. classify a message
  3. extract entities
  4. plan a workflow step

then tool arguments are often the cleanest place to enforce schema. The output is not "text about the action." It is the typed payload for the action.

When Anthropic tool use is the right choice

This pattern is strongest when:

  1. The result naturally maps to a tool or function call
  2. You want schema enforcement plus action routing
  3. You are already building agent-style workflows
  4. Your application treats structure as input to another system

If your use case is extraction without downstream tools, you can still use tool definitions as a structured-output interface. That often works better than trying to parse plain assistant text.

The Pydantic + Instructor pattern

Vendor-native structured output is useful, but many teams want a cross-provider application layer. That is where the Pydantic + Instructor pattern is valuable.

Instructor is a Python library built around typed output models. Its official docs describe using Pydantic models as the desired response shape and letting the library handle parsing and retries around provider calls.

This is a strong pattern because it moves your real contract into application code.

Why Pydantic helps

Pydantic gives you:

  1. Required fields
  2. Type validation
  3. Enums
  4. Nested models
  5. Useful validation errors

That means your schema is not scattered across prompt text. It lives in code, where the rest of your application can depend on it.
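A minimal contract in code might look like this (the field names are illustrative). Note how a bad enum value and a missing nested field both produce precise, machine-readable errors instead of silent drift.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
    name: str
    region: Literal["emea", "amer", "apac"]

class Escalation(BaseModel):
    priority: Literal["low", "medium", "high"]
    reason: str = Field(min_length=1)
    customer: Customer  # nested model, validated recursively

try:
    Escalation.model_validate({
        "priority": "urgent",            # not an allowed enum value
        "reason": "duplicate charge",
        "customer": {"name": "Ada"},     # missing required "region"
    })
except ValidationError as exc:
    for err in exc.errors():
        print(err["loc"], err["type"])
```

Each error carries the exact location and failure type, which is what makes the retry loop in the next section possible.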

Why Instructor helps

Instructor adds a practical wrapper around this pattern by:

  1. Calling the model
  2. Parsing the response into the Pydantic model
  3. Retrying when validation fails

That is exactly the sort of glue production teams end up writing anyway. Using a library for it reduces repeated plumbing.

When this pattern is best

The Pydantic + Instructor pattern is especially useful when:

  1. You support multiple vendors
  2. Your typed schema matters more than a provider-specific feature surface
  3. You want validation and retries unified in Python
  4. Your backend already uses Pydantic models

It does not replace vendor-native features. It gives you a strong application layer over them.

Schema design matters more than people think

Many structured-output problems are not really model problems. They are schema problems.

If your schema is confusing, over-nested, inconsistent, or semantically vague, even a strong model will struggle to produce clean output. Teams often blame the provider when the deeper issue is that the contract itself is badly designed.

Keep schemas narrow

A production schema should be as small as the workflow allows.

Every extra field creates more failure surface:

  1. another place for enum drift
  2. another missing-value risk
  3. another business rule to validate
  4. another prompt burden on the model

This is why "return everything we might ever want" is usually a bad idea. The best schema is often not the most complete one. It is the smallest one that supports the downstream system.

Prefer explicit enums over vague text

If the application needs routing logic, do not ask for open-ended descriptions where a controlled enum would do.

Weak:

{"priority_reason": "This looks pretty urgent"}

Stronger:

{"priority": "high"}

You can still keep a free-text explanation field if humans need it. But machine logic should depend on constrained values whenever possible.

Separate machine fields from human explanation

One of the cleanest production patterns is to split outputs into:

  1. fields the system needs for logic
  2. fields humans may want for explanation

For example:

  1. route = "billing"
  2. confidence = "high"
  3. explanation = "The user mentions duplicate charges and refund delay."

That pattern keeps the machine-facing contract stable without sacrificing interpretability for operators or reviewers.
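In schema terms, that split can be as simple as the hypothetical model below: constrained fields for logic, one free-text field for humans.

```python
from typing import Literal, Optional

from pydantic import BaseModel

class RoutingDecision(BaseModel):
    # Machine-facing fields: constrained enums, safe to branch on
    route: Literal["billing", "technical", "account", "other"]
    confidence: Literal["low", "medium", "high"]
    # Human-facing field: free text, never used in application logic
    explanation: Optional[str] = None

decision = RoutingDecision(
    route="billing",
    confidence="high",
    explanation="The user mentions duplicate charges and refund delay.",
)

# Application logic depends only on the constrained fields.
if decision.route == "billing" and decision.confidence == "high":
    print("escalate to billing")
```

If the explanation drifts stylistically between model versions, nothing breaks, because no code path reads it.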

Avoid deep nesting unless the domain truly needs it

Deeply nested JSON looks elegant in design docs and often behaves badly in practice.

If the model has to build three layers of nested arrays and objects, failures increase:

  1. required children go missing
  2. ordering becomes inconsistent
  3. retries become harder to reason about

Flatten where you can. Nest only where it reflects the actual domain model and reduces ambiguity.

Retry and repair loops

Even with schema-aware generation, structured output should be treated as a reliability pipeline, not a single API call.

Why? Because failures still happen:

  1. Provider responses time out
  2. The model chooses a near-miss enum value
  3. A field violates a business rule the JSON Schema did not capture
  4. An upstream prompt change causes subtle drift

That is where retry and repair loops come in.

Retry

A retry loop asks the model again, usually with:

  1. the original task
  2. the validation error
  3. the same schema

This works best when the first failure is a minor mismatch rather than a fundamental misunderstanding of the task.

Repair

A repair loop tries to fix malformed or semantically wrong output using either:

  1. a cheaper model
  2. the same model with a narrower instruction
  3. deterministic code when the issue is simple

Repair is useful when the output is close enough to salvage. For example:

  1. strip Markdown fences
  2. coerce simple types
  3. map synonymous enum labels
  4. fill missing optional fields with defaults
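A deterministic first-pass repair layer can be sketched as follows. The synonym map is an assumption you would tune per schema, and it should stay narrow: only well-understood, reviewed mappings belong in it.

```python
import json
import re

# Safe synonym map: only reviewed, unambiguous mappings belong here.
PRIORITY_SYNONYMS = {"urgent": "high", "critical": "high", "minor": "low"}

def strip_fences(text: str) -> str:
    """Remove Markdown code fences like ```json ... ``` around a payload."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, flags=re.DOTALL)
    return match.group(1) if match else text.strip()

def repair_payload(text: str) -> dict:
    data = json.loads(strip_fences(text))
    # Map known-safe enum synonyms to allowed values.
    if data.get("priority") in PRIORITY_SYNONYMS:
        data["priority"] = PRIORITY_SYNONYMS[data["priority"]]
    # Fill a missing optional field with a default.
    data.setdefault("next_action", "review")
    return data

raw = '```json\n{"priority": "urgent"}\n```'
print(repair_payload(raw))  # {'priority': 'high', 'next_action': 'review'}
```

Anything this layer cannot fix deterministically should fall through to a retry, not to a more aggressive repair.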

The danger is over-repair. If you repair too aggressively, you can hide real model failures instead of surfacing them.

The best production approach is usually:

  1. parse strictly
  2. repair only small, well-understood issues
  3. retry on true schema failures
  4. alert if failure rates climb

Observability and testing for structured output

Structured output should be monitored like any other production interface.

If a normal API started returning malformed payloads 6% of the time, nobody would call that acceptable. LLM output contracts should be held to the same standard.

Metrics that actually matter

The most useful metrics are usually:

  1. schema pass rate
  2. retry rate
  3. repair rate
  4. business-rule failure rate
  5. average attempts per successful call

These metrics help you distinguish between prompt issues, model drift, schema design problems, and provider instability.

For example:

  1. high parse failures usually mean syntax or contract issues
  2. high business-rule failures usually mean the schema is too weak
  3. rising retry rates after a prompt edit usually mean the prompt shifted the model away from the contract
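With per-call logging in place, these rates fall out of simple counters. The sketch below assumes a hypothetical per-request log record; adapt the fields to whatever your pipeline already captures.

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    attempts: int        # total model calls made for this request
    repaired: bool       # whether a deterministic repair was applied
    passed_schema: bool  # whether the final output validated

def summarize(records: list[CallRecord]) -> dict:
    n = len(records)
    successes = sum(r.passed_schema for r in records)
    return {
        "schema_pass_rate": successes / n,
        "retry_rate": sum(r.attempts > 1 for r in records) / n,
        "repair_rate": sum(r.repaired for r in records) / n,
        "avg_attempts_per_success": (
            sum(r.attempts for r in records if r.passed_schema)
            / max(1, successes)
        ),
    }

records = [
    CallRecord(attempts=1, repaired=False, passed_schema=True),
    CallRecord(attempts=2, repaired=True, passed_schema=True),
    CallRecord(attempts=3, repaired=False, passed_schema=False),
    CallRecord(attempts=1, repaired=False, passed_schema=True),
]
print(summarize(records))
```

Tracked over time and segmented by prompt version and model version, these numbers turn "the extractor feels flaky" into a diagnosable regression.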

Keep a structured-output test set

The easiest way to break a production extractor is to change the prompt or model without a test set.

A good regression set should include:

  1. normal cases
  2. edge cases
  3. ambiguous inputs
  4. adversarial formatting
  5. known failure examples from production

Then test two things separately:

  1. schema validity
  2. semantic correctness

Those are not the same. A response can validate perfectly and still classify the input incorrectly.
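The two checks can live side by side in a regression harness. In this sketch, `classify` is a stand-in for your real model call, and the routing rule inside it is purely illustrative.

```python
from typing import Literal

from pydantic import BaseModel

class RouteOutput(BaseModel):
    route: Literal["billing", "technical", "account"]

def classify(message: str) -> dict:
    # Stand-in for the real model call; returns a raw dict to validate.
    return {"route": "technical" if "crash" in message else "billing"}

cases = [
    ("I was charged twice", "billing"),          # normal case
    ("The app crashes on login", "technical"),   # normal case
]

for message, expected_route in cases:
    raw = classify(message)
    # Check 1: schema validity — does the output satisfy the contract?
    parsed = RouteOutput.model_validate(raw)
    # Check 2: semantic correctness — is the classification right?
    assert parsed.route == expected_route, (message, parsed.route)

print("all cases passed")
```

Reporting the two failure counts separately tells you whether to fix the contract or the prompt.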

Review failures as categories, not anecdotes

When structured outputs fail, do not just fix the single broken sample and move on. Label the failure type:

  1. syntax
  2. schema shape
  3. enum drift
  4. business rule
  5. misunderstanding of task

That turns debugging into system improvement instead of prompt superstition.

Common failure modes

Structured output failures are predictable. That is good news because predictable failures can be instrumented.

1. Syntax is valid, schema is wrong

This is the classic JSON mode problem. The model returns valid JSON, but the shape is wrong.

Fix:

  1. Use schema-constrained generation when available
  2. Validate with Pydantic or JSON Schema
  3. Fail fast instead of silently accepting drift

2. Enum drift

The model returns "urgent" when the schema expects "high".

Fix:

  1. Use explicit enums
  2. Show examples when the categories are subtle
  3. Add a narrow repair map only for safe synonyms

3. Missing nested fields

The outer object is correct, but nested structure is incomplete.

Fix:

  1. Keep nesting only as deep as necessary
  2. Mark truly required fields as required
  3. Retry with the validation error attached

4. Mixed prose and JSON

The model adds "Here is the JSON:" before the object or wraps it in code fences.

Fix:

  1. Prefer vendor-native structured output features
  2. Strip code fences only as a first-pass repair, not as the main solution
  3. Treat repeated occurrences as a prompt or provider-contract problem

5. Business-rule violations

The JSON matches the schema, but the result is still wrong for your application. For example, the date format is valid but outside an allowed range.

Fix:

  1. Separate schema validation from business validation
  2. Feed business-validation errors back into retries
  3. Log business-rule failure rates independently

Structured Output Failure Modes

  Failure mode                        Fix
  Syntax valid, wrong schema          Use schema-constrained generation
  Enum drift ("urgent" vs "high")     Explicit enums + repair map for safe synonyms
  Missing nested fields               Flatten depth + retry with validation error
  Mixed prose and JSON output         Vendor-native structured output features
  Business-rule violations            Separate schema vs business-rule checks
Five predictable failure modes and targeted production fixes

When to avoid structured output

Not every LLM response should be forced into JSON.

Structured output is the right tool when the downstream consumer is code. It is often the wrong tool when the downstream consumer is a human who needs flexible narrative explanation.

Avoid or minimize structured output when:

  1. The task is exploratory and open-ended
  2. The user mainly wants a narrative answer
  3. The structure would be mostly empty or artificial
  4. You are forcing prose into JSON just to feel "production-ready"

This matters because over-structuring has a cost:

  1. More prompt complexity
  2. More schema maintenance
  3. More retries
  4. Less expressive answers

Sometimes the best pattern is hybrid:

  1. structured object for machine logic
  2. optional free-text explanation for humans

That gives your application what it needs without pretending every task is naturally a schema.

Python example: schema plus retry loop

This example shows the core production pattern:

  1. define a schema in code
  2. call the model
  3. validate the result
  4. retry with the validation error if needed
# pip install -U openai pydantic
from typing import Literal
from pydantic import BaseModel, ValidationError
from openai import OpenAI
 
client = OpenAI()
 
class TicketSummary(BaseModel):
    issue_type: Literal["refund_delay", "bug_report", "account_access", "other"]
    priority: Literal["low", "medium", "high"]
    next_action: str
    customer_sentiment: Literal["negative", "neutral", "positive"]
 
def extract_ticket_summary(message: str, max_retries: int = 2) -> TicketSummary:
    errors = None
 
    for attempt in range(max_retries + 1):
        repair_note = ""
        if errors:
            repair_note = f"\nPrevious validation error: {errors}\nReturn corrected JSON only."
 
        response = client.responses.create(
            model="gpt-5.4-mini",
            text={
                "format": {
                    "type": "json_schema",
                    "name": "ticket_summary",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "issue_type": {
                                "type": "string",
                                "enum": ["refund_delay", "bug_report", "account_access", "other"]
                            },
                            "priority": {
                                "type": "string",
                                "enum": ["low", "medium", "high"]
                            },
                            "next_action": {"type": "string"},
                            "customer_sentiment": {
                                "type": "string",
                                "enum": ["negative", "neutral", "positive"]
                            }
                        },
                        "required": [
                            "issue_type",
                            "priority",
                            "next_action",
                            "customer_sentiment"
                        ],
                        "additionalProperties": False
                    },
                    "strict": True
                }
            },
            input=(
                "Extract a structured support summary from this message.\n"
                f"Message: {message}"
                f"{repair_note}"
            ),
        )
 
        try:
            return TicketSummary.model_validate_json(response.output_text)
        except ValidationError as exc:
            errors = str(exc)
 
    raise ValueError(f"Could not produce valid structured output after retries: {errors}")
 
summary = extract_ticket_summary(
    "I was charged twice and support still hasn't fixed the refund after five days."
)
print(summary.model_dump())

This example uses OpenAI's schema-based output format, but the reliability pattern is portable. The same application structure works with Anthropic tool inputs or a Pydantic + Instructor wrapper.

A practical production pattern

If you want one durable rule for structured output in production, use this:

  1. Define the schema in code
  2. Use the strongest native structured-output feature your provider offers
  3. Validate in the application anyway
  4. Retry with validation errors when useful
  5. Monitor schema success rate like any other production metric

That is the difference between "the model usually gives JSON" and "the system reliably produces typed data."

What this means

Reliable JSON is not a prompt flourish. It is how you make LLMs interoperable with the rest of your software stack.

If the model output is going to trigger logic, write records, feed workflows, render UI, or call tools, structure should be the default assumption. The right question is not whether you can get JSON once in a notebook. It is whether your system can keep getting valid, typed, enforceable output after prompt changes, model upgrades, and edge-case inputs.

That is why the winning production pattern is layered: native structure when available, schema validation in code, and explicit retry or repair when needed. Once you treat structured output as a contract instead of a suggestion, LLM systems become much easier to debug, test, and trust.
