
Structured output: getting reliable JSON from any LLM (2026)

Why structured outputs matter, how JSON mode and schema enforcement differ, and practical patterns for getting reliable JSON from LLMs in production.

By Knovo Team · 2026-04-12 · 11 min read · Last verified 2026-04-12

Structured output is where a lot of LLM prototypes stop being demos and start becoming software.

It is easy to build a chatbot that returns fluent text. It is much harder to build a production system that emits JSON your application can trust. Pipelines break on missing keys, extra prose, malformed arrays, trailing commas, wrong enum values, and schema drift across retries. The model may look "mostly correct" to a human and still be unusable to code.

That is why structured output is not a prompt trick. It is an interface design problem. If an LLM sits inside a workflow that triggers actions, writes to a database, populates UI components, or calls downstream services, free-form prose is usually the wrong contract. What you need is validated structure.

Why unstructured output breaks pipelines

Humans are tolerant of ambiguity. Software is not.

If a support summarizer returns:

Customer is upset about delayed refund.
Priority: high
Recommended next step: escalate to billing

that may be perfectly readable to a person. But it is not enough for a system that expects:

{
  "issue_type": "refund_delay",
  "priority": "high",
  "next_action": "escalate_billing"
}

The gap matters because text that is "close enough" for a human is still broken for code.

Typical failure modes from unstructured output:

  1. The model wraps JSON in Markdown fences
  2. The model adds an explanation before or after the JSON
  3. A required field is missing
  4. A field uses the wrong type
  5. An enum value is semantically correct but not one of the allowed values
  6. Nested objects drift from the schema over time

That is why production systems should stop asking, "Did the model answer well?" and start asking, "Did the model satisfy the contract?"

[Diagram: Structured output production pipeline — prompt + schema → LLM → JSON output → validate against schema → typed data, with a retry-on-validation-error loop]
Schema-constrained generation, application-side validation, and retry loop on failure

JSON mode vs constrained decoding vs schema validation

These three ideas are related, but they are not the same thing.

JSON mode

JSON mode tells the model to output valid JSON syntax. That is useful, but limited.

If you only use JSON mode, you may get:

  1. Valid JSON with the wrong keys
  2. Valid JSON with the wrong types
  3. Valid JSON with missing required fields
  4. Valid JSON that does not match your application schema

JSON mode solves syntax reliability, not semantic reliability. It is helpful, but it is not full schema enforcement.
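The gap is easy to demonstrate: the payload below parses as valid JSON, yet fails an application schema check. The `TicketRoute` model here is hypothetical, standing in for your real contract.

```python
import json
from typing import Literal

from pydantic import BaseModel, ValidationError

class TicketRoute(BaseModel):
    priority: Literal["low", "medium", "high"]
    route: str

raw = '{"priority": "urgent", "destination": "billing"}'

data = json.loads(raw)  # succeeds: the syntax is valid JSON
print("parsed:", data)

try:
    TicketRoute.model_validate(data)
except ValidationError as exc:
    # fails: "urgent" is not an allowed priority, and "route" is missing
    print("schema errors:", exc.error_count())
```

JSON mode alone would call this response a success; schema validation correctly rejects it.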

Constrained decoding

Constrained decoding restricts what the model is allowed to generate so the output conforms more tightly to a schema or grammar.

This is much stronger than "please return JSON." It reduces the space of possible outputs during generation itself. In practice, that means fewer malformed responses and less post-processing work.

When people say structured outputs are now much more reliable, this is often what they mean. The model is not merely encouraged to follow a schema. The decoding path is constrained around it.
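A toy illustration of the idea (not a real decoder): given a set of allowed enum values, a constrained generator only considers continuations that keep the output inside the allowed set, masking everything else.

```python
def allowed_next_chars(prefix: str, allowed: list[str]) -> list[str]:
    """Characters that keep `prefix` a valid start of some allowed value."""
    return sorted({
        value[len(prefix)]
        for value in allowed
        if value.startswith(prefix) and len(value) > len(prefix)
    })

priorities = ["low", "medium", "high"]

# At the start, only the first characters of allowed values are legal.
print(allowed_next_chars("", priorities))   # ['h', 'l', 'm']

# After emitting "h", the only legal continuation is "i" (toward "high"),
# so a constrained decoder masks every other token.
print(allowed_next_chars("h", priorities))  # ['i']
```

Real implementations work over tokens and full grammars rather than characters, but the principle is the same: invalid outputs are unreachable by construction.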

Schema validation

Schema validation happens after generation. You parse the model output and check it against a schema definition, often JSON Schema or a typed model such as Pydantic.

Validation is essential even if you use vendor-native structured output features, because:

  1. Your application still needs a typed object
  2. You may support multiple model vendors
  3. Some workflows involve fallback or repair steps
  4. Validation lets you detect drift explicitly instead of guessing

In production, the safest pattern is usually not one of these in isolation. It is a layered approach:

  1. Ask for structure using the strongest vendor-native feature available
  2. Validate the result against your application schema
  3. Retry or repair if validation fails

OpenAI Structured Outputs API

OpenAI's structured output support is the cleanest version of schema-first generation in a hosted API today.

The official OpenAI docs describe Structured Outputs as a way to supply a JSON Schema so the model returns data matching that schema, rather than simply "some JSON." This is stronger than old-style JSON mode because the schema becomes part of the generation contract.

Why this matters in practice

This changes the development workflow:

  1. Define the schema first
  2. Let the model generate into that contract
  3. Parse the result as typed data

That is much better than generating free-form text and trying to recover structure with regexes or fragile parsers afterward.

What OpenAI's approach is good at

OpenAI's current structured-output approach is especially strong when:

  1. Your schema is known up front
  2. You need deterministic application fields
  3. The output will flow directly into code
  4. You want less repair logic in the happy path

It is a good fit for extraction, classification, route planning, action arguments, compliance flags, UI configs, and other machine-consumed outputs.

Where it still needs engineering

Even with Structured Outputs, you still need:

  1. Sensible schemas
  2. Validation on the application side
  3. Retry logic for provider or transport failures
  4. Monitoring around schema success rate

Strong vendor support reduces errors. It does not remove the need for production discipline.

Anthropic tool use for schema enforcement

Anthropic's strongest structured-output pattern is tool use.

In the official tool-use docs, Anthropic defines tools with an input_schema field using JSON Schema. In practice, this means you can describe the shape of the tool arguments you want, and the model responds by producing tool input that matches that schema instead of improvising prose.

This is not exactly the same product surface as OpenAI's Structured Outputs, but the practical result is similar: you get structured, machine-usable data with a schema contract.
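For example, a ticket-summary extraction can be exposed as a tool whose `input_schema` carries the contract. The tool name and fields below are illustrative, but the shape follows Anthropic's documented tool-definition format.

```python
# An Anthropic-style tool definition: the JSON Schema in `input_schema`
# constrains the tool arguments the model produces.
ticket_tool = {
    "name": "record_ticket_summary",
    "description": "Record a structured summary of a support message.",
    "input_schema": {
        "type": "object",
        "properties": {
            "issue_type": {
                "type": "string",
                "enum": ["refund_delay", "bug_report", "account_access", "other"],
            },
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
            "next_action": {"type": "string"},
        },
        "required": ["issue_type", "priority", "next_action"],
    },
}
```

Passed in the `tools` parameter of a messages request, the model's tool-use response then carries input matching this schema, which your application validates like any other payload.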

Why tool use is a strong structured-output pattern

Anthropic's tool pattern is especially good because it aligns structure with action.

If the model is going to:

  1. create a ticket
  2. classify a message
  3. extract entities
  4. plan a workflow step

then tool arguments are often the cleanest place to enforce schema. The output is not "text about the action." It is the typed payload for the action.

When Anthropic tool use is the right choice

This pattern is strongest when:

  1. The result naturally maps to a tool or function call
  2. You want schema enforcement plus action routing
  3. You are already building agent-style workflows
  4. Your application treats structure as input to another system

If your use case is extraction without downstream tools, you can still use tool definitions as a structured-output interface. That often works better than trying to parse plain assistant text.

The Pydantic + Instructor pattern

Vendor-native structured output is useful, but many teams want a cross-provider application layer. That is where the Pydantic + Instructor pattern is valuable.

Instructor is a Python library built around typed output models. Its official docs describe using Pydantic models as the desired response shape and letting the library handle parsing and retries around provider calls.

This is a strong pattern because it moves your real contract into application code.

Why Pydantic helps

Pydantic gives you:

  1. Required fields
  2. Type validation
  3. Enums
  4. Nested models
  5. Useful validation errors

That means your schema is not scattered across prompt text. It lives in code, where the rest of your application can depend on it.
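A minimal contract in code might look like this (the field names are illustrative). Note how a bad enum value and a missing nested field both produce precise, machine-readable errors instead of silent drift.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class Customer(BaseModel):
    name: str
    region: Literal["emea", "amer", "apac"]

class Escalation(BaseModel):
    priority: Literal["low", "medium", "high"]
    reason: str = Field(min_length=1)
    customer: Customer  # nested model, validated recursively

try:
    Escalation.model_validate({
        "priority": "urgent",            # not an allowed enum value
        "reason": "duplicate charge",
        "customer": {"name": "Ada"},     # missing required "region"
    })
except ValidationError as exc:
    for err in exc.errors():
        print(err["loc"], err["type"])
```

Each error carries the exact location and failure type, which is what makes the retry loop in the next section possible.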

Why Instructor helps

Instructor adds a practical wrapper around this pattern by:

  1. Calling the model
  2. Parsing the response into the Pydantic model
  3. Retrying when validation fails

That is exactly the sort of glue production teams end up writing anyway. Using a library for it reduces repeated plumbing.

When this pattern is best

The Pydantic + Instructor pattern is especially useful when:

  1. You support multiple vendors
  2. Your typed schema matters more than a provider-specific feature surface
  3. You want validation and retries unified in Python
  4. Your backend already uses Pydantic models

It does not replace vendor-native features. It gives you a strong application layer over them.

Schema design matters more than people think

Many structured-output problems are not really model problems. They are schema problems.

If your schema is confusing, over-nested, inconsistent, or semantically vague, even a strong model will struggle to produce clean output. Teams often blame the provider when the deeper issue is that the contract itself is badly designed.

Keep schemas narrow

A production schema should be as small as the workflow allows.

Every extra field creates more failure surface:

  1. another place for enum drift
  2. another missing-value risk
  3. another business rule to validate
  4. another prompt burden on the model

This is why "return everything we might ever want" is usually a bad idea. The best schema is often not the most complete one. It is the smallest one that supports the downstream system.

Prefer explicit enums over vague text

If the application needs routing logic, do not ask for open-ended descriptions where a controlled enum would do.

Weak:

{"priority_reason": "This looks pretty urgent"}

Stronger:

{"priority": "high"}

You can still keep a free-text explanation field if humans need it. But machine logic should depend on constrained values whenever possible.

Separate machine fields from human explanation

One of the cleanest production patterns is to split outputs into:

  1. fields the system needs for logic
  2. fields humans may want for explanation

For example:

  1. route = "billing"
  2. confidence = "high"
  3. explanation = "The user mentions duplicate charges and refund delay."

That pattern keeps the machine-facing contract stable without sacrificing interpretability for operators or reviewers.
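In schema terms, that split can be as simple as the hypothetical model below: constrained fields for logic, one free-text field for humans.

```python
from typing import Literal, Optional

from pydantic import BaseModel

class RoutingDecision(BaseModel):
    # Machine-facing fields: constrained enums, safe to branch on
    route: Literal["billing", "technical", "account", "other"]
    confidence: Literal["low", "medium", "high"]
    # Human-facing field: free text, never used in application logic
    explanation: Optional[str] = None

decision = RoutingDecision(
    route="billing",
    confidence="high",
    explanation="The user mentions duplicate charges and refund delay.",
)

# Application logic depends only on the constrained fields.
if decision.route == "billing" and decision.confidence == "high":
    print("escalate to billing")
```

If the explanation drifts stylistically between model versions, nothing breaks, because no code path reads it.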

Avoid deep nesting unless the domain truly needs it

Deeply nested JSON looks elegant in design docs and often behaves badly in practice.

If the model has to build three layers of nested arrays and objects, failures increase:

  1. required children go missing
  2. ordering becomes inconsistent
  3. retries become harder to reason about

Flatten where you can. Nest only where it reflects the actual domain model and reduces ambiguity.

Retry and repair loops

Even with schema-aware generation, structured output should be treated as a reliability pipeline, not a single API call.

Why? Because failures still happen:

  1. Provider responses time out
  2. The model chooses a near-miss enum value
  3. A field violates a business rule the JSON Schema did not capture
  4. An upstream prompt change causes subtle drift

That is where retry and repair loops come in.

Retry

A retry loop asks the model again, usually with:

  1. the original task
  2. the validation error
  3. the same schema

This works best when the first failure is a minor mismatch rather than a fundamental misunderstanding of the task.

Repair

A repair loop tries to fix malformed or semantically wrong output using either:

  1. a cheaper model
  2. the same model with a narrower instruction
  3. deterministic code when the issue is simple

Repair is useful when the output is close enough to salvage. For example:

  1. strip Markdown fences
  2. coerce simple types
  3. map synonymous enum labels
  4. fill missing optional fields with defaults
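A deterministic first-pass repair layer can be sketched as follows. The synonym map is an assumption you would tune per schema, and it should stay narrow: only well-understood, reviewed mappings belong in it.

```python
import json
import re

# Safe synonym map: only reviewed, unambiguous mappings belong here.
PRIORITY_SYNONYMS = {"urgent": "high", "critical": "high", "minor": "low"}

def strip_fences(text: str) -> str:
    """Remove Markdown code fences like ```json ... ``` around a payload."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, flags=re.DOTALL)
    return match.group(1) if match else text.strip()

def repair_payload(text: str) -> dict:
    data = json.loads(strip_fences(text))
    # Map known-safe enum synonyms to allowed values.
    if data.get("priority") in PRIORITY_SYNONYMS:
        data["priority"] = PRIORITY_SYNONYMS[data["priority"]]
    # Fill a missing optional field with a default.
    data.setdefault("next_action", "review")
    return data

raw = '```json\n{"priority": "urgent"}\n```'
print(repair_payload(raw))  # {'priority': 'high', 'next_action': 'review'}
```

Anything this layer cannot fix deterministically should fall through to a retry, not to a more aggressive repair.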

The danger is over-repair. If you repair too aggressively, you can hide real model failures instead of surfacing them.

The best production approach is usually:

  1. parse strictly
  2. repair only small, well-understood issues
  3. retry on true schema failures
  4. alert if failure rates climb

Observability and testing for structured output

Structured output should be monitored like any other production interface.

If a normal API started returning malformed payloads 6% of the time, nobody would call that acceptable. LLM output contracts should be held to the same standard.

Metrics that actually matter

The most useful metrics are usually:

  1. schema pass rate
  2. retry rate
  3. repair rate
  4. business-rule failure rate
  5. average attempts per successful call

These metrics help you distinguish between prompt issues, model drift, schema design problems, and provider instability.

For example:

  1. high parse failures usually mean syntax or contract issues
  2. high business-rule failures usually mean the schema is too weak
  3. rising retry rates after a prompt edit usually mean the prompt shifted the model away from the contract
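With per-call logging in place, these rates fall out of simple counters. The sketch below assumes a hypothetical per-request log record; adapt the fields to whatever your pipeline already captures.

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    attempts: int        # total model calls made for this request
    repaired: bool       # whether a deterministic repair was applied
    passed_schema: bool  # whether the final output validated

def summarize(records: list[CallRecord]) -> dict:
    n = len(records)
    successes = sum(r.passed_schema for r in records)
    return {
        "schema_pass_rate": successes / n,
        "retry_rate": sum(r.attempts > 1 for r in records) / n,
        "repair_rate": sum(r.repaired for r in records) / n,
        "avg_attempts_per_success": (
            sum(r.attempts for r in records if r.passed_schema)
            / max(1, successes)
        ),
    }

records = [
    CallRecord(attempts=1, repaired=False, passed_schema=True),
    CallRecord(attempts=2, repaired=True, passed_schema=True),
    CallRecord(attempts=3, repaired=False, passed_schema=False),
    CallRecord(attempts=1, repaired=False, passed_schema=True),
]
print(summarize(records))
```

Tracked over time and segmented by prompt version and model version, these numbers turn "the extractor feels flaky" into a diagnosable regression.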

Keep a structured-output test set

The easiest way to break a production extractor is to change the prompt or model without a test set.

A good regression set should include:

  1. normal cases
  2. edge cases
  3. ambiguous inputs
  4. adversarial formatting
  5. known failure examples from production

Then test two things separately:

  1. schema validity
  2. semantic correctness

Those are not the same. A response can validate perfectly and still classify the input incorrectly.
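The two checks can live side by side in a regression harness. In this sketch, `classify` is a stand-in for your real model call, and the routing rule inside it is purely illustrative.

```python
from typing import Literal

from pydantic import BaseModel

class RouteOutput(BaseModel):
    route: Literal["billing", "technical", "account"]

def classify(message: str) -> dict:
    # Stand-in for the real model call; returns a raw dict to validate.
    return {"route": "technical" if "crash" in message else "billing"}

cases = [
    ("I was charged twice", "billing"),          # normal case
    ("The app crashes on login", "technical"),   # normal case
]

for message, expected_route in cases:
    raw = classify(message)
    # Check 1: schema validity — does the output satisfy the contract?
    parsed = RouteOutput.model_validate(raw)
    # Check 2: semantic correctness — is the classification right?
    assert parsed.route == expected_route, (message, parsed.route)

print("all cases passed")
```

Reporting the two failure counts separately tells you whether to fix the contract or the prompt.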

Review failures as categories, not anecdotes

When structured outputs fail, do not just fix the single broken sample and move on. Label the failure type:

  1. syntax
  2. schema shape
  3. enum drift
  4. business rule
  5. misunderstanding of task

That turns debugging into system improvement instead of prompt superstition.

Common failure modes

Structured output failures are predictable. That is good news because predictable failures can be instrumented.

1. Syntax is valid, schema is wrong

This is the classic JSON mode problem. The model returns valid JSON, but the shape is wrong.

Fix:

  1. Use schema-constrained generation when available
  2. Validate with Pydantic or JSON Schema
  3. Fail fast instead of silently accepting drift

2. Enum drift

The model returns "urgent" when the schema expects "high".

Fix:

  1. Use explicit enums
  2. Show examples when the categories are subtle
  3. Add a narrow repair map only for safe synonyms

3. Missing nested fields

The outer object is correct, but nested structure is incomplete.

Fix:

  1. Keep nesting only as deep as necessary
  2. Mark truly required fields as required
  3. Retry with the validation error attached

4. Mixed prose and JSON

The model adds "Here is the JSON:" before the object or wraps it in code fences.

Fix:

  1. Prefer vendor-native structured output features
  2. Strip code fences only as a first-pass repair, not as the main solution
  3. Treat repeated occurrences as a prompt or provider-contract problem

5. Business-rule violations

The JSON matches the schema, but the result is still wrong for your application. For example, the date format is valid but outside an allowed range.

Fix:

  1. Separate schema validation from business validation
  2. Feed business-validation errors back into retries
  3. Log business-rule failure rates independently

Structured Output Failure Modes

  Failure mode                        Fix
  Syntax valid, wrong schema          Use schema-constrained generation
  Enum drift ("urgent" vs "high")     Explicit enums + repair map for safe synonyms
  Missing nested fields               Flatten depth + retry with validation error
  Mixed prose and JSON output         Vendor-native structured output features
  Business-rule violations            Separate schema vs business-rule checks
Five predictable failure modes and targeted production fixes

When to avoid structured output

Not every LLM response should be forced into JSON.

Structured output is the right tool when the downstream consumer is code. It is often the wrong tool when the downstream consumer is a human who needs flexible narrative explanation.

Avoid or minimize structured output when:

  1. The task is exploratory and open-ended
  2. The user mainly wants a narrative answer
  3. The structure would be mostly empty or artificial
  4. You are forcing prose into JSON just to feel "production-ready"

This matters because over-structuring has a cost:

  1. More prompt complexity
  2. More schema maintenance
  3. More retries
  4. Less expressive answers

Sometimes the best pattern is hybrid:

  1. structured object for machine logic
  2. optional free-text explanation for humans

That gives your application what it needs without pretending every task is naturally a schema.

Python example: schema plus retry loop

This example shows the core production pattern:

  1. define a schema in code
  2. call the model
  3. validate the result
  4. retry with the validation error if needed
# pip install -U openai pydantic
from typing import Literal
from pydantic import BaseModel, ValidationError
from openai import OpenAI
 
client = OpenAI()
 
class TicketSummary(BaseModel):
    issue_type: Literal["refund_delay", "bug_report", "account_access", "other"]
    priority: Literal["low", "medium", "high"]
    next_action: str
    customer_sentiment: Literal["negative", "neutral", "positive"]
 
def extract_ticket_summary(message: str, max_retries: int = 2) -> TicketSummary:
    errors = None
 
    for attempt in range(max_retries + 1):
        repair_note = ""
        if errors:
            repair_note = f"\nPrevious validation error: {errors}\nReturn corrected JSON only."
 
        response = client.responses.create(
            model="gpt-5.4-mini",
            text={
                "format": {
                    "type": "json_schema",
                    "name": "ticket_summary",
                    "schema": {
                        "type": "object",
                        "properties": {
                            "issue_type": {
                                "type": "string",
                                "enum": ["refund_delay", "bug_report", "account_access", "other"]
                            },
                            "priority": {
                                "type": "string",
                                "enum": ["low", "medium", "high"]
                            },
                            "next_action": {"type": "string"},
                            "customer_sentiment": {
                                "type": "string",
                                "enum": ["negative", "neutral", "positive"]
                            }
                        },
                        "required": [
                            "issue_type",
                            "priority",
                            "next_action",
                            "customer_sentiment"
                        ],
                        "additionalProperties": False
                    },
                    "strict": True
                }
            },
            input=(
                "Extract a structured support summary from this message.\n"
                f"Message: {message}"
                f"{repair_note}"
            ),
        )
 
        try:
            return TicketSummary.model_validate_json(response.output_text)
        except ValidationError as exc:
            errors = str(exc)
 
    raise ValueError(f"Could not produce valid structured output after retries: {errors}")
 
summary = extract_ticket_summary(
    "I was charged twice and support still hasn't fixed the refund after five days."
)
print(summary.model_dump())

This example uses OpenAI's schema-based output format, but the reliability pattern is portable. The same application structure works with Anthropic tool inputs or a Pydantic + Instructor wrapper.

A practical production pattern

If you want one durable rule for structured output in production, use this:

  1. Define the schema in code
  2. Use the strongest native structured-output feature your provider offers
  3. Validate in the application anyway
  4. Retry with validation errors when useful
  5. Monitor schema success rate like any other production metric

That is the difference between "the model usually gives JSON" and "the system reliably produces typed data."

What this means

Reliable JSON is not a prompt flourish. It is how you make LLMs interoperable with the rest of your software stack.

If the model output is going to trigger logic, write records, feed workflows, render UI, or call tools, structure should be the default assumption. The right question is not whether you can get JSON once in a notebook. It is whether your system can keep getting valid, typed, enforceable output after prompt changes, model upgrades, and edge-case inputs.

That is why the winning production pattern is layered: native structure when available, schema validation in code, and explicit retry or repair when needed. Once you treat structured output as a contract instead of a suggestion, LLM systems become much easier to debug, test, and trust.
