Generating structured data from unstructured inputs is one of the core use cases for AI. Recent release from OpenAI has introduced a new Structured Outputs feature that allows you to generate JSONs with a given schema. While it promises guarantees of correctness, I was curious to see how well it performs and what are the caveats.
In this post, I benchmark two different OpenAI models across three different methods of JSON generation and analyze the results focusing on error rates, performance and costs.
Disclaimer: This is not a comprehensive benchmark, but rather a quick experiment that aimed to answer a specific question. Most importantly, it’s not focusing on the accuracy of the generated data, but rather on the performance of the LLMs.
Experiment Setup
Task: Generate a JSON object adhering to the schema provided. Think of mocking data for a database.
I tested six LLM configurations across three JSON schema generation methods:
gpt-4o-mini
generating JSONs using Tool Callsgpt-4o-mini
generating JSONs using JSON modegpt-4o-mini
generating JSONs using JSON mode usingstrict = True
gpt-4o-2024-08-06
generating JSONs using Tool Callsgpt-4o-2024-08-06
generating JSONs using JSON modegpt-4o-2024-08-06
generating JSONs using JSON mode usingstrict = True
And three JSON schema complexities:
- Wide JSON: 25 fields with 1 level of nesting
- Complex JSON: 25 fields with 5 levels of nesting
- Super Complex JSON: 100 fields with 5 levels of nesting
I wanna see the Super Complex JSON!
Each configuration was tested 50
times and the average time taken was recorded. Full code used to run the experiment can be found in the GitHub repo.
Full results log:
Results
Complex JSON Schema
Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost | Cost Diff |
---|---|---|---|---|---|
gpt-4o-2024-08-06-non-strict-tool | 4079.0854 | 0 | 100.0000% | 0.1680 | +0.1534 |
gpt-4o-mini-non-strict-json | 5847.6183 | +1768.5329 | 100.0000% | 0.0175 | +0.0029 |
gpt-4o-2024-08-06-strict-json | 5866.2200 | +1787.1346 | 100.0000% | 0.1528 | +0.1382 |
gpt-4o-2024-08-06-non-strict-json | 6314.3933 | +2235.3079 | 100.0000% | 0.3026 | +0.2880 |
gpt-4o-mini-strict-json | 7858.5114 | +3779.4260 | 100.0000% | 0.0146 | 0 |
gpt-4o-mini-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
Wide JSON Schema
Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost | Cost Diff |
---|---|---|---|---|---|
gpt-4o-mini-non-strict-tool | 2943.3405 | 0 | 100.0000% | 0.0078 | +0.0018 |
gpt-4o-2024-08-06-non-strict-tool | 3149.5288 | +206.1883 | 100.0000% | 0.1313 | +0.1253 |
gpt-4o-mini-non-strict-json | 3603.0182 | +659.6777 | 100.0000% | 0.0115 | +0.0055 |
gpt-4o-2024-08-06-strict-json | 3852.8140 | +909.4735 | 100.0000% | 0.1023 | +0.0963 |
gpt-4o-2024-08-06-non-strict-json | 4065.5266 | +1122.1861 | 100.0000% | 0.1964 | +0.1904 |
gpt-4o-mini-strict-json | 4530.8732 | +1587.5327 | 100.0000% | 0.0060 | 0 |
###s Super Complex JSON Schema
Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost | Cost Diff |
---|---|---|---|---|---|
gpt-4o-2024-08-06-strict-json | 10743.0250 | 0 | 100.0000% | 0.3004 | +0.2763 |
gpt-4o-mini-non-strict-json | 12884.3888 | +2141.3638 | 100.0000% | 0.0393 | +0.0152 |
gpt-4o-2024-08-06-non-strict-json | 13041.2693 | +2298.2443 | 100.0000% | 0.6619 | +0.6378 |
gpt-4o-mini-strict-json | 13289.3869 | +2546.3619 | 100.0000% | 0.0241 | 0 |
gpt-4o-2024-08-06-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
gpt-4o-mini-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
Key Findings
Strict vs Non-Strict: A Performance Showdown
Our benchmark revealed interesting insights into the performance differences between strict and non-strict modes across various JSON complexities:
Wide JSON Schema
In the simplest scenario, non-strict modes generally outperformed their strict counterparts:
- The fastest performer was
gpt-4o-mini
with tool calls (2943.3405 ms) gpt-4o-mini
in Strict Mode was significantly slower (4530.8732 ms)gpt-4o-2024-08-06
in Strict Mode (3852.8140 ms) was outpaced by its non-strict counterparts (3149.5288 ms with tool calls, 4065.5266 ms for JSON generation)
However, all configurations achieved a 100% success rate, indicating that strictness might be overkill for simple structures.
Complex JSON Schema
As complexity increased, the performance gap narrowed:
gpt-4o-2024-08-06
with tool calls led (4079.0854 ms)gpt-4o-2024-08-06
in Strict Mode followed closely (5866.2200 ms)- Notably,
gpt-4o-mini
with tool calls failed completely, while its strict counterpart succeeded
This suggests that strictness becomes more valuable as JSON complexity increases, especially for less advanced models.
Super Complex JSON Schema
In the most challenging scenario, strict modes shined:
gpt-4o-2024-08-06
in Strict Mode was the top performer- Both non-strict tool methods failed entirely
- Strict JSON methods maintained 100% success rates
- Surprisingly, smaller model in strict mode was even slower than the bigger model in strict mode
This underscores the critical importance of strictness in handling highly complex JSON structures.
The Cold Start in Strict Mode
As Ted Sanders mentioned in this HN comment, using strict mode bears a significant cold start penalty which goes away in the subsequent runs.
The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don’t want that latency hit (e.g., you’re prototyping, or have a use case that uses variable one-off schemas), then you might prefer “strict”: false
How much slower it is? Here are my results:
Model | schema | avgFirstRequestTime | avgSecondRequestTime | coldStartPenalty |
---|---|---|---|---|
gpt-4o-2024-08-06 | Wide JSON Schema | 20234.0549 | 5927.3556 | 241.37% |
gpt-4o-mini | Wide JSON Schema | 21801.5501 | 5800.8192 | 275.84% |
gpt-4o-2024-08-06 | Complex JSON Schema | 24089.9075 | 7100.4283 | 239.27% |
gpt-4o-mini | Complex JSON Schema | 26665.4039 | 10270.7880 | 159.62% |
gpt-4o-2024-08-06 | Super Complex JSON Schema | 60481.4465 | 11698.9430 | 416.98% |
gpt-4o-mini | Super Complex JSON Schema | 66011.3763 | 13994.1616 | 371.71% |
The more complex the JSON schema, the more painful the cold start penalty becomes.
Strictness and Cost: An Unexpected Relationship
Interestingly, the impact of strictness on cost varied between model versions:
- For the
gpt-4o-mini
model, strict mode was generally cheaper (e.g., $0.0060 vs $0.0115 for Wide JSON) - For the
2024-08-06
model, strict mode was more cost-effective in all the cases
Caveats
While strict mode is superior for more complicated cases, it has limitations that might be disqualifying it from using it in your case:
- Not all JSONSchema types & features are supported in strict mode. Things like
allOf
,oneOf
,not
,definitions
,min
,pattern
, and most importantlyformat
are not supported - JSON can be nested only up to 5 levels and have up to 100 properties
additionalProperties
is not supported- All fields are required (but you can make them optional by doing the
"type": ["string", "null"]
trick)
My Practical Recommendations
Based on my findings, I recommend the following approaches:
-
For Simple JSON Structures:
- Prefer non-strict modes, especially tool-based methods for speed and cost-effectiveness
- Go with smaller mini model if you can (but don’t forget about potential failures, wrap in try/catch accordingly)
-
For Moderately Complex JSON:
- Use non-strict modes with more advanced models (e.g., gpt-4o-2024-08-06 with tool calls)
- Always validate the output using Zod schemas
-
For Highly Complex JSON:
- Strict modes are essential for reliability and success
- Use the most advanced model available (e.g.,
gpt-4o-2024-08-06
in Strict Mode) - Be prepared for very significant cold start penalties (goes away in the subsequent runs, neglible when running multiple times in a row)
-
Cost Considerations:
- For simpler tasks, non-strict modes are generally more cost-effective
- For complex tasks, strict modes can be more cost-effective, especially with advanced models
Found this insightful? If you're interested in my AI consulting, please reach out to me via email or on X