Generating structured data from unstructured inputs is one of the core use cases for AI. A recent OpenAI release introduced a Structured Outputs feature that lets you generate JSON conforming to a given schema. While it promises guaranteed correctness, I was curious to see how well it performs and what the caveats are.
In this post, I benchmark two different OpenAI models across three different methods of JSON generation and analyze the results focusing on error rates, performance and costs.
Disclaimer: This is not a comprehensive benchmark but a quick experiment aimed at answering a specific question. Most importantly, it does not measure the accuracy of the generated data, only how the LLMs perform when generating it.
Experiment Setup
Task: Generate a JSON object adhering to the schema provided. Think of mocking data for a database.
I tested six LLM configurations (two models, each with three JSON generation methods):
- gpt-4o-mini generating JSONs using Tool Calls
- gpt-4o-mini generating JSONs using JSON mode
- gpt-4o-mini generating JSONs using JSON mode with strict = True
- gpt-4o-2024-08-06 generating JSONs using Tool Calls
- gpt-4o-2024-08-06 generating JSONs using JSON mode
- gpt-4o-2024-08-06 generating JSONs using JSON mode with strict = True
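The six configurations are simply the cross product of two models and three methods. A minimal sketch of that enumeration follows; the labels match the ones that appear in the results tables, while the type and function names are illustrative and not taken from the benchmark code.

```typescript
// Method labels as they appear in the results tables of this post.
type Method = "non-strict-tool" | "non-strict-json" | "strict-json";

interface Configuration {
  model: string;
  method: Method;
  label: string; // e.g. "gpt-4o-mini-strict-json"
}

const MODELS: string[] = ["gpt-4o-mini", "gpt-4o-2024-08-06"];
const METHODS: Method[] = ["non-strict-tool", "non-strict-json", "strict-json"];

// Cross product of models and methods: 2 x 3 = 6 configurations.
function buildConfigurations(models: string[], methods: Method[]): Configuration[] {
  const configs: Configuration[] = [];
  models.forEach((model) => {
    methods.forEach((method) => {
      configs.push({ model, method, label: model + "-" + method });
    });
  });
  return configs;
}

const configurations = buildConfigurations(MODELS, METHODS);
```

Each entry in `configurations` can then be paired with each of the schemas to produce the full test matrix.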
And three JSON schema complexities:
- Wide JSON: 25 fields with 1 level of nesting
- Complex JSON: 25 fields with 5 levels of nesting
- Super Complex JSON: 100 fields with 5 levels of nesting
The Super Complex JSON schema:

    import { z } from "zod";

    export const superComplexJsonSchema = z.object({
      id: z.number().int(),
      name: z.string(),
      details: z.object({
        personalInfo: z.object({
          age: z.number().int(),
          dateOfBirth: z.string(),
          nationality: z.string(),
          maritalStatus: z.enum(["single", "married", "divorced", "widowed"]),
          contact: z.object({
            email: z.string(),
            phone: z.string(),
            alternativePhone: z.string(),
            address: z.object({
              street: z.string(),
              city: z.string(),
              state: z.string(),
              country: z.string(),
              postalCode: z.string(),
              latitude: z.number(),
              longitude: z.number(),
            }),
          }),
        }),
        professionalInfo: z.object({
          occupation: z.string(),
          currentEmployer: z.string(),
          yearsOfExperience: z.number().int(),
          education: z.array(
            z.object({
              institution: z.string(),
              degree: z.string(),
              fieldOfStudy: z.string(),
              graduationYear: z.number().int(),
              gpa: z.number(),
            }),
          ),
          certifications: z.array(
            z.object({
              name: z.string(),
              issuingOrganization: z.string(),
              dateObtained: z.string(),
              expirationDate: z.string(),
            }),
          ),
          experience: z.array(
            z.object({
              company: z.string(),
              position: z.string(),
              startDate: z.string(),
              endDate: z.string(),
              isCurrent: z.boolean(),
              responsibilities: z.array(z.string()),
              skills: z.array(z.string()),
              reportsTo: z.string(),
            }),
          ),
          skills: z.array(
            z.object({
              name: z.string(),
              category: z.string(),
              level: z.enum(["beginner", "intermediate", "advanced", "expert"]),
              yearsOfExperience: z.number(),
            }),
          ),
          languages: z.array(
            z.object({
              language: z.string(),
              proficiency: z.enum(["basic", "conversational", "fluent", "native"]),
              certifications: z.array(z.string()),
            }),
          ),
        }),
      }),
      preferences: z.object({
        favoriteColors: z.array(z.string()),
        hobbies: z.array(
          z.object({
            name: z.string(),
            category: z.string(),
            frequency: z.string(),
            yearsOfExperience: z.number(),
            relatedSkills: z.array(z.string()),
          }),
        ),
        travelPreferences: z.object({
          accommodationType: z.enum(["hotel", "hostel", "airbnb", "camping"]),
          budgetPerDay: z.number(),
          preferredTransportation: z.array(z.string()),
        }),
        dietaryRestrictions: z.array(z.string()),
        workPreferences: z.object({
          preferredWorkEnvironment: z.enum(["office", "remote", "hybrid"]),
          desiredSalary: z.number(),
          willingToRelocate: z.boolean(),
          preferredIndustries: z.array(z.string()),
        }),
      }),
      financialInfo: z.object({
        income: z.number(),
        expenses: z.number(),
        assets: z.array(
          z.object({
            type: z.string(),
            value: z.number(),
            purchaseDate: z.string(),
          }),
        ),
        liabilities: z.array(
          z.object({
            type: z.string(),
            amount: z.number(),
            interestRate: z.number(),
          }),
        ),
      }),
      healthInfo: z.object({
        height: z.number(),
        weight: z.number(),
        medications: z.array(
          z.object({
            name: z.string(),
            dosage: z.string(),
            frequency: z.string(),
          }),
        ),
        chronicConditions: z.array(z.string()),
        lastCheckup: z.string(),
      }),
      metadata: z.object({
        createdAt: z.string(),
        lastUpdated: z.string(),
        dataSource: z.string(),
        accessLevel: z.enum(["public", "private", "restricted"]),
        tags: z.array(z.string()),
      }),
    });
Each configuration was tested 50 times and the average time taken was recorded. The full code used to run the experiment can be found in the GitHub repo.
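To make the aggregation concrete, here is a minimal sketch of how per-run measurements could be reduced to the three numbers reported below (average time, success rate, cost). It is an assumption about the bookkeeping, not the code from the repo: the `RunResult` shape is hypothetical, and I assume that time and cost are aggregated over successful runs only, which is consistent with failed configurations showing up as 0 in the log.

```typescript
// Hypothetical shape of a single measured run.
interface RunResult {
  ok: boolean;     // did the run produce schema-valid JSON?
  ms: number;      // wall-clock time of the request in milliseconds
  costUsd: number; // cost of the request (assumed to be tracked per run)
}

interface RunSummary {
  avgTimeMs: number;
  successRate: number; // percentage between 0 and 100
  totalCostUsd: number;
}

// Assumption: time and cost are aggregated over successful runs only.
function summarizeRuns(runs: RunResult[]): RunSummary {
  const successes = runs.filter((r) => r.ok);
  if (successes.length === 0) {
    // Mirrors the all-zero rows of configurations that never succeeded.
    return { avgTimeMs: 0, successRate: 0, totalCostUsd: 0 };
  }
  const totalMs = successes.reduce((sum, r) => sum + r.ms, 0);
  const totalCostUsd = successes.reduce((sum, r) => sum + r.costUsd, 0);
  return {
    avgTimeMs: totalMs / successes.length,
    successRate: (successes.length / runs.length) * 100,
    totalCostUsd,
  };
}
```

In a real harness, each `RunResult` would come from timing one API call and validating its output against the schema.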
Full results log:
Report for schema: Complex JSON Schema
Methods sorted by performance (fastest to slowest):
1. gpt-4o-2024-08-06-non-strict-tool (Complex JSON Schema), Average time: 4079.0854 ms, Success rate: 100.0000%, Cost: 0.1680
2. gpt-4o-mini-non-strict-json (Complex JSON Schema), Average time: 5847.6183 ms, Success rate: 100.0000%, Cost: 0.0175
3. gpt-4o-2024-08-06-strict-json (Complex JSON Schema), Average time: 5866.2200 ms, Success rate: 100.0000%, Cost: 0.1528
4. gpt-4o-2024-08-06-non-strict-json (Complex JSON Schema), Average time: 6314.3933 ms, Success rate: 100.0000%, Cost: 0.3026
5. gpt-4o-mini-strict-json (Complex JSON Schema), Average time: 7858.5114 ms, Success rate: 100.0000%, Cost: 0.0146
6. gpt-4o-mini-non-strict-tool (Complex JSON Schema), Average time: 0.0000 ms, Success rate: 0.0000%, Cost: 0.0000
Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Complex JSON Schema), Cost: 0.0146, Average time: 7858.5114 ms, Success rate: 100.0000%
2. gpt-4o-mini-non-strict-json (Complex JSON Schema), Cost: 0.0175, Average time: 5847.6183 ms, Success rate: 100.0000%
3. gpt-4o-2024-08-06-strict-json (Complex JSON Schema), Cost: 0.1528, Average time: 5866.2200 ms, Success rate: 100.0000%
4. gpt-4o-2024-08-06-non-strict-tool (Complex JSON Schema), Cost: 0.1680, Average time: 4079.0854 ms, Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-json (Complex JSON Schema), Cost: 0.3026, Average time: 6314.3933 ms, Success rate: 100.0000%
6. gpt-4o-mini-non-strict-tool (Complex JSON Schema), Cost: 0.0000, Average time: 0.0000 ms, Success rate: 0.0000%
Report for schema: Wide JSON Schema
Methods sorted by performance (fastest to slowest):
1. gpt-4o-mini-non-strict-tool (Wide JSON Schema), Average time: 2943.3405 ms, Success rate: 100.0000%, Cost: 0.0078
2. gpt-4o-2024-08-06-non-strict-tool (Wide JSON Schema), Average time: 3149.5288 ms, Success rate: 100.0000%, Cost: 0.1313
3. gpt-4o-mini-non-strict-json (Wide JSON Schema), Average time: 3603.0182 ms, Success rate: 100.0000%, Cost: 0.0115
4. gpt-4o-2024-08-06-strict-json (Wide JSON Schema), Average time: 3852.8140 ms, Success rate: 100.0000%, Cost: 0.1023
5. gpt-4o-2024-08-06-non-strict-json (Wide JSON Schema), Average time: 4065.5266 ms, Success rate: 100.0000%, Cost: 0.1964
6. gpt-4o-mini-strict-json (Wide JSON Schema), Average time: 4530.8732 ms, Success rate: 100.0000%, Cost: 0.0060
Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Wide JSON Schema), Cost: 0.0060, Average time: 4530.8732 ms, Success rate: 100.0000%
2. gpt-4o-mini-non-strict-tool (Wide JSON Schema), Cost: 0.0078, Average time: 2943.3405 ms, Success rate: 100.0000%
3. gpt-4o-mini-non-strict-json (Wide JSON Schema), Cost: 0.0115, Average time: 3603.0182 ms, Success rate: 100.0000%
4. gpt-4o-2024-08-06-strict-json (Wide JSON Schema), Cost: 0.1023, Average time: 3852.8140 ms, Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-tool (Wide JSON Schema), Cost: 0.1313, Average time: 3149.5288 ms, Success rate: 100.0000%
6. gpt-4o-2024-08-06-non-strict-json (Wide JSON Schema), Cost: 0.1964, Average time: 4065.5266 ms, Success rate: 100.0000%
Report for schema: Super Complex JSON Schema
Methods sorted by performance (fastest to slowest):
1. gpt-4o-2024-08-06-strict-json (Super Complex JSON Schema), Average time: 10743.0250 ms, Success rate: 100.0000%, Cost: 0.3004
2. gpt-4o-mini-non-strict-json (Super Complex JSON Schema), Average time: 12884.3888 ms, Success rate: 100.0000%, Cost: 0.0393
3. gpt-4o-2024-08-06-non-strict-json (Super Complex JSON Schema), Average time: 13041.2693 ms, Success rate: 100.0000%, Cost: 0.6619
4. gpt-4o-mini-strict-json (Super Complex JSON Schema), Average time: 13289.3869 ms, Success rate: 100.0000%, Cost: 0.0241
5. gpt-4o-2024-08-06-non-strict-tool (Super Complex JSON Schema), Average time: 0.0000 ms, Success rate: 0.0000%, Cost: 0.0000
6. gpt-4o-mini-non-strict-tool (Super Complex JSON Schema), Average time: 0.0000 ms, Success rate: 0.0000%, Cost: 0.0000
Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Super Complex JSON Schema), Cost: 0.0241, Average time: 13289.3869 ms, Success rate: 100.0000%
2. gpt-4o-mini-non-strict-json (Super Complex JSON Schema), Cost: 0.0393, Average time: 12884.3888 ms, Success rate: 100.0000%
3. gpt-4o-2024-08-06-strict-json (Super Complex JSON Schema), Cost: 0.3004, Average time: 10743.0250 ms, Success rate: 100.0000%
4. gpt-4o-2024-08-06-non-strict-json (Super Complex JSON Schema), Cost: 0.6619, Average time: 13041.2693 ms, Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-tool (Super Complex JSON Schema), Cost: 0.0000, Average time: 0.0000 ms, Success rate: 0.0000%
6. gpt-4o-mini-non-strict-tool (Super Complex JSON Schema), Cost: 0.0000, Average time: 0.0000 ms, Success rate: 0.0000%
Results
Complex JSON Schema
Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost ($) | Cost Diff ($) |
---|---|---|---|---|---|
gpt-4o-2024-08-06-non-strict-tool | 4079.0854 | 0 | 100.0000% | 0.1680 | +0.1534 |
gpt-4o-mini-non-strict-json | 5847.6183 | +1768.5329 | 100.0000% | 0.0175 | +0.0029 |
gpt-4o-2024-08-06-strict-json | 5866.2200 | +1787.1346 | 100.0000% | 0.1528 | +0.1382 |
gpt-4o-2024-08-06-non-strict-json | 6314.3933 | +2235.3079 | 100.0000% | 0.3026 | +0.2880 |
gpt-4o-mini-strict-json | 7858.5114 | +3779.4260 | 100.0000% | 0.0146 | 0 |
gpt-4o-mini-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
Wide JSON Schema
Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost ($) | Cost Diff ($) |
---|---|---|---|---|---|
gpt-4o-mini-non-strict-tool | 2943.3405 | 0 | 100.0000% | 0.0078 | +0.0018 |
gpt-4o-2024-08-06-non-strict-tool | 3149.5288 | +206.1883 | 100.0000% | 0.1313 | +0.1253 |
gpt-4o-mini-non-strict-json | 3603.0182 | +659.6777 | 100.0000% | 0.0115 | +0.0055 |
gpt-4o-2024-08-06-strict-json | 3852.8140 | +909.4735 | 100.0000% | 0.1023 | +0.0963 |
gpt-4o-2024-08-06-non-strict-json | 4065.5266 | +1122.1861 | 100.0000% | 0.1964 | +0.1904 |
gpt-4o-mini-strict-json | 4530.8732 | +1587.5327 | 100.0000% | 0.0060 | 0 |
Super Complex JSON Schema
Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost ($) | Cost Diff ($) |
---|---|---|---|---|---|
gpt-4o-2024-08-06-strict-json | 10743.0250 | 0 | 100.0000% | 0.3004 | +0.2763 |
gpt-4o-mini-non-strict-json | 12884.3888 | +2141.3638 | 100.0000% | 0.0393 | +0.0152 |
gpt-4o-2024-08-06-non-strict-json | 13041.2693 | +2298.2443 | 100.0000% | 0.6619 | +0.6378 |
gpt-4o-mini-strict-json | 13289.3869 | +2546.3619 | 100.0000% | 0.0241 | 0 |
gpt-4o-2024-08-06-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
gpt-4o-mini-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
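The Time Diff and Cost Diff columns are measured against the fastest and the cheapest successful method for each schema. A small sketch of that calculation, using the Wide JSON Schema rows as input (the function and field names are illustrative; failed N/A rows are assumed to be excluded before the call):

```typescript
interface ResultRow {
  method: string;
  avgTimeMs: number;
  cost: number;
}

interface RowWithDiffs extends ResultRow {
  timeDiffMs: number; // extra time compared with the fastest row
  costDiff: number;   // extra cost compared with the cheapest row
}

// Diffs are relative to the minimum time and minimum cost among the rows.
function addDiffs(rows: ResultRow[]): RowWithDiffs[] {
  const minTime = rows.reduce((m, r) => Math.min(m, r.avgTimeMs), Infinity);
  const minCost = rows.reduce((m, r) => Math.min(m, r.cost), Infinity);
  return rows.map((r) => ({
    method: r.method,
    avgTimeMs: r.avgTimeMs,
    cost: r.cost,
    timeDiffMs: r.avgTimeMs - minTime,
    costDiff: r.cost - minCost,
  }));
}

// Rows copied from the Wide JSON Schema table.
const wideRows: ResultRow[] = [
  { method: "gpt-4o-mini-non-strict-tool", avgTimeMs: 2943.3405, cost: 0.0078 },
  { method: "gpt-4o-2024-08-06-non-strict-tool", avgTimeMs: 3149.5288, cost: 0.1313 },
  { method: "gpt-4o-mini-non-strict-json", avgTimeMs: 3603.0182, cost: 0.0115 },
  { method: "gpt-4o-2024-08-06-strict-json", avgTimeMs: 3852.814, cost: 0.1023 },
  { method: "gpt-4o-2024-08-06-non-strict-json", avgTimeMs: 4065.5266, cost: 0.1964 },
  { method: "gpt-4o-mini-strict-json", avgTimeMs: 4530.8732, cost: 0.006 },
];

const wideWithDiffs = addDiffs(wideRows);
```

Rounding `timeDiffMs` and `costDiff` to four decimals gives the values shown in the table.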
Key Findings
Strict vs Non-Strict: A Performance Showdown
The benchmark shows how strict and non-strict modes compare across the three JSON schema complexities:
Wide JSON Schema
In the simplest scenario, the non-strict tool-call methods were the fastest:
- The fastest performer was gpt-4o-mini with tool calls (2943.3405 ms)
- gpt-4o-mini in Strict Mode was significantly slower (4530.8732 ms)
- gpt-4o-2024-08-06 in Strict Mode (3852.8140 ms) was outpaced by its non-strict tool-call counterpart (3149.5288 ms), though it was slightly faster than non-strict JSON mode (4065.5266 ms)
However, all configurations achieved a 100% success rate, indicating that strictness might be overkill for simple structures.
Complex JSON Schema
As complexity increased, reliability started to differ between methods:
- gpt-4o-2024-08-06 with tool calls led (4079.0854 ms)
- gpt-4o-2024-08-06 in Strict Mode followed at 5866.2200 ms, roughly 1.8 seconds behind
- Notably, gpt-4o-mini with tool calls failed completely, while its strict counterpart succeeded
This suggests that strictness becomes more valuable as JSON complexity increases, especially for less advanced models.
Super Complex JSON Schema
In the most challenging scenario, strict modes shone:
- gpt-4o-2024-08-06 in Strict Mode was the top performer (10743.0250 ms)
- Both non-strict tool methods failed entirely
- Strict JSON methods maintained 100% success rates
- Surprisingly, the smaller model in strict mode was slower than the bigger model in strict mode (13289.3869 ms vs 10743.0250 ms)
This underscores the critical importance of strictness in handling highly complex JSON structures.
The Cold Start in Strict Mode
As Ted Sanders mentioned in this HN comment, strict mode carries a significant cold start penalty, which goes away on subsequent runs:
The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don’t want that latency hit (e.g., you’re prototyping, or have a use case that uses variable one-off schemas), then you might prefer “strict”: false
How much slower is it? Here are my results:
Model | Schema | Avg First Request Time (ms) | Avg Second Request Time (ms) | Cold Start Penalty |
---|---|---|---|---|
gpt-4o-2024-08-06 | Wide JSON Schema | 20234.0549 | 5927.3556 | 241.37% |
gpt-4o-mini | Wide JSON Schema | 21801.5501 | 5800.8192 | 275.84% |
gpt-4o-2024-08-06 | Complex JSON Schema | 24089.9075 | 7100.4283 | 239.27% |
gpt-4o-mini | Complex JSON Schema | 26665.4039 | 10270.7880 | 159.62% |
gpt-4o-2024-08-06 | Super Complex JSON Schema | 60481.4465 | 11698.9430 | 416.98% |
gpt-4o-mini | Super Complex JSON Schema | 66011.3763 | 13994.1616 | 371.71% |
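In absolute terms, the more complex the JSON schema, the more painful the cold start becomes: the first request took roughly 20 to 22 seconds for the Wide schema and over 60 seconds for the Super Complex one.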
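The penalty column in the table is consistent with the relative slowdown of the first request compared with the second, that is (first - second) / second. This formula is inferred from the numbers rather than taken from the benchmark code, so treat the sketch below as an assumption; the function name is illustrative.

```typescript
// Relative cold start penalty in percent: how much slower the first
// request is compared with the second (warm) request.
function coldStartPenaltyPercent(firstMs: number, secondMs: number): number {
  if (secondMs <= 0) {
    throw new Error("secondMs must be positive");
  }
  return ((firstMs - secondMs) / secondMs) * 100;
}

// Values copied from the table above.
const widePenalty = coldStartPenaltyPercent(20234.0549, 5927.3556);
const superComplexPenalty = coldStartPenaltyPercent(60481.4465, 11698.943);
```

Rounded to two decimals, these match the 241.37% and 416.98% reported for gpt-4o-2024-08-06.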
Strictness and Cost: An Unexpected Relationship
Interestingly, the impact of strictness on cost varied between model versions:
- For the gpt-4o-mini model, strict mode was generally cheaper (e.g., $0.0060 vs $0.0115 for Wide JSON)
- For the gpt-4o-2024-08-06 model, strict mode was more cost-effective in all the cases
Caveats
While strict mode is superior for more complicated cases, it has limitations that might rule it out for your use case:
- Not all JSON Schema types & features are supported in strict mode. Things like allOf, oneOf, not, definitions, min, pattern, and most importantly format are not supported
- JSON can be nested only up to 5 levels and have up to 100 properties
- additionalProperties is not supported
- All fields are required (but you can make them optional by doing the "type": ["string", "null"] trick)
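Two of these caveats can be handled mechanically before a schema is sent. The sketch below works on a plain JSON Schema object under some assumptions of mine: `SchemaNode`, `toStrict` and `measureSchema` are hypothetical names, the way nesting depth is counted (one level per object with properties) may differ from how the API counts it, and setting `additionalProperties: false` on every object reflects my understanding of what strict mode expects.

```typescript
// A loose JSON Schema node; only the keywords used below are modeled.
interface SchemaNode {
  type?: string | string[];
  properties?: { [key: string]: SchemaNode };
  items?: SchemaNode;
  required?: string[];
  additionalProperties?: boolean;
  [keyword: string]: unknown;
}

// The "null trick": widen a type so that the field may be null.
function withNull(type: string | string[] | undefined): string | string[] | undefined {
  if (type === undefined) return undefined; // nothing to widen
  const types = Array.isArray(type) ? type.slice() : [type];
  if (types.indexOf("null") === -1) types.push("null");
  return types;
}

// Returns a copy in which every property is required; properties that
// were optional in the input become nullable instead.
function toStrict(node: SchemaNode): SchemaNode {
  const out: SchemaNode = {};
  Object.keys(node).forEach((k) => { out[k] = node[k]; });
  if (node.items) out.items = toStrict(node.items);
  const props = node.properties;
  if (props) {
    const wasRequired = node.required || [];
    const strictProps: { [key: string]: SchemaNode } = {};
    Object.keys(props).forEach((name) => {
      const original = props[name];
      if (!original) return;
      const child = toStrict(original);
      if (wasRequired.indexOf(name) === -1) child.type = withNull(child.type);
      strictProps[name] = child;
    });
    out.properties = strictProps;
    out.required = Object.keys(strictProps);
    out.additionalProperties = false;
  }
  return out;
}

// Counts properties and object nesting so they can be compared with the
// limits above (5 levels, 100 properties).
function measureSchema(node: SchemaNode, depth: number = 0): { maxDepth: number; propertyCount: number } {
  let maxDepth = depth;
  let propertyCount = 0;
  const props = node.properties;
  if (props) {
    maxDepth = depth + 1;
    Object.keys(props).forEach((name) => {
      const original = props[name];
      if (!original) return;
      const child = measureSchema(original, depth + 1);
      propertyCount += 1 + child.propertyCount;
      maxDepth = Math.max(maxDepth, child.maxDepth);
    });
  }
  if (node.items) {
    const child = measureSchema(node.items, depth);
    propertyCount += child.propertyCount;
    maxDepth = Math.max(maxDepth, child.maxDepth);
  }
  return { maxDepth, propertyCount };
}

const exampleSchema: SchemaNode = {
  type: "object",
  properties: {
    id: { type: "number" },
    nickname: { type: "string" }, // optional in the input schema
    address: { type: "object", properties: { city: { type: "string" } }, required: ["city"] },
  },
  required: ["id", "address"],
};

const strictSchema = toStrict(exampleSchema);
const schemaSize = measureSchema(exampleSchema);
```

Here `nickname` ends up required with the type `["string", "null"]`, and `schemaSize` reports 4 properties across 2 levels of nesting.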
My Practical Recommendations
Based on my findings, I recommend the following approaches:
- For Simple JSON Structures:
  - Prefer non-strict modes, especially tool-based methods for speed and cost-effectiveness
  - Go with the smaller mini model if you can (but don’t forget about potential failures; wrap calls in try/catch accordingly)
- For Moderately Complex JSON:
  - Use non-strict modes with more advanced models (e.g., gpt-4o-2024-08-06 with tool calls)
  - Always validate the output using Zod schemas
- For Highly Complex JSON:
  - Strict modes are essential for reliability and success
  - Use the most advanced model available (e.g., gpt-4o-2024-08-06 in Strict Mode)
  - Be prepared for very significant cold start penalties (they go away in subsequent runs and are negligible when running multiple times in a row)
- Cost Considerations:
  - For simpler tasks, non-strict modes are generally more cost-effective
  - For complex tasks, strict modes can be more cost-effective, especially with advanced models
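The "wrap in try/catch" and "always validate" advice can be combined into one small helper. This is a sketch under assumptions: `parseModelOutput` and `validateUser` are hypothetical names, and the hand-written validator stands in for a Zod schema (with Zod, the `validate` argument could be something like `(data) => schema.parse(data)`, which throws on invalid input).

```typescript
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Parses raw model output and validates it in one step. Both malformed
// JSON and schema violations end up in the same error branch.
function parseModelOutput<T>(raw: string, validate: (data: unknown) => T): ParseResult<T> {
  try {
    const data: unknown = JSON.parse(raw);
    return { ok: true, value: validate(data) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in validator for illustration; a Zod schema would replace this.
interface User {
  id: number;
  name: string;
}

function validateUser(data: unknown): User {
  if (typeof data !== "object" || data === null) {
    throw new Error("expected an object");
  }
  const record = data as { [key: string]: unknown };
  const id = record["id"];
  const name = record["name"];
  if (typeof id !== "number" || typeof name !== "string") {
    throw new Error("id must be a number and name must be a string");
  }
  return { id, name };
}
```

A caller can then retry or fall back to a stricter configuration whenever `ok` is false, instead of letting a bad response crash the pipeline.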