Rafal Wilinski

Benchmarking OpenAI LLMs for JSON Generation


llm, ai, json

Generating structured data from unstructured inputs is one of the core use cases for AI. A recent OpenAI release introduced the Structured Outputs feature, which lets you generate JSON that conforms to a given schema. While it promises guaranteed correctness, I was curious how well it performs and what the caveats are.

In this post, I benchmark two different OpenAI models across three different methods of JSON generation and analyze the results focusing on error rates, performance and costs.

Disclaimer: This is not a comprehensive benchmark, but rather a quick experiment aimed at answering a specific question. Most importantly, it does not measure the accuracy of the generated data, only how fast, how reliably, and how cheaply the LLMs produce it.

Experiment Setup

Task: Generate a JSON object adhering to the schema provided. Think of mocking data for a database.

I tested six LLM configurations across three JSON schema generation methods:

  1. gpt-4o-mini generating JSONs using Tool Calls
  2. gpt-4o-mini generating JSONs using JSON mode
  3. gpt-4o-mini generating JSONs using JSON mode with strict = True
  4. gpt-4o-2024-08-06 generating JSONs using Tool Calls
  5. gpt-4o-2024-08-06 generating JSONs using JSON mode
  6. gpt-4o-2024-08-06 generating JSONs using JSON mode with strict = True
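
To make the difference between the three methods concrete, here is a rough sketch of how the Chat Completions request body could look for each of them. This is my reading of the API, not the exact code from the repo: `buildRequestBody`, the `Method` labels (borrowed from the results log), the tool name `generate_record`, and the schema name `record` are all hypothetical, and I'm assuming the non-strict JSON variant maps to plain JSON mode (`json_object`).

```typescript
// Sketch only: request bodies for the three JSON generation methods.
// All names here (buildRequestBody, "generate_record", "record") are
// illustrative assumptions, not taken from the benchmark repo.
type JsonSchema = Record<string, unknown>;
type Method = "non-strict-tool" | "non-strict-json" | "strict-json";

function buildRequestBody(
  model: string,
  method: Method,
  schema: JsonSchema,
  prompt: string,
): Record<string, unknown> {
  const base = { model, messages: [{ role: "user", content: prompt }] };

  switch (method) {
    case "non-strict-tool":
      // The schema is passed as the parameters of a function the model must call.
      return {
        ...base,
        tools: [
          { type: "function", function: { name: "generate_record", parameters: schema } },
        ],
        tool_choice: { type: "function", function: { name: "generate_record" } },
      };
    case "non-strict-json":
      // JSON mode only guarantees syntactically valid JSON, so the schema
      // has to be described in the prompt itself.
      return {
        ...base,
        messages: [
          {
            role: "system",
            content: `Respond with JSON matching this schema: ${JSON.stringify(schema)}`,
          },
          ...base.messages,
        ],
        response_format: { type: "json_object" },
      };
    case "strict-json":
      // Structured Outputs: the schema is enforced during decoding.
      return {
        ...base,
        response_format: {
          type: "json_schema",
          json_schema: { name: "record", strict: true, schema },
        },
      };
  }
}
```

The only thing that changes between methods is where the schema lives: in a tool definition, in the prompt, or in `response_format`.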

And three JSON schema complexities:

  1. Wide JSON: 25 fields with 1 level of nesting
  2. Complex JSON: 25 fields with 5 levels of nesting
  3. Super Complex JSON: 100 fields with 5 levels of nesting
For reference, here is the Super Complex JSON schema:
import { z } from "zod";

export const superComplexJsonSchema = z.object({
  id: z.number().int(),
  name: z.string(),
  details: z.object({
    personalInfo: z.object({
      age: z.number().int(),
      dateOfBirth: z.string(),
      nationality: z.string(),
      maritalStatus: z.enum(["single", "married", "divorced", "widowed"]),
      contact: z.object({
        email: z.string(),
        phone: z.string(),
        alternativePhone: z.string(),
        address: z.object({
          street: z.string(),
          city: z.string(),
          state: z.string(),
          country: z.string(),
          postalCode: z.string(),
          latitude: z.number(),
          longitude: z.number(),
        }),
      }),
    }),
    professionalInfo: z.object({
      occupation: z.string(),
      currentEmployer: z.string(),
      yearsOfExperience: z.number().int(),
      education: z.array(
        z.object({
          institution: z.string(),
          degree: z.string(),
          fieldOfStudy: z.string(),
          graduationYear: z.number().int(),
          gpa: z.number(),
        }),
      ),
      certifications: z.array(
        z.object({
          name: z.string(),
          issuingOrganization: z.string(),
          dateObtained: z.string(),
          expirationDate: z.string(),
        }),
      ),
      experience: z.array(
        z.object({
          company: z.string(),
          position: z.string(),
          startDate: z.string(),
          endDate: z.string(),
          isCurrent: z.boolean(),
          responsibilities: z.array(z.string()),
          skills: z.array(z.string()),
          reportsTo: z.string(),
        }),
      ),
      skills: z.array(
        z.object({
          name: z.string(),
          category: z.string(),
          level: z.enum(["beginner", "intermediate", "advanced", "expert"]),
          yearsOfExperience: z.number(),
        }),
      ),
      languages: z.array(
        z.object({
          language: z.string(),
          proficiency: z.enum(["basic", "conversational", "fluent", "native"]),
          certifications: z.array(z.string()),
        }),
      ),
    }),
  }),
  preferences: z.object({
    favoriteColors: z.array(z.string()),
    hobbies: z.array(
      z.object({
        name: z.string(),
        category: z.string(),
        frequency: z.string(),
        yearsOfExperience: z.number(),
        relatedSkills: z.array(z.string()),
      }),
    ),
    travelPreferences: z.object({
      accommodationType: z.enum(["hotel", "hostel", "airbnb", "camping"]),
      budgetPerDay: z.number(),
      preferredTransportation: z.array(z.string()),
    }),
    dietaryRestrictions: z.array(z.string()),
    workPreferences: z.object({
      preferredWorkEnvironment: z.enum(["office", "remote", "hybrid"]),
      desiredSalary: z.number(),
      willingToRelocate: z.boolean(),
      preferredIndustries: z.array(z.string()),
    }),
  }),
  financialInfo: z.object({
    income: z.number(),
    expenses: z.number(),
    assets: z.array(
      z.object({
        type: z.string(),
        value: z.number(),
        purchaseDate: z.string(),
      }),
    ),
    liabilities: z.array(
      z.object({
        type: z.string(),
        amount: z.number(),
        interestRate: z.number(),
      }),
    ),
  }),
  healthInfo: z.object({
    height: z.number(),
    weight: z.number(),
    medications: z.array(
      z.object({
        name: z.string(),
        dosage: z.string(),
        frequency: z.string(),
      }),
    ),
    chronicConditions: z.array(z.string()),
    lastCheckup: z.string(),
  }),
  metadata: z.object({
    createdAt: z.string(),
    lastUpdated: z.string(),
    dataSource: z.string(),
    accessLevel: z.enum(["public", "private", "restricted"]),
    tags: z.array(z.string()),
  }),
});

Each configuration was tested 50 times, and the average time taken was recorded. The full code used to run the experiment can be found in the GitHub repo.
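
The measurement loop itself can be sketched in a few lines. This is a minimal illustration, not the code from the repo: `benchmark`, `summarize`, and `RunResult` are names I made up, and I'm assuming that failed runs are excluded from the average and that a configuration with no successful runs reports 0 ms (which is what the log below appears to show).

```typescript
// Sketch of a benchmark loop: run a task N times, record timing and success.
// Assumption: the average covers successful runs only.
type RunResult = { ok: boolean; ms: number };
type Summary = { averageMs: number; successRate: number };

function summarize(runs: RunResult[]): Summary {
  const successes = runs.filter((run) => run.ok);
  const totalMs = successes.reduce((sum, run) => sum + run.ms, 0);
  return {
    // With zero successes there is nothing to average, so report 0.
    averageMs: successes.length > 0 ? totalMs / successes.length : 0,
    successRate: runs.length > 0 ? (successes.length / runs.length) * 100 : 0,
  };
}

async function benchmark(
  task: () => Promise<unknown>,
  iterations = 50,
): Promise<Summary> {
  const runs: RunResult[] = [];
  for (let i = 0; i < iterations; i++) {
    const start = Date.now();
    try {
      await task(); // e.g. one API call plus parsing of the response
      runs.push({ ok: true, ms: Date.now() - start });
    } catch {
      runs.push({ ok: false, ms: Date.now() - start });
    }
  }
  return summarize(runs);
}
```

Running the iterations sequentially keeps the timings comparable, at the price of a slow benchmark.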

Full results log:
Report for schema: Complex JSON Schema
Methods sorted by performance (fastest to slowest):
1. gpt-4o-2024-08-06-non-strict-tool (Complex JSON Schema)
Average time: 4079.0854 ms
Success rate: 100.0000%
Cost: 0.1680
2. gpt-4o-mini-non-strict-json (Complex JSON Schema)
Average time: 5847.6183 ms
Success rate: 100.0000%
Cost: 0.0175
3. gpt-4o-2024-08-06-strict-json (Complex JSON Schema)
Average time: 5866.2200 ms
Success rate: 100.0000%
Cost: 0.1528
4. gpt-4o-2024-08-06-non-strict-json (Complex JSON Schema)
Average time: 6314.3933 ms
Success rate: 100.0000%
Cost: 0.3026
5. gpt-4o-mini-strict-json (Complex JSON Schema)
Average time: 7858.5114 ms
Success rate: 100.0000%
Cost: 0.0146
6. gpt-4o-mini-non-strict-tool (Complex JSON Schema)
Average time: 0.0000 ms
Success rate: 0.0000%
Cost: 0.0000
Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Complex JSON Schema)
Cost: 0.0146
Average time: 7858.5114 ms
Success rate: 100.0000%
2. gpt-4o-mini-non-strict-json (Complex JSON Schema)
Cost: 0.0175
Average time: 5847.6183 ms
Success rate: 100.0000%
3. gpt-4o-2024-08-06-strict-json (Complex JSON Schema)
Cost: 0.1528
Average time: 5866.2200 ms
Success rate: 100.0000%
4. gpt-4o-2024-08-06-non-strict-tool (Complex JSON Schema)
Cost: 0.1680
Average time: 4079.0854 ms
Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-json (Complex JSON Schema)
Cost: 0.3026
Average time: 6314.3933 ms
Success rate: 100.0000%
6. gpt-4o-mini-non-strict-tool (Complex JSON Schema)
Cost: 0.0000
Average time: 0.0000 ms
Success rate: 0.0000%
Report for schema: Wide JSON Schema
Methods sorted by performance (fastest to slowest):
1. gpt-4o-mini-non-strict-tool (Wide JSON Schema)
Average time: 2943.3405 ms
Success rate: 100.0000%
Cost: 0.0078
2. gpt-4o-2024-08-06-non-strict-tool (Wide JSON Schema)
Average time: 3149.5288 ms
Success rate: 100.0000%
Cost: 0.1313
3. gpt-4o-mini-non-strict-json (Wide JSON Schema)
Average time: 3603.0182 ms
Success rate: 100.0000%
Cost: 0.0115
4. gpt-4o-2024-08-06-strict-json (Wide JSON Schema)
Average time: 3852.8140 ms
Success rate: 100.0000%
Cost: 0.1023
5. gpt-4o-2024-08-06-non-strict-json (Wide JSON Schema)
Average time: 4065.5266 ms
Success rate: 100.0000%
Cost: 0.1964
6. gpt-4o-mini-strict-json (Wide JSON Schema)
Average time: 4530.8732 ms
Success rate: 100.0000%
Cost: 0.0060
Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Wide JSON Schema)
Cost: 0.0060
Average time: 4530.8732 ms
Success rate: 100.0000%
2. gpt-4o-mini-non-strict-tool (Wide JSON Schema)
Cost: 0.0078
Average time: 2943.3405 ms
Success rate: 100.0000%
3. gpt-4o-mini-non-strict-json (Wide JSON Schema)
Cost: 0.0115
Average time: 3603.0182 ms
Success rate: 100.0000%
4. gpt-4o-2024-08-06-strict-json (Wide JSON Schema)
Cost: 0.1023
Average time: 3852.8140 ms
Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-tool (Wide JSON Schema)
Cost: 0.1313
Average time: 3149.5288 ms
Success rate: 100.0000%
6. gpt-4o-2024-08-06-non-strict-json (Wide JSON Schema)
Cost: 0.1964
Average time: 4065.5266 ms
Success rate: 100.0000%
Report for schema: Super Complex JSON Schema
Methods sorted by performance (fastest to slowest):
1. gpt-4o-2024-08-06-strict-json (Super Complex JSON Schema)
Average time: 10743.0250 ms
Success rate: 100.0000%
Cost: 0.3004
2. gpt-4o-mini-non-strict-json (Super Complex JSON Schema)
Average time: 12884.3888 ms
Success rate: 100.0000%
Cost: 0.0393
3. gpt-4o-2024-08-06-non-strict-json (Super Complex JSON Schema)
Average time: 13041.2693 ms
Success rate: 100.0000%
Cost: 0.6619
4. gpt-4o-mini-strict-json (Super Complex JSON Schema)
Average time: 13289.3869 ms
Success rate: 100.0000%
Cost: 0.0241
5. gpt-4o-2024-08-06-non-strict-tool (Super Complex JSON Schema)
Average time: 0.0000 ms
Success rate: 0.0000%
Cost: 0.0000
6. gpt-4o-mini-non-strict-tool (Super Complex JSON Schema)
Average time: 0.0000 ms
Success rate: 0.0000%
Cost: 0.0000
Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Super Complex JSON Schema)
Cost: 0.0241
Average time: 13289.3869 ms
Success rate: 100.0000%
2. gpt-4o-mini-non-strict-json (Super Complex JSON Schema)
Cost: 0.0393
Average time: 12884.3888 ms
Success rate: 100.0000%
3. gpt-4o-2024-08-06-strict-json (Super Complex JSON Schema)
Cost: 0.3004
Average time: 10743.0250 ms
Success rate: 100.0000%
4. gpt-4o-2024-08-06-non-strict-json (Super Complex JSON Schema)
Cost: 0.6619
Average time: 13041.2693 ms
Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-tool (Super Complex JSON Schema)
Cost: 0.0000
Average time: 0.0000 ms
Success rate: 0.0000%
6. gpt-4o-mini-non-strict-tool (Super Complex JSON Schema)
Cost: 0.0000
Average time: 0.0000 ms
Success rate: 0.0000%

Results

Complex JSON Schema

| Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost ($) | Cost Diff ($) |
| --- | --- | --- | --- | --- | --- |
| gpt-4o-2024-08-06-non-strict-tool | 4079.0854 | 0 | 100.0000% | 0.1680 | +0.1534 |
| gpt-4o-mini-non-strict-json | 5847.6183 | +1768.5329 | 100.0000% | 0.0175 | +0.0029 |
| gpt-4o-2024-08-06-strict-json | 5866.2200 | +1787.1346 | 100.0000% | 0.1528 | +0.1382 |
| gpt-4o-2024-08-06-non-strict-json | 6314.3933 | +2235.3079 | 100.0000% | 0.3026 | +0.2880 |
| gpt-4o-mini-strict-json | 7858.5114 | +3779.4260 | 100.0000% | 0.0146 | 0 |
| gpt-4o-mini-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
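
The two "Diff" columns are measured against the best value in each column: Time Diff is the gap to the fastest method, and Cost Diff is the gap to the cheapest one. A small sketch of that arithmetic, using rows from the Complex JSON results (`Row` and `withDiffs` are hypothetical names for this example):

```typescript
// Sketch: derive the "Time Diff" and "Cost Diff" columns from raw results.
type Row = { method: string; avgMs: number; cost: number };

function withDiffs(rows: Row[]) {
  const fastest = Math.min(...rows.map((row) => row.avgMs));
  const cheapest = Math.min(...rows.map((row) => row.cost));
  return rows.map((row) => ({
    ...row,
    timeDiff: row.avgMs - fastest, // 0 for the fastest method
    costDiff: row.cost - cheapest, // 0 for the cheapest method
  }));
}

// Three of the successful Complex JSON rows from the results above.
const complexRows: Row[] = [
  { method: "gpt-4o-2024-08-06-non-strict-tool", avgMs: 4079.0854, cost: 0.168 },
  { method: "gpt-4o-mini-non-strict-json", avgMs: 5847.6183, cost: 0.0175 },
  { method: "gpt-4o-mini-strict-json", avgMs: 7858.5114, cost: 0.0146 },
];

for (const row of withDiffs(complexRows)) {
  console.log(row.method, row.timeDiff.toFixed(4), row.costDiff.toFixed(4));
}
```

Note that the fastest and the cheapest method are usually not the same row, which is why each table has exactly one zero per Diff column.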

Wide JSON Schema

| Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost ($) | Cost Diff ($) |
| --- | --- | --- | --- | --- | --- |
| gpt-4o-mini-non-strict-tool | 2943.3405 | 0 | 100.0000% | 0.0078 | +0.0018 |
| gpt-4o-2024-08-06-non-strict-tool | 3149.5288 | +206.1883 | 100.0000% | 0.1313 | +0.1253 |
| gpt-4o-mini-non-strict-json | 3603.0182 | +659.6777 | 100.0000% | 0.0115 | +0.0055 |
| gpt-4o-2024-08-06-strict-json | 3852.8140 | +909.4735 | 100.0000% | 0.1023 | +0.0963 |
| gpt-4o-2024-08-06-non-strict-json | 4065.5266 | +1122.1861 | 100.0000% | 0.1964 | +0.1904 |
| gpt-4o-mini-strict-json | 4530.8732 | +1587.5327 | 100.0000% | 0.0060 | 0 |

Super Complex JSON Schema

| Method | Avg Time (ms) | Time Diff (ms) | Success Rate | Cost ($) | Cost Diff ($) |
| --- | --- | --- | --- | --- | --- |
| gpt-4o-2024-08-06-strict-json | 10743.0250 | 0 | 100.0000% | 0.3004 | +0.2763 |
| gpt-4o-mini-non-strict-json | 12884.3888 | +2141.3638 | 100.0000% | 0.0393 | +0.0152 |
| gpt-4o-2024-08-06-non-strict-json | 13041.2693 | +2298.2443 | 100.0000% | 0.6619 | +0.6378 |
| gpt-4o-mini-strict-json | 13289.3869 | +2546.3619 | 100.0000% | 0.0241 | 0 |
| gpt-4o-2024-08-06-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |
| gpt-4o-mini-non-strict-tool | N/A | N/A | 0.0000% | N/A | N/A |

Key Findings

Strict vs Non-Strict: A Performance Showdown

The benchmark revealed some interesting performance differences between strict and non-strict modes across the three JSON complexities:

Wide JSON Schema

In the simplest scenario, non-strict modes generally outperformed their strict counterparts:

  • The fastest performer was gpt-4o-mini with tool calls (2943.3405 ms)
  • gpt-4o-mini in Strict Mode was significantly slower (4530.8732 ms)
  • gpt-4o-2024-08-06 in Strict Mode (3852.8140 ms) was slower than its non-strict tool-call counterpart (3149.5288 ms), but slightly faster than non-strict JSON generation (4065.5266 ms)

However, all configurations achieved a 100% success rate, indicating that strictness might be overkill for simple structures.

Complex JSON Schema

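As complexity increased, tool calls with the larger model stayed fastest, but reliability started to diverge:

  • gpt-4o-2024-08-06 with tool calls led (4079.0854 ms)
  • gpt-4o-2024-08-06 in Strict Mode came third (5866.2200 ms), about 1.8 s behind the leader and just after gpt-4o-mini in non-strict JSON mode (5847.6183 ms)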
  • Notably, gpt-4o-mini with tool calls failed completely, while its strict counterpart succeeded

This suggests that strictness becomes more valuable as JSON complexity increases, especially for less advanced models.

Super Complex JSON Schema

In the most challenging scenario, strict modes shone:

  • gpt-4o-2024-08-06 in Strict Mode was the top performer
  • Both non-strict tool methods failed entirely
  • Strict JSON methods maintained 100% success rates
  • Surprisingly, the smaller model in strict mode (13289.3869 ms) was slower than the bigger model in strict mode (10743.0250 ms)

This underscores the critical importance of strictness in handling highly complex JSON structures.

The Cold Start in Strict Mode

As Ted Sanders mentioned in this HN comment, strict mode carries a significant cold start penalty that goes away on subsequent runs:

The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don’t want that latency hit (e.g., you’re prototyping, or have a use case that uses variable one-off schemas), then you might prefer “strict”: false

How much slower is it? Here are my results:

| Model | Schema | avgFirstRequestTime (ms) | avgSecondRequestTime (ms) | coldStartPenalty |
| --- | --- | --- | --- | --- |
| gpt-4o-2024-08-06 | Wide JSON Schema | 20234.0549 | 5927.3556 | 241.37% |
| gpt-4o-mini | Wide JSON Schema | 21801.5501 | 5800.8192 | 275.84% |
| gpt-4o-2024-08-06 | Complex JSON Schema | 24089.9075 | 7100.4283 | 239.27% |
| gpt-4o-mini | Complex JSON Schema | 26665.4039 | 10270.7880 | 159.62% |
| gpt-4o-2024-08-06 | Super Complex JSON Schema | 60481.4465 | 11698.9430 | 416.98% |
| gpt-4o-mini | Super Complex JSON Schema | 66011.3763 | 13994.1616 | 371.71% |

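In absolute terms, the more complex the JSON schema, the more painful the cold start becomes: the first request took roughly 20 to 22 s for Wide, 24 to 27 s for Complex, and 60 to 66 s for Super Complex JSON. The relative penalty is less regular, since the Complex schema shows a smaller percentage than the Wide one.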
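
For clarity on how to read the coldStartPenalty column: the percentages are consistent with the extra time of the first request relative to the second one. The formula below is inferred from the numbers in the table, and `coldStartPenalty` is just a name I picked for this sketch.

```typescript
// Sketch: cold start penalty as the extra time of the first request,
// expressed as a percentage of the (warm) second request.
function coldStartPenalty(firstMs: number, secondMs: number): number {
  return ((firstMs - secondMs) / secondMs) * 100;
}

// gpt-4o-2024-08-06 on the Wide JSON Schema
console.log(coldStartPenalty(20234.0549, 5927.3556).toFixed(2) + "%");
// gpt-4o-2024-08-06 on the Super Complex JSON Schema
console.log(coldStartPenalty(60481.4465, 11698.943).toFixed(2) + "%");
```

So a penalty of about 241% means the first request took roughly 3.4 times as long as the second one, not 2.4 times.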

Strictness and Cost: An Unexpected Relationship

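Interestingly, strict mode turned out to be the cheapest option for both models on every schema I tested:

  • For the gpt-4o-mini model, strict mode was cheaper in all three cases (e.g., $0.0060 vs $0.0115 for non-strict JSON mode on Wide JSON)
  • For the 2024-08-06 model, strict mode was also the most cost-effective in all the cases (e.g., $0.3004 vs $0.6619 for non-strict JSON mode on Super Complex JSON)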

Caveats

While strict mode is superior for more complicated cases, it has limitations that might disqualify it for your use case:

  • Not all JSON Schema types and features are supported in strict mode. Things like allOf, oneOf, not, definitions, min, pattern, and most importantly format are not supported
  • JSON can be nested only up to 5 levels and have up to 100 properties
  • additionalProperties is not supported (every object has to set it to false)
  • All fields are required (but you can emulate optional fields with the "type": ["string", "null"] trick)
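
To get a feel for these constraints before sending a schema to the API, you could lint it locally. The sketch below is a rough, unofficial check based only on the caveats listed above: `findStrictModeIssues` is a hypothetical helper, the keyword list is a partial subset, and the way I count nesting depth (object levels, starting at 1 for the root) is my assumption rather than OpenAI's documented definition.

```typescript
// Sketch of a local pre-flight check for strict mode. Not an official validator.
type JsonSchema = {
  type?: string | string[];
  properties?: Record<string, JsonSchema>;
  items?: JsonSchema;
  required?: string[];
  additionalProperties?: boolean;
  [keyword: string]: unknown;
};

const UNSUPPORTED_KEYWORDS = ["allOf", "oneOf", "not", "pattern", "format"]; // partial list
const MAX_DEPTH = 5;
const MAX_PROPERTIES = 100;

function findStrictModeIssues(root: JsonSchema): string[] {
  const issues: string[] = [];
  let propertyCount = 0;

  const visit = (node: JsonSchema, path: string, depth: number): void => {
    for (const keyword of UNSUPPORTED_KEYWORDS) {
      if (keyword in node) issues.push(`${path}: "${keyword}" is not supported`);
    }
    if (node.properties) {
      if (depth > MAX_DEPTH) issues.push(`${path}: nested deeper than ${MAX_DEPTH} levels`);
      if (node.additionalProperties !== false) {
        issues.push(`${path}: additionalProperties must be false`);
      }
      const required = new Set(node.required ?? []);
      for (const [name, child] of Object.entries(node.properties)) {
        propertyCount += 1;
        if (!required.has(name)) issues.push(`${path}.${name}: must be listed in "required"`);
        visit(child, `${path}.${name}`, depth + 1);
      }
    }
    // Arrays do not add an object level in this sketch.
    if (node.items) visit(node.items, `${path}[]`, depth);
  };

  visit(root, "$", 1);
  if (propertyCount > MAX_PROPERTIES) {
    issues.push(`$: ${propertyCount} properties exceed the limit of ${MAX_PROPERTIES}`);
  }
  return issues;
}

// "Optional" nickname expressed with the null-union trick: required, but nullable.
const personSchema: JsonSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    nickname: { type: ["string", "null"] },
  },
  required: ["name", "nickname"],
  additionalProperties: false,
};
console.log(findStrictModeIssues(personSchema)); // no issues expected for this schema
```

If you define schemas with Zod, as in this post, the same check would apply to whatever JSON Schema you end up sending in the request.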

My Practical Recommendations

Based on my findings, I recommend the following approaches:

  1. For Simple JSON Structures:

    • Prefer non-strict modes, especially tool-based methods, for speed
    • Go with the smaller mini model if you can (but don’t forget about potential failures and wrap calls in try/catch accordingly)
  2. For Moderately Complex JSON:

    • Use non-strict modes with more advanced models (e.g., gpt-4o-2024-08-06 with tool calls)
    • Always validate the output using Zod schemas
  3. For Highly Complex JSON:

    • Strict modes are essential for reliability and success
    • Use the most advanced model available (e.g., gpt-4o-2024-08-06 in Strict Mode)
    • Be prepared for very significant cold start penalties (they go away on subsequent runs and are negligible when running multiple times in a row)
  4. Cost Considerations:

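    • For simpler tasks, strict mode had the lowest cost in my runs, but the absolute difference is small (e.g., $0.0060 vs $0.0078 for gpt-4o-mini on Wide JSON) and non-strict tool calls were faster
    • For complex tasks, strict modes can be markedly more cost-effective, especially with advanced models (e.g., $0.3004 vs $0.6619 for gpt-4o-2024-08-06 on Super Complex JSON)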
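
The "wrap in try/catch" and "always validate" advice can be combined into one small helper. Here is a dependency-free sketch: `parseModelOutput`, `Validator`, and `isPerson` are hypothetical names, and the hand-written type guard stands in for a real schema check. With Zod, something like `schema.safeParse(parsed).success` could play the role of `isValid`.

```typescript
// Sketch: never trust raw model output in non-strict modes.
// Parse defensively, then validate the shape before using it.
type Validator<T> = (value: unknown) => value is T;

function parseModelOutput<T>(raw: string, isValid: Validator<T>): T | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isValid(parsed) ? parsed : null; // valid JSON, but possibly the wrong shape
  } catch {
    return null; // not valid JSON at all
  }
}

// Hand-written guard for a tiny example shape.
type Person = { id: number; name: string };
const isPerson: Validator<Person> = (value): value is Person =>
  typeof value === "object" &&
  value !== null &&
  typeof (value as Record<string, unknown>).id === "number" &&
  typeof (value as Record<string, unknown>).name === "string";

const person = parseModelOutput('{"id": 1, "name": "Ada"}', isPerson);
console.log(person?.name);
```

Returning `null` instead of throwing makes it easy to count failures or retry the request, which is handy when the smaller model occasionally produces output that doesn't match the schema.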
