Benchmarking OpenAI LLMs for JSON Generation • rwilinski.ai

Generating structured data from unstructured inputs is one of the core use cases for AI. Recent release from OpenAI has introduced a new Structured Outputs feature that allows you to generate JSONs with a given schema. While it promises guarantees of correctness, I was curious to see how well it performs and what are the caveats.

In this post, I benchmark two different OpenAI models across three different methods of JSON generation and analyze the results focusing on error rates, performance and costs.

Disclaimer: This is not a comprehensive benchmark, but rather a quick experiment that aimed to answer a specific question. Most importantly, it’s not focusing on the accuracy of the generated data, but rather on the performance of the LLMs.

Experiment Setup

Task: Generate a JSON object adhering to the schema provided. Think of mocking data for a database.

I tested six LLM configurations across three JSON schema generation methods:

gpt-4o-mini generating JSONs using Tool Calls
gpt-4o-mini generating JSONs using JSON mode
gpt-4o-mini generating JSONs using JSON mode using strict = True
gpt-4o-2024-08-06 generating JSONs using Tool Calls
gpt-4o-2024-08-06 generating JSONs using JSON mode
gpt-4o-2024-08-06 generating JSONs using JSON mode using strict = True

And three JSON schema complexities:

Wide JSON: 25 fields with 1 level of nesting
Complex JSON: 25 fields with 5 levels of nesting
Super Complex JSON: 100 fields with 5 levels of nesting

I wanna see the Super Complex JSON!

export const superComplexJsonSchema = z.object({
  id: z.number().int(),
  name: z.string(),
  details: z.object({
    personalInfo: z.object({
      age: z.number().int(),
      dateOfBirth: z.string(),
      nationality: z.string(),
      maritalStatus: z.enum(["single", "married", "divorced", "widowed"]),
      contact: z.object({
        email: z.string(),
        phone: z.string(),
        alternativePhone: z.string(),
        address: z.object({
          street: z.string(),
          city: z.string(),
          state: z.string(),
          country: z.string(),
          postalCode: z.string(),
          latitude: z.number(),
          longitude: z.number(),
        }),
      }),
    }),
    professionalInfo: z.object({
      occupation: z.string(),
      currentEmployer: z.string(),
      yearsOfExperience: z.number().int(),
      education: z.array(
        z.object({
          institution: z.string(),
          degree: z.string(),
          fieldOfStudy: z.string(),
          graduationYear: z.number().int(),
          gpa: z.number(),
        }),
      ),
      certifications: z.array(
        z.object({
          name: z.string(),
          issuingOrganization: z.string(),
          dateObtained: z.string(),
          expirationDate: z.string(),
        }),
      ),
      experience: z.array(
        z.object({
          company: z.string(),
          position: z.string(),
          startDate: z.string(),
          endDate: z.string(),
          isCurrent: z.boolean(),
          responsibilities: z.array(z.string()),
          skills: z.array(z.string()),
          reportsTo: z.string(),
        }),
      ),
      skills: z.array(
        z.object({
          name: z.string(),
          category: z.string(),
          level: z.enum(["beginner", "intermediate", "advanced", "expert"]),
          yearsOfExperience: z.number(),
        }),
      ),
      languages: z.array(
        z.object({
          language: z.string(),
          proficiency: z.enum(["basic", "conversational", "fluent", "native"]),
          certifications: z.array(z.string()),
        }),
      ),
    }),
  }),
  preferences: z.object({
    favoriteColors: z.array(z.string()),
    hobbies: z.array(
      z.object({
        name: z.string(),
        category: z.string(),
        frequency: z.string(),
        yearsOfExperience: z.number(),
        relatedSkills: z.array(z.string()),
      }),
    ),
    travelPreferences: z.object({
      accommodationType: z.enum(["hotel", "hostel", "airbnb", "camping"]),
      budgetPerDay: z.number(),
      preferredTransportation: z.array(z.string()),
    }),
    dietaryRestrictions: z.array(z.string()),
    workPreferences: z.object({
      preferredWorkEnvironment: z.enum(["office", "remote", "hybrid"]),
      desiredSalary: z.number(),
      willingToRelocate: z.boolean(),
      preferredIndustries: z.array(z.string()),
    }),
  }),
  financialInfo: z.object({
    income: z.number(),
    expenses: z.number(),
    assets: z.array(
      z.object({
        type: z.string(),
        value: z.number(),
        purchaseDate: z.string(),
      }),
    ),
    liabilities: z.array(
      z.object({
        type: z.string(),
        amount: z.number(),
        interestRate: z.number(),
      }),
    ),
  }),
  healthInfo: z.object({
    height: z.number(),
    weight: z.number(),
    medications: z.array(
      z.object({
        name: z.string(),
        dosage: z.string(),
        frequency: z.string(),
      }),
    ),
    chronicConditions: z.array(z.string()),
    lastCheckup: z.string(),
  }),
  metadata: z.object({
    createdAt: z.string(),
    lastUpdated: z.string(),
    dataSource: z.string(),
    accessLevel: z.enum(["public", "private", "restricted"]),
    tags: z.array(z.string()),
  }),
});

Each configuration was tested 50 times and the average time taken was recorded. Full code used to run the experiment can be found in the GitHub repo.

Full results log:

Report for schema: Complex JSON Schema

Methods sorted by performance (fastest to slowest):
1. gpt-4o-2024-08-06-non-strict-tool (Complex JSON Schema)
   Average time: 4079.0854 ms
   Success rate: 100.0000%
   Cost: 0.1680
2. gpt-4o-mini-non-strict-json (Complex JSON Schema)
   Average time: 5847.6183 ms
   Success rate: 100.0000%
   Cost: 0.0175
3. gpt-4o-2024-08-06-strict-json (Complex JSON Schema)
   Average time: 5866.2200 ms
   Success rate: 100.0000%
   Cost: 0.1528
4. gpt-4o-2024-08-06-non-strict-json (Complex JSON Schema)
   Average time: 6314.3933 ms
   Success rate: 100.0000%
   Cost: 0.3026
5. gpt-4o-mini-strict-json (Complex JSON Schema)
   Average time: 7858.5114 ms
   Success rate: 100.0000%
   Cost: 0.0146
6. gpt-4o-mini-non-strict-tool (Complex JSON Schema)
   Average time: 0.0000 ms
   Success rate: 0.0000%
   Cost: 0.0000

Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Complex JSON Schema)
   Cost: 0.0146
   Average time: 7858.5114 ms
   Success rate: 100.0000%
2. gpt-4o-mini-non-strict-json (Complex JSON Schema)
   Cost: 0.0175
   Average time: 5847.6183 ms
   Success rate: 100.0000%
3. gpt-4o-2024-08-06-strict-json (Complex JSON Schema)
   Cost: 0.1528
   Average time: 5866.2200 ms
   Success rate: 100.0000%
4. gpt-4o-2024-08-06-non-strict-tool (Complex JSON Schema)
   Cost: 0.1680
   Average time: 4079.0854 ms
   Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-json (Complex JSON Schema)
   Cost: 0.3026
   Average time: 6314.3933 ms
   Success rate: 100.0000%
6. gpt-4o-mini-non-strict-tool (Complex JSON Schema)
   Cost: 0.0000
   Average time: 0.0000 ms
   Success rate: 0.0000%

Report for schema: Wide JSON Schema

Methods sorted by performance (fastest to slowest):
1. gpt-4o-mini-non-strict-tool (Wide JSON Schema)
   Average time: 2943.3405 ms
   Success rate: 100.0000%
   Cost: 0.0078
2. gpt-4o-2024-08-06-non-strict-tool (Wide JSON Schema)
   Average time: 3149.5288 ms
   Success rate: 100.0000%
   Cost: 0.1313
3. gpt-4o-mini-non-strict-json (Wide JSON Schema)
   Average time: 3603.0182 ms
   Success rate: 100.0000%
   Cost: 0.0115
4. gpt-4o-2024-08-06-strict-json (Wide JSON Schema)
   Average time: 3852.8140 ms
   Success rate: 100.0000%
   Cost: 0.1023
5. gpt-4o-2024-08-06-non-strict-json (Wide JSON Schema)
   Average time: 4065.5266 ms
   Success rate: 100.0000%
   Cost: 0.1964
6. gpt-4o-mini-strict-json (Wide JSON Schema)
   Average time: 4530.8732 ms
   Success rate: 100.0000%
   Cost: 0.0060

Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Wide JSON Schema)
   Cost: 0.0060
   Average time: 4530.8732 ms
   Success rate: 100.0000%
2. gpt-4o-mini-non-strict-tool (Wide JSON Schema)
   Cost: 0.0078
   Average time: 2943.3405 ms
   Success rate: 100.0000%
3. gpt-4o-mini-non-strict-json (Wide JSON Schema)
   Cost: 0.0115
   Average time: 3603.0182 ms
   Success rate: 100.0000%
4. gpt-4o-2024-08-06-strict-json (Wide JSON Schema)
   Cost: 0.1023
   Average time: 3852.8140 ms
   Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-tool (Wide JSON Schema)
   Cost: 0.1313
   Average time: 3149.5288 ms
   Success rate: 100.0000%
6. gpt-4o-2024-08-06-non-strict-json (Wide JSON Schema)
   Cost: 0.1964
   Average time: 4065.5266 ms
   Success rate: 100.0000%

Report for schema: Super Complex JSON Schema

Methods sorted by performance (fastest to slowest):
1. gpt-4o-2024-08-06-strict-json (Super Complex JSON Schema)
   Average time: 10743.0250 ms
   Success rate: 100.0000%
   Cost: 0.3004
2. gpt-4o-mini-non-strict-json (Super Complex JSON Schema)
   Average time: 12884.3888 ms
   Success rate: 100.0000%
   Cost: 0.0393
3. gpt-4o-2024-08-06-non-strict-json (Super Complex JSON Schema)
   Average time: 13041.2693 ms
   Success rate: 100.0000%
   Cost: 0.6619
4. gpt-4o-mini-strict-json (Super Complex JSON Schema)
   Average time: 13289.3869 ms
   Success rate: 100.0000%
   Cost: 0.0241
5. gpt-4o-2024-08-06-non-strict-tool (Super Complex JSON Schema)
   Average time: 0.0000 ms
   Success rate: 0.0000%
   Cost: 0.0000
6. gpt-4o-mini-non-strict-tool (Super Complex JSON Schema)
   Average time: 0.0000 ms
   Success rate: 0.0000%
   Cost: 0.0000

Methods sorted by cost (cheapest to most expensive):
1. gpt-4o-mini-strict-json (Super Complex JSON Schema)
   Cost: 0.0241
   Average time: 13289.3869 ms
   Success rate: 100.0000%
2. gpt-4o-mini-non-strict-json (Super Complex JSON Schema)
   Cost: 0.0393
   Average time: 12884.3888 ms
   Success rate: 100.0000%
3. gpt-4o-2024-08-06-strict-json (Super Complex JSON Schema)
   Cost: 0.3004
   Average time: 10743.0250 ms
   Success rate: 100.0000%
4. gpt-4o-2024-08-06-non-strict-json (Super Complex JSON Schema)
   Cost: 0.6619
   Average time: 13041.2693 ms
   Success rate: 100.0000%
5. gpt-4o-2024-08-06-non-strict-tool (Super Complex JSON Schema)
   Cost: 0.0000
   Average time: 0.0000 ms
   Success rate: 0.0000%
6. gpt-4o-mini-non-strict-tool (Super Complex JSON Schema)
   Cost: 0.0000
   Average time: 0.0000 ms
   Success rate: 0.0000%

Results

Schema:Metric:

Complex JSON Schema

Method	Avg Time (ms)	Time Diff (ms)	Success Rate	Cost	Cost Diff
gpt-4o-2024-08-06-non-strict-tool	4079.0854	0	100.0000%	0.1680	+0.1534
gpt-4o-mini-non-strict-json	5847.6183	+1768.5329	100.0000%	0.0175	+0.0029
gpt-4o-2024-08-06-strict-json	5866.2200	+1787.1346	100.0000%	0.1528	+0.1382
gpt-4o-2024-08-06-non-strict-json	6314.3933	+2235.3079	100.0000%	0.3026	+0.2880
gpt-4o-mini-strict-json	7858.5114	+3779.4260	100.0000%	0.0146	0
gpt-4o-mini-non-strict-tool	N/A	N/A	0.0000%	N/A	N/A

Wide JSON Schema

Method	Avg Time (ms)	Time Diff (ms)	Success Rate	Cost	Cost Diff
gpt-4o-mini-non-strict-tool	2943.3405	0	100.0000%	0.0078	+0.0018
gpt-4o-2024-08-06-non-strict-tool	3149.5288	+206.1883	100.0000%	0.1313	+0.1253
gpt-4o-mini-non-strict-json	3603.0182	+659.6777	100.0000%	0.0115	+0.0055
gpt-4o-2024-08-06-strict-json	3852.8140	+909.4735	100.0000%	0.1023	+0.0963
gpt-4o-2024-08-06-non-strict-json	4065.5266	+1122.1861	100.0000%	0.1964	+0.1904
gpt-4o-mini-strict-json	4530.8732	+1587.5327	100.0000%	0.0060	0

###s Super Complex JSON Schema

Method	Avg Time (ms)	Time Diff (ms)	Success Rate	Cost	Cost Diff
gpt-4o-2024-08-06-strict-json	10743.0250	0	100.0000%	0.3004	+0.2763
gpt-4o-mini-non-strict-json	12884.3888	+2141.3638	100.0000%	0.0393	+0.0152
gpt-4o-2024-08-06-non-strict-json	13041.2693	+2298.2443	100.0000%	0.6619	+0.6378
gpt-4o-mini-strict-json	13289.3869	+2546.3619	100.0000%	0.0241	0
gpt-4o-2024-08-06-non-strict-tool	N/A	N/A	0.0000%	N/A	N/A
gpt-4o-mini-non-strict-tool	N/A	N/A	0.0000%	N/A	N/A

Key Findings

Strict vs Non-Strict: A Performance Showdown

Our benchmark revealed interesting insights into the performance differences between strict and non-strict modes across various JSON complexities:

Wide JSON Schema

In the simplest scenario, non-strict modes generally outperformed their strict counterparts:

The fastest performer was gpt-4o-mini with tool calls (2943.3405 ms)
gpt-4o-mini in Strict Mode was significantly slower (4530.8732 ms)
gpt-4o-2024-08-06 in Strict Mode (3852.8140 ms) was outpaced by its non-strict counterparts (3149.5288 ms with tool calls, 4065.5266 ms for JSON generation)

However, all configurations achieved a 100% success rate, indicating that strictness might be overkill for simple structures.

Complex JSON Schema

As complexity increased, the performance gap narrowed:

gpt-4o-2024-08-06 with tool calls led (4079.0854 ms)
gpt-4o-2024-08-06 in Strict Mode followed closely (5866.2200 ms)
Notably, gpt-4o-mini with tool calls failed completely, while its strict counterpart succeeded

This suggests that strictness becomes more valuable as JSON complexity increases, especially for less advanced models.

Super Complex JSON Schema

In the most challenging scenario, strict modes shined:

gpt-4o-2024-08-06 in Strict Mode was the top performer
Both non-strict tool methods failed entirely
Strict JSON methods maintained 100% success rates
Surprisingly, smaller model in strict mode was even slower than the bigger model in strict mode

This underscores the critical importance of strictness in handling highly complex JSON structures.

The Cold Start in Strict Mode

As Ted Sanders mentioned in this HN comment, using strict mode bears a significant cold start penalty which goes away in the subsequent runs.

The first request with each JSON schema will be slow, as we need to preprocess the JSON schema into a context-free grammar. If you don’t want that latency hit (e.g., you’re prototyping, or have a use case that uses variable one-off schemas), then you might prefer “strict”: false

How much slower it is? Here are my results:

Model	schema	avgFirstRequestTime	avgSecondRequestTime	coldStartPenalty
gpt-4o-2024-08-06	Wide JSON Schema	20234.0549	5927.3556	241.37%
gpt-4o-mini	Wide JSON Schema	21801.5501	5800.8192	275.84%
gpt-4o-2024-08-06	Complex JSON Schema	24089.9075	7100.4283	239.27%
gpt-4o-mini	Complex JSON Schema	26665.4039	10270.7880	159.62%
gpt-4o-2024-08-06	Super Complex JSON Schema	60481.4465	11698.9430	416.98%
gpt-4o-mini	Super Complex JSON Schema	66011.3763	13994.1616	371.71%

The more complex the JSON schema, the more painful the cold start penalty becomes.

Strictness and Cost: An Unexpected Relationship

Interestingly, the impact of strictness on cost varied between model versions:

For the gpt-4o-mini model, strict mode was generally cheaper (e.g., $0.0060 vs $0.0115 for Wide JSON)
For the 2024-08-06 model, strict mode was more cost-effective in all the cases

Caveats

While strict mode is superior for more complicated cases, it has limitations that might be disqualifying it from using it in your case:

Not all JSONSchema types & features are supported in strict mode. Things like allOf, oneOf, not, definitions, min, pattern, and most importantly format are not supported
JSON can be nested only up to 5 levels and have up to 100 properties
additionalProperties is not supported
All fields are required (but you can make them optional by doing the "type": ["string", "null"] trick)

My Practical Recommendations

Based on my findings, I recommend the following approaches:

For Simple JSON Structures:
- Prefer non-strict modes, especially tool-based methods for speed and cost-effectiveness
- Go with smaller mini model if you can (but don’t forget about potential failures, wrap in try/catch accordingly)
For Moderately Complex JSON:
- Use non-strict modes with more advanced models (e.g., gpt-4o-2024-08-06 with tool calls)
- Always validate the output using Zod schemas
For Highly Complex JSON:
- Strict modes are essential for reliability and success
- Use the most advanced model available (e.g., gpt-4o-2024-08-06 in Strict Mode)
- Be prepared for very significant cold start penalties (goes away in the subsequent runs, neglible when running multiple times in a row)
Cost Considerations:
- For simpler tasks, non-strict modes are generally more cost-effective
- For complex tasks, strict modes can be more cost-effective, especially with advanced models

Found this insightful? If you're interested in my AI consulting, please reach out to me via email or on X