Tested OpenAI's prompt caching across models. Found undocumented behavior
Been building an AI agent from scratch to understand token economics. Spent a week on prompt caching. Found something interesting that isn't in OpenAI's docs.

Setup: network device monitoring chatbot, 10 tools, ~1,400-token prefix. Tested gpt-4o-mini, gpt-5-mini, and gpt-5, logging cached_tokens from every response.
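The logging is just reading usage off every response. A minimal sketch of the harness, assuming placeholders for the prefix and tool list (my real prefix is the system prompt plus the 10 tool schemas, and it has to stay byte-identical across calls):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder for the ~1,400-token static prefix (system prompt + tool schemas).
# Caching only kicks in once the shared prefix is 1024+ tokens.
SYSTEM_PREFIX = "You are a network device monitoring assistant. ..."

def ask(model: str, question: str):
    """One chat call; logs how much of the prompt was served from the cache."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PREFIX},
            {"role": "user", "content": question},
        ],
        # tools=TOOL_SCHEMAS,  # the 10 function schemas would go here
    )
    cached = resp.usage.prompt_tokens_details.cached_tokens
    print(f"{model}: prompt_tokens={resp.usage.prompt_tokens} cached_tokens={cached}")
    return resp
```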
Finding 1: Caching works as documented

Once the prefix exceeds 1024 tokens, OpenAI caches it automatically. I saw 80-90% cache hit rates after the first call and a 47-49% cost reduction on input tokens. The cache discount is 50% for 4o-mini and 90% for the gpt-5 family.
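For intuition on how those two numbers relate: only the cached share of your input tokens gets the discount, so the overall reduction is roughly cached_fraction × discount. The fractions below are illustrative, not pulled from my logs:

```python
def input_cost_reduction(cached_fraction: float, cache_discount: float) -> float:
    """Fraction saved on input-token spend: only cached tokens get the discount."""
    return cached_fraction * cache_discount

# Illustrative combinations that land near the 47-49% I measured:
print(input_cost_reduction(0.95, 0.50))  # 0.475 -> 4o-mini needs ~95% of input cached
print(input_cost_reduction(0.53, 0.90))  # 0.477 -> gpt-5 family needs ~53% cached
```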
Finding 2: Tool schema tokenization is heavily compressed

Added 4 tools to my existing 6. Expected +400-500 tokens based on the JSON size; actual increase: 56 tokens. OpenAI is clearly doing aggressive compression on function schemas.
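If you want to check this yourself, the measurement is just a diff of prompt_tokens with and without the extra schemas. A rough sketch (the helper name and the "ping" message are made up for illustration):

```python
def tool_schema_overhead(client, model, base_tools, extra_tools):
    """Diff prompt_tokens for the same message with and without extra tool schemas."""
    msgs = [{"role": "user", "content": "ping"}]
    before = client.chat.completions.create(model=model, messages=msgs, tools=base_tools)
    after = client.chat.completions.create(model=model, messages=msgs,
                                           tools=base_tools + extra_tools)
    return after.usage.prompt_tokens - before.usage.prompt_tokens
```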
Finding 3: Cache is shared across model generations (undocumented)

This is the interesting part. Test: call gpt-4o-mini first (cold start), wait 5 seconds, then call gpt-5-mini with an identical prefix. Result: gpt-5-mini got a cache hit on its first call. I tested all permutations; every time, models 2 and 3 hit the cache from model 1's warmup. The prefix-processing cache is shared across 4o-mini, 5-mini, and 5. I couldn't find this documented anywhere.
Why it matters: if you have many cold starts (separate user sessions, different contexts), you can warm the cache with the cheapest model.

Example: 1,000 cold starts/day, 10K-token prefix, primary model gpt-5.

Without cross-model warming: each session pays 10K tokens at $1.25/1M = $0.0125. Daily: $12.50. Annual: ~$4,562.
With nano warming first: 10K tokens at $0.05/1M = $0.0005 per warmup. Daily: $0.50. Annual: ~$182.
Savings: ~$4,380/year.

At gpt-5-pro pricing ($15/1M), the difference is $54K+/year on warmup costs alone.
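Same arithmetic as a quick script (prices as listed above; this only counts the uncached warmup read, and ignores the discounted cached-token cost the primary model still pays when it later reads the warmed prefix):

```python
COLD_STARTS_PER_DAY = 1_000
PREFIX_TOKENS = 10_000

def annual_prefix_cost(price_per_million: float) -> float:
    """Cost of processing the full 10K-token prefix uncached, once per cold start."""
    per_session = PREFIX_TOKENS * price_per_million / 1_000_000
    return per_session * COLD_STARTS_PER_DAY * 365

gpt5_direct = annual_prefix_cost(1.25)   # ~$4,562/year: gpt-5 eats every cold start
nano_warmup = annual_prefix_cost(0.05)   # ~$182/year: gpt-5-nano warms the cache first
print(f"savings: ${gpt5_direct - nano_warmup:,.0f}/year")  # ~$4,380
```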
Technical note: This is prefix-processing cache sharing, not KV-cache sharing. Models share tokenization and prefix hashing, not attention states. But billing-wise, cached tokens are cached tokens.
Reproduction: create a 1024+ token prefix; call model A and log cached_tokens; call model B with the same prefix; check whether B's first call shows cached tokens. The field is response.usage.prompt_tokens_details.cached_tokens. Happy to share test scripts.
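Here's roughly what the test looks like. The padded prefix below is a stand-in (any byte-identical 1024+ token prefix works), and the expected values in the comments are what I saw, not guarantees:

```python
import time
from openai import OpenAI

client = OpenAI()

# Stand-in prefix; must tokenize to 1024+ tokens and be byte-identical across calls.
PREFIX = "You are a network device monitoring assistant for a large campus network. " * 100

def first_call_cached_tokens(model: str) -> int:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PREFIX},
            {"role": "user", "content": "Summarize current device status."},
        ],
    )
    return resp.usage.prompt_tokens_details.cached_tokens

print("model A (gpt-4o-mini, cold):", first_call_cached_tokens("gpt-4o-mini"))   # expect 0
time.sleep(5)
print("model B (gpt-5-mini, first call):", first_call_cached_tokens("gpt-5-mini"))  # >0 = cross-model hit
```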