Semantic Caching

Semantic caching for LLM applications: reduce cost and latency by serving cached answers to semantically similar queries, matched by vector similarity.

Last updated 2026-06-12

SD-7

The cheapest lever in LLM system design: embed the question, search for a near-duplicate, and skip the model call entirely. Done right, it cuts 20–40% of spend and 97% of latency on hit traffic; done wrong, it serves wrong answers with total confidence.

Every LLM call you can avoid is money and latency you never spend — and on FAQ-shaped traffic, most questions have been asked before, just never in the same words. A semantic cache stores past questions as vectors and answers any new question that is close enough in meaning, turning a ~2-second, $0.003 model call into a ~50-millisecond, $0 lookup. This section covers the architecture, the threshold that makes or breaks it, the production rules, and the economics.
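To make the per-call figures concrete, here is a rough back-of-the-envelope sketch of the blended cost and latency per request at a given hit rate. The constants are the ones quoted above; the function name `blended` and the hit rates tried are illustrative assumptions. It also assumes, as a simplification, that a hit costs nothing and that a miss pays no extra lookup overhead, which ignores the embedding call and the vector search that run on every request.

```python
# Figures quoted in the text above.
MODEL_COST_USD = 0.003   # per model call
MODEL_LATENCY_S = 2.0    # per model call
CACHE_LATENCY_S = 0.050  # per cache hit
CACHE_COST_USD = 0.0     # assumption: a hit is treated as free


def blended(hit_rate: float) -> tuple[float, float]:
    """Return (expected cost in USD, expected latency in seconds) per request."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    cost = hit_rate * CACHE_COST_USD + miss_rate * MODEL_COST_USD
    latency = hit_rate * CACHE_LATENCY_S + miss_rate * MODEL_LATENCY_S
    return cost, latency


# Latency saved on a single hit, as a fraction of the model-call latency.
# With the figures above this is about 0.975.
hit_latency_saving = 1.0 - CACHE_LATENCY_S / MODEL_LATENCY_S

if __name__ == "__main__":
    # Hypothetical hit rates, chosen only to show the shape of the curve.
    for rate in (0.0, 0.2, 0.4):
        cost, latency = blended(rate)
        print(f"hit rate {rate:.0%}: ${cost:.4f} per request, {latency:.2f} s average")
```

Under these assumptions spend falls in direct proportion to the hit rate, so the saving depends entirely on how often a near-duplicate really exists in your traffic.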

WHERE YOU ARE

You have already seen every piece once: caches as a primitive in SD-1 · System Design 101, and a toy cache wired into the support bot in SD-3 · Your First Agentic System. What is new is the semantic part — matching on meaning instead of strings — and the judgment around thresholds, invalidation, and isolation.
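The difference between matching on strings and matching on meaning can be sketched in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical stand-in that counts words (a real cache would call an embedding model), `SemanticCache` is an illustrative class, the linear scan stands in for a vector index, and the 0.8 threshold is an arbitrary value picked for the demo.

```python
import math
import re
from collections import Counter
from typing import Callable, Optional

Vector = dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: lowercase word counts. Not a real embedding model."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {word: float(n) for word, n in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))

    def get(self, question: str) -> Optional[str]:
        """Return the answer of the closest stored question, or None below threshold."""
        query = self.embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self.entries:  # linear scan; an index would replace this
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


ANSWER = "Use the reset link on the sign-in page."
exact_cache = {"How do I reset my password?": ANSWER}
semantic_cache = SemanticCache(toy_embed, threshold=0.8)
semantic_cache.put("How do I reset my password?", ANSWER)

rephrased = "how can I reset my password"
exact_hit = exact_cache.get(rephrased)        # None: the strings differ
semantic_hit = semantic_cache.get(rephrased)  # hit: similarity 5/6 clears 0.8
unrelated = semantic_cache.get("What are your opening hours?")  # None: no overlap
```

The same sketch also shows the failure mode. With this toy embedding, "How do I reset my username?" scores 5/6 against the stored password question as well, so at a 0.8 threshold it would be served the password answer. A real embedding model separates such cases better, but the threshold is still a judgment call.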
