Inference Optimization

LLM inference optimization: batching, quantization, KV-cache, speculative decoding, and hardware selection.

Last updated 2026-06-12

SD-19

LLM Inference Optimization

Where the milliseconds and the money go: prefill vs decode, KV caches, speculative decoding, quantization, parallelism — and the self-host vs API break-even math.

Latency and cost are the two axes every LLM system gets graded on, and both are decided at inference time. This section takes apart where the milliseconds go (prefill vs decode, TTFT vs throughput), walks through the levers that move them (KV caching, speculative decoding, continuous batching, quantization, parallelism), and ends where production decisions actually end: the break-even math between renting GPUs and paying per token.

Before the math: the territory in plain English

Everything in this section follows from one mechanical fact: an LLM never generates an answer — it generates one token, then starts over. A 250-token reply means running the model 250 times. The first of those runs is different in kind from the other 249, and that difference is the map for every optimization below.
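To make the "one token, then start over" loop concrete, here is a minimal sketch in Python. It assumes a toy stand-in for the model: `ToyModel`, `prefill`, `decode`, and `generate` are hypothetical names invented for illustration, not any real library's API, and the "sampler" is a deterministic placeholder. The only thing it demonstrates is the call pattern: one run over the whole prompt, then one run per additional token.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Hypothetical stand-in for an LLM: it counts forward passes, no real math."""
    prefill_calls: int = 0
    decode_calls: int = 0
    cache: list = field(default_factory=list)

    def prefill(self, prompt_tokens):
        # Run 1: read the entire prompt in one pass and emit the first output token.
        self.prefill_calls += 1
        self.cache = list(prompt_tokens)
        return self._next_token()

    def decode(self, last_token):
        # Runs 2..N: feed in exactly one new token, reusing the state kept so far.
        self.decode_calls += 1
        self.cache.append(last_token)
        return self._next_token()

    def _next_token(self):
        # Placeholder "sampler" (an assumption): a real model samples from a
        # probability distribution over its vocabulary.
        return len(self.cache) % 1000


def generate(model, prompt_tokens, max_new_tokens):
    # The first token comes from the prefill run; every later token costs one decode run.
    tokens = [model.prefill(prompt_tokens)]
    while len(tokens) < max_new_tokens:
        tokens.append(model.decode(tokens[-1]))
    return tokens


model = ToyModel()
reply = generate(model, prompt_tokens=list(range(40)), max_new_tokens=250)
print(len(reply), model.prefill_calls, model.decode_calls)  # prints: 250 1 249
```

The counters match the arithmetic in the paragraph above: a 250-token reply is 250 model runs, one of which handles the whole prompt while the other 249 each handle a single token.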

