What is inference engineering? Deepdive
Published on 31.03.2026
TLDR: Inference engineering is the discipline of making large language model outputs faster, cheaper, and more reliable at scale — and with the gap between open and closed models effectively closed as of late 2024, it's becoming a core competency for any engineering team running AI in production.
Summary:
If you've been watching the AI landscape for the past couple of years, you've probably noticed a quiet but seismic shift happening beneath the surface of all the model announcement noise. A few years ago, LLMs were a curiosity that mostly lived in research papers and demo videos. Today, in 2026, nearly every software engineer touches an LLM daily — whether through a coding assistant, a search interface, or something embedded so deeply in a product they don't even see it. The question has evolved from "should we use AI?" to "how do we run this thing at scale without it costing us a fortune or melting under load?" That's exactly the problem inference engineering exists to solve. Gergely Orosz at The Pragmatic Engineer teamed up with Philip Kiely — a software engineer with four years at inference startup Baseten and the author of the free e-book "Inference Engineering" — to produce one of the most thorough treatments of this topic you'll find outside of an internal company wiki.
To understand why inference engineering matters, you first need to understand what inference actually is. Training is the phase where a model learns from data — it's expensive, slow, and done relatively rarely. Inference is what happens every single time you send a prompt: the model takes your input and generates output one token at a time, in a sequential autoregressive loop. That sequential nature is fundamental to understanding the engineering challenges involved. You can't just throw more compute at inference the way you might parallelize a batch job, because each token depends on the previous one. The hardware story here is dominated by NVIDIA, not because of some arbitrary market outcome, but because of CUDA — a software ecosystem that has years of library support, tooling, and developer familiarity baked in. The datacenter GPU of the moment is the NVIDIA B200, but teams also run inference on workstation and personal computing GPUs depending on their latency, compliance, and cost requirements. Cloud providers like AWS, GCP, and CoreWeave sit alongside on-premises and air-gapped deployments, and at real scale, multi-cloud inference becomes a necessity rather than a luxury.
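To make that sequential loop concrete, here is a minimal sketch of greedy autoregressive decoding using the Hugging Face transformers library. The model name, prompt, and token count are placeholders for illustration, not details from the article.

```python
# Minimal sketch of autoregressive decoding: each new token comes from a full
# forward pass that depends on every token generated so far.
# Assumes the `transformers` and `torch` packages; "gpt2" is a placeholder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model, purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

input_ids = tokenizer("Inference engineering is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):  # generate 20 tokens, one at a time
        logits = model(input_ids).logits                              # forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedy pick
        input_ids = torch.cat([input_ids, next_token], dim=-1)        # append and repeat

print(tokenizer.decode(input_ids[0]))
```

Note that this naive loop re-processes the full sequence on every step; production engines keep a key-value cache so each step only computes attention for the new token, which is also the mechanism that prefix caching (covered below) builds on.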
The software stack that sits on top of this hardware is layered in a way that rewards understanding each level. At the bottom you have CUDA, then deep learning frameworks like PyTorch, then inference engines — vLLM, SGLang, and TensorRT-LLM being the most prominent — and then higher-level orchestration like NVIDIA Dynamo. Autoscaling via Kubernetes is the baseline assumption, but the decisions around when and how to scale are more nuanced than they appear. Traffic-based scaling and utilization-based scaling lead to different tradeoffs, and getting this wrong means either paying for idle capacity or having your users stare at spinners. The five core techniques that inference engineers reach for to improve performance are quantization, speculative decoding, prefix caching, parallelism, and disaggregation. Each one addresses a different bottleneck, and using them well requires understanding not just the technique itself but the interaction effects between them and the specific usage pattern of your workload.
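For a sense of what the inference-engine layer looks like from application code, here is a minimal sketch using vLLM's offline Python API. The model name and sampling settings are illustrative choices, not recommendations from the article.

```python
# Minimal sketch of serving a model through an inference engine (vLLM's offline API).
# Requires the `vllm` package and a GPU; the model name below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # placeholder open model
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Explain the difference between prefill and decode.",
    "What does quantization trade away?",
]
outputs = llm.generate(prompts, params)  # the engine batches these requests internally

for out in outputs:
    print(out.outputs[0].text)
```

The point is less the specific API than the division of labor: the engine owns batching, KV-cache memory, and GPU scheduling, while the layers above it own autoscaling and request routing.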
Quantization reduces the numerical precision of model weights — moving from FP16 to FP8 to INT4 — and each step down can yield thirty to fifty percent better performance, though with increasing risk of quality degradation, particularly in attention layers where precision matters most. Speculative decoding is a clever trick where a smaller draft model generates candidate tokens and the larger target model validates them, effectively getting you more tokens per forward pass — but it works best at low batch sizes and low temperature, so it's not a universal win. Prefix caching reuses the key-value cache for requests that share common prefixes, which is enormously valuable for long system prompts, retrieval-augmented generation pipelines, and code completion scenarios where the context is largely the same across requests. Tensor parallelism splits individual layers across multiple GPUs, while expert parallelism — relevant for mixture-of-experts architectures — assigns different experts to different GPUs. And disaggregation, arguably the most architecturally significant of the five, separates the prefill phase from the decode phase and runs them on different hardware, because prefill is compute-bound and determines your time-to-first-token, while decode is memory-bound and determines your tokens-per-second throughput.
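To show the shape of the draft-and-verify idea behind speculative decoding, here is a deliberately simplified, greedy sketch. Real engines use a probabilistic accept/reject rule and fused kernels; `draft_model` and `target_model` here are hypothetical stand-ins for any two Hugging Face-style causal LMs that share a tokenizer.

```python
# Simplified greedy speculative decoding: a small draft model proposes k tokens,
# the large target model checks them in a single forward pass, and we keep the
# longest prefix the target agrees with. A sketch, not a production implementation.
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, input_ids, k=4):
    # 1. The small draft model proposes k tokens autoregressively (cheap passes).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits
        next_tok = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    proposed = draft_ids[:, input_ids.shape[1]:]          # the k candidate tokens

    # 2. The large target model scores the whole drafted sequence in ONE forward pass.
    target_logits = target_model(draft_ids).logits
    # Target's own greedy choice at each position where the draft proposed a token.
    target_choice = target_logits[:, input_ids.shape[1] - 1:-1, :].argmax(dim=-1)

    # 3. Keep the longest prefix of proposals the target agrees with.
    agree = (proposed == target_choice).long()[0]
    n_accept = int(agree.cumprod(dim=0).sum().item())

    # We always gain at least one token: the target's own pick right after the
    # accepted prefix (which is also where it diverged from the draft, if it did).
    bonus_pos = input_ids.shape[1] - 1 + n_accept
    bonus = target_logits[:, bonus_pos, :].argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, proposed[:, :n_accept], bonus], dim=-1)
```

Even in this toy form you can see why the technique helps most at low batch sizes and low temperature: the gain depends on the draft agreeing with the target, and the target's single verification pass has to stay cheap relative to generating those tokens one by one.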
The broader strategic framing Gergely offers is worth sitting with: inference engineering is essentially the AI version of the build-versus-buy decision. When you're spending significant money on inference from closed-model vendors, the economics of running your own inference on open models start to make sense — and you gain control over latency, reliability, cost, and compliance in the process. The catalyst that shifted this calculus dramatically was DeepSeek V3, released in December 2024, followed by R1 in January 2025. Before those models arrived, there was a meaningful quality gap between closed and open models that made the buy-side argument easy. That gap has now essentially closed. Cursor built their Composer 2.0 feature on top of the open Kimi 2.5 model using inference engineering techniques to make it competitive on speed. This is what the future looks like: teams that understand inference engineering will have a genuine competitive advantage, while teams that don't will be permanently dependent on vendors for both the model and the pricing.
Key takeaways:
- Inference is the token-generation phase after training — sequential, memory-intensive, and the primary cost driver in production AI systems
- The five core inference optimization techniques are quantization, speculative decoding, prefix caching, parallelism (tensor and expert), and disaggregation
- NVIDIA dominates inference hardware due to the CUDA ecosystem, not just raw silicon performance
- Key metrics to track are TTFT (time to first token), TPS (tokens per second), and ITL (inter-token latency); see the sketch after this list for how they relate
- The quality gap between open and closed models effectively closed with DeepSeek V3 (December 2024) and R1 (January 2025), making self-hosted inference a realistic option for more teams
- Inference engineering becomes worth pursuing when your inference spend at vendors reaches a scale where the operational overhead pays for itself in cost and control
- Autoscaling with Kubernetes is the baseline, but traffic-based versus utilization-based scaling decisions require careful tuning for LLM workloads
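As a concrete reading of the three metrics mentioned above, here is a small sketch that derives TTFT, average ITL, and decode TPS from per-token arrival timestamps; the numbers are made up for illustration, not measurements from the episode.

```python
# Deriving the three headline latency metrics from per-token arrival timestamps.
# `request_sent` and `token_times` are illustrative numbers only.

request_sent = 0.00                                   # seconds, when the prompt was sent
token_times = [0.42, 0.47, 0.52, 0.58, 0.63, 0.69]    # arrival time of each output token

ttft = token_times[0] - request_sent                  # time to first token
itl = [b - a for a, b in zip(token_times, token_times[1:])]
avg_itl = sum(itl) / len(itl)                         # average inter-token latency
tps = (len(token_times) - 1) / (token_times[-1] - token_times[0])  # decode throughput

print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Avg ITL: {avg_itl * 1000:.0f} ms")
print(f"Decode TPS: {tps:.1f} tokens/s")
```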
Why do I care:
If you're building anything that calls an LLM more than a handful of times per day, the choices being made at the inference layer directly affect your users' experience and your team's AWS bill. Understanding these five techniques — even at a conceptual level — changes how you think about architecture decisions: whether to use a long shared system prompt or a shorter one, whether your latency budget fits a speculative decoding setup, whether you're hitting a memory wall or a compute wall when things slow down. This isn't academic knowledge anymore. With open models now genuinely competitive with closed ones, inference engineering is becoming a first-class engineering discipline that sits right next to reliability engineering and performance engineering on the list of things that separate teams shipping great AI products from teams shipping slow, expensive ones.