The New Inference Stack (2025): FlashAttention-3, FP8→FP4, and Practical Patterns for Cheap, Fast LLM Serving
MLOps
How FlashAttention-3, FP8/FP4, paged KV, and smart decoding cut costs and boost throughput in next-gen LLM serving.
21 Aug, 2025
6 min read
Abstract
In 2025, differentiation in AI is shifting from which model you run to how you serve it. This article outlines a pragmatic “inference stack” for production LLMs:
Attention kernels (FlashAttention-3)
Low-precision formats (FP8 now, FP4 soon)
Serving systems (paged KV, continuous batching)
Decoding strategies (speculative decoding when it actually helps).
We conclude with a migration plan, a risk checklist, and a simple cost model you can adapt.
TL;DR (for the busy CTO)
FA-3 delivers ~1.5–2.0x faster attention on H100s and unlocks long contexts without blowing memory. Roll it out where supported. (arXiv, tridao.me, OpenReview)
Precision roadmap: Standardize on FP8 now (Hopper/RTX 6000 Ada), prepare for FP4 on Blackwell (NVFP4/MXFP4/FP4 variants) as software support matures. (Advanced Clustering Technologies, NVIDIA Developer)
Serving system > cache hacks: Use paged KV + continuous batching (e.g., vLLM) to prevent cache thrash and lift throughput 2–4x. (arXiv)
Speculative decoding is situational: It speeds up when acceptance rates are high and the draft model is sized right; it can also slow you down. Test before adopting. (ACL Anthology)
Who this is for
Teams already shipping LLM features and watching latency / $ / throughput closely.
Infra leaders planning 6–12 month roadmaps across Hopper → Blackwell generations.
Product owners who need long-context, lower cost per token, and predictable SLAs.
1) Kernel layer: FlashAttention-3 as the new default
What it is.
FA-3 is a re-engineered attention kernel that overlaps compute and data movement (warp specialization + TMA), interleaves matmul and softmax, and exploits FP8 on Hopper, yielding ~1.5–2.0x speedups over FA-2 and much higher hardware utilization.
Why you care.
Throughput: Higher tokens/sec, especially at long context lengths.
Memory: Better handling of very long windows (think 128k–256k) without catastrophic slowdowns.
Compatibility: Plays well with GQA and modern serving patterns as frameworks add support. (Check your stack: Triton/TensorRT-LLM/PyTorch builds and flags.)
Watch for version skew: driver + CUDA + framework + FA-3 build must align.
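Before relying on claimed speedups, confirm that a fused attention path actually dispatches on your stack. The sketch below uses PyTorch's SDPA dispatcher as a smoke test; it validates a flash-style kernel on your build, not FA-3 specifically, which typically arrives through dedicated flash-attn / TensorRT-LLM builds.
```python
# Minimal sketch: confirm a fused flash-style attention kernel dispatches on this build.
# Assumes PyTorch >= 2.3 with CUDA. FA-3 itself usually ships via separate flash-attn / Hopper
# or TensorRT-LLM builds, so this only verifies the fused-attention path, not FA-3 per se.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) in bf16, a shape the flash backend supports
q, k, v = (torch.randn(1, 16, 4096, 128, device="cuda", dtype=torch.bfloat16) for _ in range(3))

try:
    # Restrict dispatch to the flash backend; the call raises if no flash kernel
    # is available for this dtype / head-dim / GPU combination.
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
    print("flash attention path OK:", out.shape)
except RuntimeError as err:
    print("flash backend unavailable on this build:", err)
```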
2) Precision layer: FP8 now, FP4 soon
FP8 (Hopper “Transformer Engine”) — stable today for training and inference with proper calibration. Expect sizeable speed/throughput gains and memory savings vs FP16/BF16.
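A minimal sketch of what FP8 execution looks like with NVIDIA Transformer Engine's PyTorch API; the recipe arguments are illustrative, so check them against your installed version:
```python
# Minimal sketch: FP8 forward pass with NVIDIA Transformer Engine (Hopper-class GPU assumed).
# Recipe arguments are illustrative; verify names against your installed transformer_engine version.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,   # E4M3 for forward, E5M2 for gradients
    amax_history_len=16,        # calibration window for scaling factors
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)                # GEMM runs in FP8 with per-tensor scaling
```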
FP4 today (software-only quantization).
Already possible with INT4, NF4, FP4 weight-only quantization.
Used widely in open models (7B, 13B) for single-GPU serving.
Significant accuracy loss if applied naïvely, especially for large-scale or sensitive workloads.
Typical practice: keep activations in FP16/FP8 and quantize only the weights (a sketch follows this list).
Long-tail reasoning and rare token distributions are most affected.
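A minimal sketch of that weight-only recipe with Hugging Face Transformers + bitsandbytes (4-bit NF4 weights, bf16 compute); the model name and settings are illustrative:
```python
# Minimal sketch: weight-only 4-bit (NF4) quantization, activations kept in bf16.
# Model name and settings are illustrative; validate accuracy on your own evals.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # any 7B-13B-class checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # activations / compute stay in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tok("Summarize the FP4 trade-offs:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```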
FP4 next (hardware-native on Blackwell).
Compiler and runtime support will make FP4 serving more accurate and efficient.
Quantization-aware training (QAT) and adaptive formats will minimize accuracy loss.
This is the inflection point where FP4 becomes production-grade.
What to do in 2025.
Standardize on FP8 (H100 / L40S / RTX 6000 Ada) for production.
Run FP4 pilots as soon as your workloads can be validated offline; adopt when your evals (exact-match, BLEU, ROUGE, pass@k, domain metrics) show parity.
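A minimal sketch of the offline parity gate implied above: run the higher-precision baseline and the quantized candidate over the same held-out set and compare a task metric before promoting. The helper names and the 1% tolerance are illustrative:
```python
# Minimal sketch: offline parity gate before promoting a quantized model.
# `baseline_fn` / `candidate_fn` stand in for your two serving paths;
# the 1% exact-match tolerance is illustrative, not a recommendation.
from typing import Callable

def exact_match_rate(generate: Callable[[str], str], evalset: list[tuple[str, str]]) -> float:
    hits = sum(1 for prompt, gold in evalset if generate(prompt).strip() == gold.strip())
    return hits / len(evalset)

def parity_gate(baseline_fn, candidate_fn, evalset, max_drop: float = 0.01) -> bool:
    base = exact_match_rate(baseline_fn, evalset)
    cand = exact_match_rate(candidate_fn, evalset)
    print(f"baseline EM={base:.3f}  candidate EM={cand:.3f}  drop={base - cand:.3f}")
    return (base - cand) <= max_drop
```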
3) Serving layer: paged KV + continuous batching
KV caches balloon with context length and fragment under naive allocators. PagedAttention (vLLM) treats KV like virtual memory: non-contiguous pages, reuse, and sharing, enabling 2–4x throughput vs older stacks at the same latency. Pair with continuous batching to keep SMs busy under irregular traffic.
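A minimal offline sketch with vLLM, which provides PagedAttention and continuous batching out of the box; the model name and knob values are illustrative, and the same knobs apply to the OpenAI-compatible server for online traffic:
```python
# Minimal sketch: vLLM serving with a paged KV cache and continuous batching.
# Model name and knob values are illustrative; tune them against your latency SLO.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.90,  # leave headroom for activation spikes
    max_model_len=32768,          # cap context so KV page usage stays predictable
    max_num_seqs=128,             # upper bound on concurrently batched sequences
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Explain paged KV caching in two sentences."], params)
print(outputs[0].outputs[0].text)
```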
Operational tips.
Right-size the KV page size and maximum batch size against your latency SLO, not peak-throughput screenshots.
Monitor context length distribution; a few outliers can starve batches.
For multi-tenant setups, implement per-tenant token quotas and preemption to avoid noisy neighbors.
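A minimal sketch of the per-tenant quota idea: a plain token bucket enforced at the gateway in front of the serving engine. The class and limits are illustrative, not a feature of any particular server:
```python
# Minimal sketch: per-tenant token bucket enforced at the gateway in front of the serving engine.
# Illustrative only; limits, clock source, and persistence are deployment-specific choices.
import time
from dataclasses import dataclass

@dataclass
class TenantBucket:
    rate_tokens_per_s: float   # sustained per-tenant token budget
    burst_tokens: float        # short-term burst allowance

    def __post_init__(self):
        self._level = self.burst_tokens   # start with a full burst allowance
        self._last = time.monotonic()

    def try_admit(self, requested_tokens: int) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the burst size.
        self._level = min(self.burst_tokens,
                          self._level + (now - self._last) * self.rate_tokens_per_s)
        self._last = now
        if requested_tokens <= self._level:
            self._level -= requested_tokens
            return True
        return False                      # caller should queue, shed, or preempt

buckets = {"tenant-a": TenantBucket(rate_tokens_per_s=2000, burst_tokens=8000)}
print(buckets["tenant-a"].try_admit(512))
```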
4) Decoding layer: speculative decoding (when it actually helps)
Speculative decoding uses a draft model to propose tokens that the target model verifies. It accelerates when draft quality is high and acceptance rates are healthy; poor sizing or task mismatch can backfire. Recent work explores adaptive draft lengths to stabilize wins across tasks.
Rules of thumb.
Start with a draft roughly ¼–½ the target's size; measure acceptance vs. wall-clock, not just tokens/sec (a back-of-the-envelope model follows this list).
For short prompts and low-entropy outputs (e.g., classification, forms), gains may be modest; long-form generation benefits more.
Re-evaluate after switching kernels/precisions — acceptance dynamics can change.
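A back-of-the-envelope model of the acceptance-vs-wall-clock trade-off, using the standard speculative-decoding analysis; it assumes i.i.d. per-token acceptance, which real traffic only approximates, so treat the numbers as directional:
```python
# Back-of-the-envelope speedup model for speculative decoding.
# Assumes an i.i.d. per-token acceptance probability `alpha` (the standard analysis);
# real acceptance is correlated across tokens, so treat results as directional only.
def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """alpha: per-token acceptance rate (< 1); gamma: draft tokens per step;
    draft_cost: draft forward cost relative to one target forward (0 < draft_cost < 1)."""
    # Expected tokens emitted per verification step.
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost per step: gamma draft forwards plus one target verification forward.
    cost_per_step = gamma * draft_cost + 1
    return expected_tokens / cost_per_step

for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_speedup(alpha, gamma=4, draft_cost=0.2), 2))
# Low acceptance (~0.5) barely breaks even; high acceptance (~0.9) roughly doubles throughput.
```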
A simple cost model you can actually use
Let:
C_gpu = $/hour per GPU
T = tokens/sec/GPU (measured end-to-end, with your traffic)
U = utilization (0–1), real average including lulls
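A minimal sketch of how these terms combine, assuming cost per token is simply GPU dollars per hour spread over the tokens actually served; real deployments add orchestration, egress, and idle-fleet overheads on top:
```python
# Minimal sketch of the cost model, assuming:
#   cost per 1M tokens = C_gpu / (T * U * 3600) * 1e6
# i.e., GPU-hours priced at C_gpu, spread over the tokens actually served.
def cost_per_million_tokens(c_gpu_per_hour: float, tokens_per_s_per_gpu: float, utilization: float) -> float:
    tokens_per_gpu_hour = tokens_per_s_per_gpu * utilization * 3600
    return c_gpu_per_hour / tokens_per_gpu_hour * 1_000_000

# Example: $3.50/h GPU, 2,500 tok/s end-to-end, 55% real utilization.
print(f"${cost_per_million_tokens(3.50, 2500, 0.55):.2f} per 1M tokens")
```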