LLM Inference Economics
An interactive exploration of batch size, context length, and cost tradeoffs in large language models. Based on the deep-dive lecture by Reiner Pope (CEO, MatX) and Dwarkesh Patel.
Hardware (Blackwell NVL72)
Compute (PFLOPs): 1800
Mem BW (TB/s): 40
HBM capacity (TB): 8
Model
Active params: 37B
Total params: 700B
Context length: 32K
Bytes/token (KV): 1024
Batch size: 100
Cost per token: 0.351 (relative units)
Latency: 35.1 ms / forward pass
Throughput: 2.9 K tokens/s
Bottleneck: Memory BW
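The readouts above are consistent with a simple roofline-style estimate, sketched below. This is not the page's own code. It assumes 2 bytes per weight, 2 FLOPs per active parameter per token, every weight read from HBM once per forward pass, a 32K context of 32 × 1024 tokens, and a latency equal to the slower of the compute time and the memory time. All function and variable names are hypothetical.

```python
# Roofline-style sketch of one decode forward pass.
# ASSUMPTIONS (not stated on the page): 2 bytes per weight, 2 FLOPs per
# active parameter per token, all weights read once per pass, and
# "32K" context = 32 * 1024 tokens.

PFLOP = 1e15
TB = 1e12

COMPUTE_FLOPS = 1800 * PFLOP   # Compute (PFLOPs)
MEM_BW = 40 * TB               # Mem BW (TB/s), in bytes per second
ACTIVE_PARAMS = 37e9           # Active params
TOTAL_PARAMS = 700e9           # Total params
CONTEXT_LEN = 32 * 1024        # Context length, in tokens
KV_BYTES_PER_TOKEN = 1024      # Bytes/token (KV)

BYTES_PER_PARAM = 2            # assumption
FLOPS_PER_PARAM = 2            # assumption


def forward_pass(batch):
    """Estimate latency, cost, throughput, and bottleneck for one pass."""
    # Compute time: every sequence in the batch touches the active weights.
    t_compute = FLOPS_PER_PARAM * ACTIVE_PARAMS * batch / COMPUTE_FLOPS

    # Memory time: weights are read once, the KV cache once per sequence.
    weight_bytes = TOTAL_PARAMS * BYTES_PER_PARAM
    kv_bytes = batch * CONTEXT_LEN * KV_BYTES_PER_TOKEN
    t_memory = (weight_bytes + kv_bytes) / MEM_BW

    latency_s = max(t_compute, t_memory)
    return {
        "latency_ms": latency_s * 1e3,
        "cost_per_token": latency_s * 1e3 / batch,  # ms of machine time per token
        "throughput_tok_s": batch / latency_s,
        "bottleneck": "Memory BW" if t_memory >= t_compute else "Compute",
    }


result = forward_pass(batch=100)
print(f"Latency:    {result['latency_ms']:.1f} ms")
print(f"Cost/token: {result['cost_per_token']:.3f}")
print(f"Throughput: {result['throughput_tok_s'] / 1e3:.1f} K tokens/s")
print(f"Bottleneck: {result['bottleneck']}")
```

Under these assumptions the weight fetch (1.4 TB at 40 TB/s, about 35 ms) dominates both the KV-cache read (under 0.1 ms) and the compute time (a few microseconds), which is why the pass is bound by memory bandwidth at B=100.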
HBM drain time: 200 ms (capacity ÷ bandwidth = 8 TB ÷ 40 TB/s). This is the time to read all of HBM once, and it is roughly the forward-pass cadence when HBM is full. At B=100, the system is Memory BW-bound. Cost per token is 5935% above minimum. Increase the batch size until the weight fetch is fully amortized.
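The drain-time figure and the amortization argument can be checked with a short sweep over batch sizes. This is a sketch under assumptions that are not stated on the page: 2 bytes per weight, 2 FLOPs per active parameter per token, a 32K context of 32 × 1024 tokens, and a memory-or-compute maximum as the latency model. The helper names are hypothetical, and the sketch does not attempt to reproduce the page's "above minimum" percentage.

```python
# Sketch: HBM drain time and cost per token as batch size grows.
# ASSUMPTIONS (not from the page): 2 bytes per weight, 2 FLOPs per active
# parameter per token, "32K" = 32 * 1024 tokens, latency = max(compute, memory).

TB = 10**12

COMPUTE_FLOPS = 1800 * 10**15            # Compute (PFLOPs)
MEM_BW = 40 * TB                         # Mem BW, bytes per second
HBM_BYTES = 8 * TB                       # HBM capacity
ACTIVE_PARAMS = 37 * 10**9
TOTAL_PARAMS = 700 * 10**9
KV_BYTES_PER_SEQ = 32 * 1024 * 1024      # context tokens * KV bytes per token

WEIGHT_BYTES = TOTAL_PARAMS * 2          # assumption: 2 bytes per weight


def latency_ms(batch):
    """Forward-pass time in ms: the slower of compute and memory traffic."""
    t_compute = 2 * ACTIVE_PARAMS * batch / COMPUTE_FLOPS
    t_memory = (WEIGHT_BYTES + batch * KV_BYTES_PER_SEQ) / MEM_BW
    return max(t_compute, t_memory) * 1e3


def cost_per_token(batch):
    """Machine time per generated token, in ms."""
    return latency_ms(batch) / batch


# Time to stream the whole of HBM once: capacity divided by bandwidth.
drain_ms = HBM_BYTES / MEM_BW * 1e3

# Largest batch whose KV cache still fits next to the weights in HBM.
max_batch = (HBM_BYTES - WEIGHT_BYTES) // KV_BYTES_PER_SEQ

print(f"HBM drain time: {drain_ms:.0f} ms")
print(f"Largest batch that fits in HBM: {max_batch}")
for b in (100, 1_000, 10_000, max_batch):
    print(f"B={b:>7}  latency={latency_ms(b):7.1f} ms  cost/token={cost_per_token(b):.5f}")
```

In this model the fixed 35 ms weight fetch is shared by more tokens as B grows, so cost per token keeps falling until the per-sequence KV-cache read takes over. At the largest batch that fits, the pass reads nearly all 8 TB, so its latency approaches the 200 ms drain time.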