LLM Inference Economics

An interactive exploration of batch size, context length, and cost tradeoffs in large language models. Based on the deep-dive lecture by Reiner Pope (CEO, MatX) and Dwarkesh Patel.

Hardware (Blackwell NVL72)

Compute (PFLOPs): 1800

Mem BW (TB/s): 40

HBM capacity (TB): 8

Model

Active params: 37B

Total params: 700B

Context length: 32K

Bytes/token (KV): 1024
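To make the memory side of these settings concrete, here is a minimal sketch of the footprint arithmetic. The page does not state a weight precision, so the 2 bytes per parameter below is an assumption, as is reading "32K" as 32,768 tokens and "TB" as decimal terabytes. All function and constant names are hypothetical.

```python
# Values copied from the Hardware and Model panels; names are hypothetical.
TOTAL_PARAMS = 700_000_000_000
CONTEXT_TOKENS = 32 * 1024          # assumption: "32K" means 32,768 tokens
KV_BYTES_PER_TOKEN = 1024
HBM_CAPACITY_BYTES = 8 * 10**12     # assumption: 8 TB in decimal units
BYTES_PER_PARAM = 2                 # assumption: 16-bit weights (not stated on the page)


def weight_bytes() -> int:
    """Bytes needed to hold every parameter, active or not."""
    return TOTAL_PARAMS * BYTES_PER_PARAM


def kv_cache_bytes(batch_size: int) -> int:
    """KV-cache bytes for batch_size sequences, each at full context length."""
    return batch_size * CONTEXT_TOKENS * KV_BYTES_PER_TOKEN


def max_batch_in_hbm() -> int:
    """Largest batch whose KV cache fits beside the weights (capacity only)."""
    free_bytes = HBM_CAPACITY_BYTES - weight_bytes()
    return free_bytes // kv_cache_bytes(1)


if __name__ == "__main__":
    print(f"weights:      {weight_bytes() / 1e12:.2f} TB")
    print(f"KV at B=100:  {kv_cache_bytes(100) / 1e9:.2f} GB")
    print(f"max batch by capacity alone: {max_batch_in_hbm()}")
```

Under these assumptions the KV cache at a batch of 100 is a few gigabytes against well over a terabyte of weights, so at small batch sizes nearly all memory traffic per forward pass is weight fetch.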

Batch size: 100

18000

Cost per token: 0.351 (relative units)

Latency: 35.1 ms / forward pass

Throughput: 2.9 K tokens/s

Bottleneck: Memory BW
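The page does not show the formulas behind these readouts. The sketch below is one plausible reconstruction, a simple roofline model, and should be read as an illustration rather than the page's actual code. It assumes 2 bytes per parameter, 2 FLOPs per active parameter per token, that every forward pass streams all weights plus the full KV cache from HBM, and that latency is the larger of compute time and memory time. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hardware:
    flops: float     # peak compute, FLOP/s
    mem_bw: float    # memory bandwidth, bytes/s


@dataclass(frozen=True)
class Model:
    active_params: float
    total_params: float
    context_tokens: int
    kv_bytes_per_token: int
    bytes_per_param: int = 2   # assumption: 16-bit weights


def forward_pass(hw: Hardware, m: Model, batch: int) -> dict:
    """Roofline estimate for one decode step over `batch` sequences."""
    # Assumption: 2 FLOPs (multiply + add) per active parameter per token.
    compute_s = 2 * m.active_params * batch / hw.flops
    # Assumption: all weights and the whole KV cache are read once per pass.
    weight_bytes = m.total_params * m.bytes_per_param
    kv_bytes = batch * m.context_tokens * m.kv_bytes_per_token
    memory_s = (weight_bytes + kv_bytes) / hw.mem_bw
    latency_s = max(compute_s, memory_s)
    return {
        "latency_ms": latency_s * 1e3,
        "cost_per_token": latency_s * 1e3 / batch,   # ms of system time per token
        "throughput_tok_s": batch / latency_s,
        "bottleneck": "Memory BW" if memory_s >= compute_s else "Compute",
    }


HW = Hardware(flops=1800e15, mem_bw=40e12)
MODEL = Model(active_params=37e9, total_params=700e9,
              context_tokens=32 * 1024, kv_bytes_per_token=1024)

if __name__ == "__main__":
    for key, value in forward_pass(HW, MODEL, batch=100).items():
        print(key, value)
```

With the panel values and a batch size of 100, these assumptions give a latency, cost per token, and throughput that round to the figures displayed above, with memory bandwidth as the limiting resource.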

HBM drain time: 200 ms (capacity ÷ bandwidth, here 8 TB ÷ 40 TB/s), which is roughly the forward-pass cadence. At B=100, the system is bound by memory bandwidth, and cost per token is 5935% above the minimum. Increase the batch size until the weight fetch is fully amortized.
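The amortization argument can be sketched numerically. The code below computes the HBM drain time from the panel values and splits a per-token cost into a weight-fetch share (which shrinks as 1/B) and a KV-cache share (which does not). The cost model is an assumption, not the page's own: 2 bytes per parameter, 2 FLOPs per active parameter per token, and per-token cost taken as the larger of memory time and compute time. It does not attempt to reproduce the page's "above minimum" percentage, whose definition is not given. Names are hypothetical.

```python
# Panel values; the cost model itself is an assumption (see the lead-in).
MEM_BW = 40e12                         # bytes/s
HBM_CAPACITY = 8e12                    # bytes
FLOPS = 1800e15                        # FLOP/s
ACTIVE_PARAMS = 37e9
TOTAL_PARAMS = 700e9
BYTES_PER_PARAM = 2                    # assumption: 16-bit weights
KV_BYTES_PER_SEQ = 32 * 1024 * 1024    # 32K tokens x 1024 bytes/token


def hbm_drain_ms() -> float:
    """Time to read all of HBM once: capacity / bandwidth."""
    return HBM_CAPACITY / MEM_BW * 1e3


def cost_breakdown_ms(batch: int) -> dict:
    """Per-token cost in ms, split by where the time goes."""
    weights = TOTAL_PARAMS * BYTES_PER_PARAM / MEM_BW / batch * 1e3  # shrinks as 1/B
    kv = KV_BYTES_PER_SEQ / MEM_BW * 1e3                             # constant in B
    compute = 2 * ACTIVE_PARAMS / FLOPS * 1e3                        # constant in B
    return {
        "weights": weights,
        "kv": kv,
        "compute": compute,
        "total": max(weights + kv, compute),
    }


if __name__ == "__main__":
    print(f"HBM drain time: {hbm_drain_ms():.0f} ms")
    for b in (1, 10, 100, 1_000, 10_000):
        c = cost_breakdown_ms(b)
        print(f"B={b:>6}  total={c['total']:.5f} ms/token  "
              f"weight share={c['weights'] / (c['weights'] + c['kv']):.1%}")
```

In this sketch the weight-fetch term dominates at small batch sizes and falls toward the fixed KV-cache term as the batch grows, which is the sense in which a larger batch amortizes the weight fetch.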