LLM Inference Economics

An interactive exploration of batch size, context length, and cost tradeoffs in large language models. Based on the deep-dive lecture by Reiner Pope (CEO, MatX) and Dwarkesh Patel.

Hardware (Blackwell NVL72)

Compute (PFLOPs): 1800
Mem BW (TB/s): 40
HBM capacity (TB): 8

Model

Active params: 37B
Total params: 700B
Context length: 32K
Bytes/token (KV): 1024

Batch size: 100

18000

Cost per token: 0.351 (relative units)
Latency: 35.1 ms / forward pass
Throughput: 2.9K tokens/s
Bottleneck: Memory BW
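
The readouts above can be approximately reproduced with a simple roofline estimate. The sketch below is not taken from the page's source. It assumes weights are stored at 2 bytes per parameter (the page does not state a precision), that every forward pass reads all 700B parameters plus each sequence's full KV cache from HBM, that compute costs 2 FLOPs per active parameter per token, and that latency is the larger of the compute and memory times. The function name, argument names, and the "relative units" definition (milliseconds of latency divided by batch size) are illustrative assumptions.

```python
def forward_pass(batch, *, pflops=1800, mem_bw_tbs=40,
                 active_params=37e9, total_params=700e9,
                 context_len=32 * 1024, kv_bytes_per_token=1024,
                 bytes_per_param=2):
    """Roofline estimate for one decode step (all defaults are assumptions
    read off the page, except bytes_per_param, which is a guess)."""
    # Compute time: ~2 FLOPs per active parameter per generated token.
    compute_s = 2 * active_params * batch / (pflops * 1e15)

    # Memory time: read every weight once, plus the KV cache of each sequence.
    weight_bytes = total_params * bytes_per_param
    kv_bytes = batch * context_len * kv_bytes_per_token
    memory_s = (weight_bytes + kv_bytes) / (mem_bw_tbs * 1e12)

    # The slower of the two resources sets the forward-pass latency.
    latency_s = max(compute_s, memory_s)
    return {
        "latency_ms": latency_s * 1e3,
        "throughput_tok_s": batch / latency_s,
        "cost_per_token": latency_s * 1e3 / batch,  # ms per token, relative units
        "bottleneck": "Memory BW" if memory_s >= compute_s else "Compute",
    }


result = forward_pass(100)
print(f"latency    {result['latency_ms']:.1f} ms")
print(f"throughput {result['throughput_tok_s'] / 1e3:.1f} K tokens/s")
print(f"cost/token {result['cost_per_token']:.3f}")
print(f"bottleneck {result['bottleneck']}")
```

Under these assumptions the estimate at a batch size of 100 matches the displayed latency (35.1 ms), cost per token (0.351), throughput (2.9K tokens/s), and bottleneck. The memory time is almost entirely the 1.4 TB weight read; the KV cache adds under 0.1 ms at this batch size.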

HBM drain time: 200 ms (capacity ÷ bandwidth, here 8 TB ÷ 40 TB/s). This is the time to read all of HBM once, and it is roughly the forward-pass cadence when memory is full. At B=100, the system is memory-bandwidth-bound, and cost per token is 5935% above the minimum. Increase the batch size until the weight fetch is fully amortized.
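
To make the amortization argument concrete, the sketch below computes the HBM drain time from the hardware settings and shows how the weight-fetch share of per-token memory time shrinks as the batch grows, while the per-sequence KV read stays fixed. It assumes 2 bytes per parameter (not stated on the page) and that each forward pass reads all weights once; the constant and function names are illustrative, not the page's own code.

```python
HBM_CAPACITY_TB = 8
MEM_BW_TBS = 40
TOTAL_PARAMS = 700e9
BYTES_PER_PARAM = 2          # assumption: 16-bit weights
CONTEXT_LEN = 32 * 1024
KV_BYTES_PER_TOKEN = 1024

# Time to stream the entire HBM contents once: capacity / bandwidth.
drain_ms = HBM_CAPACITY_TB / MEM_BW_TBS * 1e3


def memory_ms_per_token(batch):
    """Return (weight_ms, kv_ms): memory time attributable to one generated token."""
    bw_bytes_per_s = MEM_BW_TBS * 1e12
    # The weight read is shared by every sequence in the batch.
    weight_ms = TOTAL_PARAMS * BYTES_PER_PARAM / bw_bytes_per_s * 1e3 / batch
    # Each sequence reads its own KV cache, so this term does not shrink.
    kv_ms = CONTEXT_LEN * KV_BYTES_PER_TOKEN / bw_bytes_per_s * 1e3
    return weight_ms, kv_ms


print(f"HBM drain time: {drain_ms:.0f} ms")
for batch in (1, 100, 1_000, 10_000):
    weight_ms, kv_ms = memory_ms_per_token(batch)
    print(f"B={batch:>6}: weights {weight_ms:.5f} ms/token, KV {kv_ms:.5f} ms/token")
```

The weight term falls as 1/B, so larger batches push the per-token cost toward the floor set by the KV read and compute. In practice the batch is also limited by how many KV caches fit in HBM alongside the weights, which this sketch does not model.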