LLM Memory Math
When working with large language models, understanding memory requirements is crucial for choosing hardware, optimizing performance, and avoiding out-of-memory errors. Let’s break down the math behind LLM memory consumption during both training and inference.
The Basics: Model Parameters
Every LLM starts with its parameters. The memory needed for parameters alone is straightforward:
Parameter Memory = num_parameters × bytes_per_parameter
For a 7B-parameter model in float16 (2 bytes per parameter):
Parameters = 7B × 2 bytes = 14 GB
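A quick sanity check in Python (a minimal sketch; the helper name is just for illustration, using 1 GB = 1e9 bytes):

```python
def parameter_memory_gb(num_parameters: float, bytes_per_parameter: float) -> float:
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# 7B parameters in float16 (2 bytes each)
print(parameter_memory_gb(7e9, 2))  # 14.0 GB
```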
But this is just the beginning of the story.
Inference Memory
During inference, the system needs additional memory on top of parameters:
1. KV Cache
The key–value cache stores attention states for previous tokens to avoid recomputation.
KV Cache Memory = 2 × batch_size × sequence_len × num_layers × hidden_size × bytes_per_element
For a 7B model (32 layers, 4096 hidden size, batch_size=1, sequence_length=2048, in fp16):
KV Cache = 2 × 1 × 2048 × 32 × 4096 × 2 bytes ≈ 1 GB
Note: With GQA/MQA, models can reduce the KV cache by sharing keys and values across attention heads. With a 4:1 ratio (H_kv = H/4), the KV cache drops by 75%.
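Here is the same formula as a rough calculator, with an optional grouping factor to model the GQA/MQA note (the function and its kv_groups parameter are illustrative, not from any particular library):

```python
def kv_cache_gb(batch_size, seq_len, num_layers, hidden_size,
                bytes_per_element=2, kv_groups=1):
    """KV cache size in GB; the leading 2 covers keys and values.
    kv_groups > 1 approximates GQA/MQA sharing (e.g. 4 means H_kv = H/4)."""
    return (2 * batch_size * seq_len * num_layers
            * (hidden_size / kv_groups) * bytes_per_element / 1e9)

print(kv_cache_gb(1, 2048, 32, 4096))               # ~1.07 GB, full multi-head attention
print(kv_cache_gb(1, 2048, 32, 4096, kv_groups=4))  # ~0.27 GB with 4:1 GQA
```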
2. Activation Memory
Temporary tensors during forward pass:
Activation Memory = batch_size × sequence_len × num_layers × hidden_size × bytes_per_element
During inference, activations include:
- Current layer’s inputs and outputs
- Attention scores and intermediate MLP states
These scale linearly with batch size and sequence length.
For a 7B model (32 layers, 4096 hidden size, sequence length 2K, batch size 1) in fp16:
Activations ≈ 0.5 GB
Note: With techniques like Flash Attention, the quadratic attention matrices are never fully materialized, keeping activation memory linear in sequence length.
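The same back-of-the-envelope estimate in code (a sketch only; real peak activation memory depends heavily on the implementation and kernel fusion):

```python
def activation_memory_gb(batch_size, seq_len, num_layers, hidden_size,
                         bytes_per_element=2):
    """Rough peak activation memory during inference, in GB."""
    return (batch_size * seq_len * num_layers
            * hidden_size * bytes_per_element / 1e9)

print(activation_memory_gb(1, 2048, 32, 4096))  # ~0.54 GB
```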
Combined
Total Inference Memory = Parameters + KV Cache + Activations
For a 7B model with a 2K-token context: ~14 GB + ~1 GB + ~0.5 GB ≈ 15.5 GB
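Putting the three pieces together in one self-contained sketch (it simply re-applies the formulas above; the function name is illustrative):

```python
def total_inference_memory_gb(num_params, batch_size, seq_len,
                              num_layers, hidden_size, bytes_per_element=2):
    """Parameters + KV cache + activations, using the formulas above."""
    params = num_params * bytes_per_element / 1e9
    kv_cache = 2 * batch_size * seq_len * num_layers * hidden_size * bytes_per_element / 1e9
    activations = batch_size * seq_len * num_layers * hidden_size * bytes_per_element / 1e9
    return params + kv_cache + activations

print(total_inference_memory_gb(7e9, 1, 2048, 32, 4096))  # ~15.6 GB
```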
Training Memory
Training memory splits into two distinct categories:
1. Model & Optimizer States (scales with parameters)
Standard mixed-precision training with the Adam optimizer requires:
- Model weights (fp16 + fp32 master copy): 6N bytes
- Gradients (fp16): 2N bytes
- Adam optimizer states (fp32 momentum + variance): 8N bytes
Model & Optimizer Memory = 16 bytes × num_parameters
For a 7B model: 16 bytes × 7B = 112 GB total
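The byte accounting, spelled out as a sketch (the helper name is illustrative):

```python
def training_state_memory_gb(num_parameters):
    """Mixed-precision Adam: fp16 weights (2) + fp32 master weights (4)
    + fp16 gradients (2) + fp32 momentum (4) + fp32 variance (4) = 16 bytes/param."""
    bytes_per_param = 2 + 4 + 2 + 4 + 4
    return num_parameters * bytes_per_param / 1e9

print(training_state_memory_gb(7e9))  # 112.0 GB
```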
2. Activations (scales with batch size and sequence length)
Intermediate tensors saved during forward pass for computing gradients:
Activation Memory = c × batch_size × sequence_len × num_layers × hidden_size × bytes_per_element
Where c is a multiplier representing how many intermediate tensors are stored: roughly 1–2 with gradient checkpointing and 3–6 without.
Note: Activations are not sharded by ZeRO in a multi-GPU setup—each GPU needs full activation memory for its micro-batch.
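A sketch of this estimate with the multiplier c; the batch size of 8 below is just an illustrative choice, and the c values follow the rough rule of thumb above:

```python
def training_activation_memory_gb(batch_size, seq_len, num_layers, hidden_size,
                                  bytes_per_element=2, c=4):
    """Training activation memory in GB; c ~ 3-6 normally, ~1-2 with checkpointing."""
    return (c * batch_size * seq_len * num_layers
            * hidden_size * bytes_per_element / 1e9)

# 7B-style model (32 layers, 4096 hidden), batch of 8 sequences of 2048 tokens
print(training_activation_memory_gb(8, 2048, 32, 4096, c=4))    # ~17 GB without checkpointing
print(training_activation_memory_gb(8, 2048, 32, 4096, c=1.5))  # ~6 GB with checkpointing
```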
Common Memory Optimization Techniques
During Inference:
- Quantization: Reduce to int8 (1 byte) or even int4 (0.5 bytes) per parameter (see the sketch after this list)
- GQA/MQA: Reduce KV cache by 50–75% through key–value sharing
- FlashAttention: Reduce memory usage from quadratic to linear in sequence length
- PagedAttention: Manage the KV cache in fixed-size blocks for efficient batched inference
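To make the quantization point concrete, here is a sketch of parameter memory at different weight precisions (it ignores the small per-group scale/zero-point overhead that real quantization schemes add):

```python
def quantized_parameter_memory_gb(num_parameters, bits_per_parameter):
    """Parameter memory at a given weight precision, ignoring scale/zero-point overhead."""
    return num_parameters * bits_per_parameter / 8 / 1e9

for name, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    print(f"{name}: {quantized_parameter_memory_gb(7e9, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```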
During Training:
- Gradient Checkpointing: Trade compute for memory by recomputing activations
- ZeRO Optimization: Shard model states across GPUs
- LoRA/QLoRA: Train only small adapter matrices instead of the full model
- Gradient Accumulation: Simulate larger batches without growing activation memory
Conclusion
At inference, memory is dominated by the parameters plus the KV cache, which grows linearly with sequence length. Training is far heavier: model and optimizer states alone take roughly 16 bytes per parameter instead of 2, and activations grow with batch size and sequence length on top of that.