LLM Memory Math
When working with large language models, understanding memory requirements is crucial for choosing hardware, optimizing performance, and avoiding out-of-memory errors. Let’s break down the math behind LLM memory consumption during both training and inference.
The Basics: Model Parameters
Every LLM starts with its parameters. The memory needed for parameters alone is straightforward:
Parameter Memory = num_parameters × bytes_per_parameter
For a 7B-parameter model in float16 (2 bytes per parameter):
Parameters = 7B × 2 bytes = 14 GB
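A quick sanity check in Python (a minimal sketch; the helper name is just for illustration, using 1 GB = 1e9 bytes):

```python
def parameter_memory_gb(num_parameters: float, bytes_per_parameter: float) -> float:
    """Memory needed just to hold the model weights, in GB (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# 7B parameters in float16 (2 bytes each)
print(parameter_memory_gb(7e9, 2))  # 14.0 GB
```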
But this is just the beginning of the story.
Inference Memory
During inference, the system needs additional memory on top of parameters:
1. KV Cache
The key–value cache stores attention states for previous tokens to avoid recomputation.
KV Cache Memory = 2 × batch_size × sequence_len × num_layers × hidden_size × bytes_per_element
For a 7B model (32 layers, 4096 hidden size, batch_size=1, sequence_length=2048, in fp16):
KV Cache = 2 × 1 × 2048 × 32 × 4096 × 2 bytes ≈ 1 GB
Note: With GQA/MQA, models can reduce the KV cache by sharing keys and values across attention heads. With a 4:1 ratio (H_kv = H/4), the KV cache drops by 75%.
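Here is the same formula as a rough calculator, with an optional grouping factor to model the GQA/MQA note (the function and its kv_groups parameter are illustrative, not from any particular library):

```python
def kv_cache_gb(batch_size, seq_len, num_layers, hidden_size,
                bytes_per_element=2, kv_groups=1):
    """KV cache size in GB; the leading 2 covers keys and values.
    kv_groups > 1 approximates GQA/MQA sharing (e.g. 4 means H_kv = H/4)."""
    return (2 * batch_size * seq_len * num_layers
            * (hidden_size / kv_groups) * bytes_per_element / 1e9)

print(kv_cache_gb(1, 2048, 32, 4096))               # ~1.07 GB, full multi-head attention
print(kv_cache_gb(1, 2048, 32, 4096, kv_groups=4))  # ~0.27 GB with 4:1 GQA
```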
2. Activation Memory
Temporary tensors during forward pass:
Activation Memory = batch_size × sequence_len × num_layers × hidden_size × bytes_per_element
During inference, activations include:
- Current layer’s inputs and outputs
- Attention scores and intermediate MLP states
These scale linearly with batch size and sequence length.
For a 7B model (32 layers, 4096 hidden size, sequence length 2K, batch size 1) in fp16:
Activations ≈ 0.5 GB
Note: With techniques like Flash Attention, the quadratic attention matrices are never fully materialized, keeping activation memory linear in sequence length.
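The same back-of-the-envelope estimate in code (a sketch only; real peak activation memory depends heavily on the implementation and kernel fusion):

```python
def activation_memory_gb(batch_size, seq_len, num_layers, hidden_size,
                         bytes_per_element=2):
    """Rough peak activation memory during inference, in GB."""
    return (batch_size * seq_len * num_layers
            * hidden_size * bytes_per_element / 1e9)

print(activation_memory_gb(1, 2048, 32, 4096))  # ~0.54 GB
```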
Combined
Total Inference Memory = Parameters + KV Cache + Activations
For a 7B model with a 2K-token context: ~14 GB + ~1 GB + ~0.5 GB ≈ 15.5 GB
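Putting the three pieces together in one self-contained sketch (it simply re-applies the formulas above; the function name is illustrative):

```python
def total_inference_memory_gb(num_params, batch_size, seq_len,
                              num_layers, hidden_size, bytes_per_element=2):
    """Parameters + KV cache + activations, using the formulas above."""
    params = num_params * bytes_per_element / 1e9
    kv_cache = 2 * batch_size * seq_len * num_layers * hidden_size * bytes_per_element / 1e9
    activations = batch_size * seq_len * num_layers * hidden_size * bytes_per_element / 1e9
    return params + kv_cache + activations

print(total_inference_memory_gb(7e9, 1, 2048, 32, 4096))  # ~15.6 GB
```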
Training Memory
Training memory splits into two distinct categories:
1. Model & Optimizer States (scales with parameters)
Standard mixed-precision training with the Adam optimizer requires:
- Model weights (fp16 + fp32 master copy): 6N bytes
- Gradients (fp16): 2N bytes
- Adam optimizer states (fp32 momentum + variance): 8N bytes
Model & Optimizer Memory = 16 bytes × num_parameters
For a 7B model: 16 bytes × 7B = 112 GB total
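The byte accounting, spelled out as a sketch (the helper name is illustrative):

```python
def training_state_memory_gb(num_parameters):
    """Mixed-precision Adam: fp16 weights (2) + fp32 master weights (4)
    + fp16 gradients (2) + fp32 momentum (4) + fp32 variance (4) = 16 bytes/param."""
    bytes_per_param = 2 + 4 + 2 + 4 + 4
    return num_parameters * bytes_per_param / 1e9

print(training_state_memory_gb(7e9))  # 112.0 GB
```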
2. Activations (scales with batch size and sequence length)
Intermediate tensors saved during forward pass for computing gradients:
Activation Memory = c × batch_size × sequence_len × num_layers × hidden_size × bytes_per_element
Where c is a multiplier representing how many intermediate tensors are stored: roughly 1–2 with gradient checkpointing and 3–6 without.
Note: Activations are not sharded by ZeRO in a multi-GPU setup—each GPU needs full activation memory for its micro-batch.
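A sketch of this estimate with the multiplier c; the batch size of 8 below is just an illustrative choice, and the c values follow the rough rule of thumb above:

```python
def training_activation_memory_gb(batch_size, seq_len, num_layers, hidden_size,
                                  bytes_per_element=2, c=4):
    """Training activation memory in GB; c ~ 3-6 normally, ~1-2 with checkpointing."""
    return (c * batch_size * seq_len * num_layers
            * hidden_size * bytes_per_element / 1e9)

# 7B-style model (32 layers, 4096 hidden), batch of 8 sequences of 2048 tokens
print(training_activation_memory_gb(8, 2048, 32, 4096, c=4))    # ~17 GB without checkpointing
print(training_activation_memory_gb(8, 2048, 32, 4096, c=1.5))  # ~6 GB with checkpointing
```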
Common Memory Optimization Techniques
During Inference:
- Quantization: Reduce to int8 (1 byte) or even int4 (0.5 bytes) per parameter (see the sketch after this list)
- GQA/MQA: Reduce KV cache by 50–75% through key–value sharing
- FlashAttention: Reduce memory usage from quadratic to linear in sequence length
- PagedAttention: Manage the KV cache in fixed-size blocks for efficient batched inference
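To make the quantization point concrete, here is a sketch of parameter memory at different weight precisions (it ignores the small per-group scale/zero-point overhead that real quantization schemes add):

```python
def quantized_parameter_memory_gb(num_parameters, bits_per_parameter):
    """Parameter memory at a given weight precision, ignoring scale/zero-point overhead."""
    return num_parameters * bits_per_parameter / 8 / 1e9

for name, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    print(f"{name}: {quantized_parameter_memory_gb(7e9, bits):.1f} GB")
# fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```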
During Training:
- Gradient Checkpointing: Trade compute for memory by recomputing activations
- ZeRO Optimization: Shard model states across GPUs
- LoRA/QLoRA: Train only small adapter matrices instead of the full model
- Gradient Accumulation: Simulate larger batches without growing activation memory
Conclusion
At inference, memory is dominated by the parameters plus the KV cache, which grows linearly with sequence length. Training is far heavier: model and optimizer states alone take roughly 16 bytes per parameter instead of 2, and activations grow with batch size and sequence length on top of that.