Why Memory Systems Matter in Large Language Models

Large Language Models like GPT-4, Llama 3, and Claude are often discussed in terms of parameters, training data, and capabilities. But beneath the headlines lies a less glamorous truth: memory systems are the silent bottleneck determining what these models can actually do. Without efficient memory architectures, the most powerful model is unusable — too slow, too expensive, or simply unable to fit into available hardware.

This post explores why memory matters for LLMs, how transformer architectures consume memory, the difference between compute-bound and memory-bound operations, and the innovative techniques researchers use to push the boundaries of what fits into GPU memory. Understanding these constraints is essential for anyone deploying, fine-tuning, or serving LLMs in production.

The Memory Wall: Why GPUs Aren't Enough

Modern GPUs have tremendous computational throughput. An NVIDIA H100 SXM is rated at 1,979 teraFLOPS of FP16 Tensor Core throughput with structured sparsity (about 989 teraFLOPS for dense math), which is on the order of a quadrillion floating-point operations per second. Yet serving a single 70-billion-parameter model often achieves only 10-20% of theoretical peak FLOPS. The culprit is memory bandwidth.

The memory wall refers to the growing gap between processor speed and memory speed. Computation gets exponentially faster; memory latency and bandwidth improve slowly. For LLMs, almost every operation touches memory. The model weights, attention keys and values, activations, and gradients must all move between compute units and memory.

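Memory bandwidth vs compute capacity (H100 SXM):

- HBM3 memory bandwidth: 3.35 TB/s
- FP16 compute: 1,979 TFLOPS (peak, with sparsity)

For a 70B model (140 GB in FP16, setting aside that this exceeds a single GPU's 80 GB):
- Reading all weights once takes: 140 GB / 3.35 TB/s ≈ 42 milliseconds
- During that time, the GPU could perform: 1,979 TFLOPS × 0.042 s ≈ 83 TFLOP (83 trillion operations) of work
- But generating one token needs only about 2 FLOPs per parameter, roughly 0.14 TFLOP, so the arithmetic units sit mostly idle while the weights stream in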

Memory bandwidth is the true bottleneck for LLM inference. You can have infinite compute, but you are limited by how fast you can move weights from HBM to the compute units.
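One way to make this concrete is the roofline model's ridge point: the number of floating-point operations a kernel must perform per byte moved before compute, rather than bandwidth, becomes the limit. The sketch below uses the H100 figures quoted above. The function names are illustrative, and the rule of thumb of about 2 FLOPs per parameter per token, with weights read once per step and shared by the whole batch, is an assumption.

```python
def ridge_point(peak_flops: float, bandwidth_bytes_per_s: float) -> float:
    """FLOPs per byte needed before compute, not memory, is the limit."""
    return peak_flops / bandwidth_bytes_per_s


def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per weight byte read in one decode step.

    Assumes ~2 FLOPs (one multiply, one add) per parameter per token and
    that each weight is read from HBM once per step, shared by the batch.
    """
    return 2.0 * batch_size / bytes_per_param


H100_PEAK_FLOPS = 1979e12   # FP16 figure quoted in the text
H100_BANDWIDTH = 3.35e12    # bytes per second

ridge = ridge_point(H100_PEAK_FLOPS, H100_BANDWIDTH)
print(f"ridge point: {ridge:.0f} FLOPs/byte")
for batch in (1, 32, 512, 1024):
    intensity = decode_intensity(batch)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    print(f"batch {batch:>4}: {intensity:>6.0f} FLOPs/byte -> {regime}")
```

Under these assumptions a decode step at batch size 1 performs about one FLOP per byte, hundreds of times below the ridge point, which is why adding compute alone does not help.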

Where Memory Goes: Breaking Down LLM Memory Usage

A large language model consumes memory in several distinct categories. Understanding each is critical for optimization.

1. Model Weights (Parameters)

Each parameter is typically stored as FP16 (2 bytes) or BF16 (2 bytes). Some use FP32 (4 bytes) for higher precision. Quantized models use INT8 (1 byte) or INT4 (0.5 bytes).

Memory for weights by model size (FP16):

7B parameters → 14 GB
13B parameters → 26 GB
32B parameters → 64 GB
70B parameters → 140 GB
130B parameters → 260 GB
405B parameters (Llama 3.1) → 810 GB

No single GPU can hold the largest models. The H100 has 80 GB of HBM, the A100 comes in 40 GB and 80 GB variants, and the RTX 4090 has 24 GB. Even 70B models require multiple GPUs or aggressive quantization.
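The table above is simply parameter count times bytes per parameter. A minimal sketch of that arithmetic follows; the helper names are hypothetical, 1 GB is taken as 10^9 bytes, and the GPU count is a lower bound that ignores KV cache, activations, and quantization metadata.

```python
import math

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def min_gpus_for_weights(num_params: float, dtype: str, gpu_memory_gb: float) -> int:
    """Lower bound on GPU count: weights only, no KV cache or activations."""
    return math.ceil(weight_memory_gb(num_params, dtype) / gpu_memory_gb)


for dtype in ("fp16", "int8", "int4"):
    size = weight_memory_gb(70e9, dtype)
    gpus = min_gpus_for_weights(70e9, dtype, gpu_memory_gb=80)
    print(f"70B {dtype}: {size:.0f} GB -> at least {gpus} x 80 GB GPU")
```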

2. Key-Value (KV) Cache

During autoregressive generation, the model attends to all previous tokens. Instead of recomputing keys and values for each token, the model caches them. The KV cache grows linearly with sequence length and batch size.

KV cache memory for a 70B model (80 layers, 8 KV heads, 128 dimension per head):

Per token memory = 2 (key + value) × layers × KV_heads × head_dim × bytes_per_param
                = 2 × 80 × 8 × 128 × 2 (FP16)
                = 327,680 bytes per token ≈ 0.33 MB per token

For batch size 32, sequence length 4096:
KV cache = 32 × 4096 × 0.33 MB = 43 GB

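At larger batch sizes or longer contexts, the KV cache can exceed the model weights themselves: the same batch of 32 at a 16K context needs about 172 GB, more than the 140 GB of weights.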

This explains why long-context models (128K, 1M tokens) are so memory-intensive. The KV cache grows linearly with context, so a single 1M-token sequence on this model needs roughly 330 GB of cache.
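The KV cache formula above is easy to turn into a calculator. The sketch below uses the 70B configuration from this section as defaults; the function names are illustrative, and 1 GB is taken as 10^9 bytes.

```python
def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes one token adds: a key and a value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def kv_cache_gb(batch_size, seq_len, layers=80, kv_heads=8, head_dim=128,
                bytes_per_value=2):
    """Total KV cache in GB for a batch of sequences of equal length."""
    per_token = kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value)
    return batch_size * seq_len * per_token / 1e9


print(kv_cache_bytes_per_token(80, 8, 128), "bytes per token")
for seq_len in (4096, 16384, 131072):
    print(f"batch 32, {seq_len:>6} tokens: {kv_cache_gb(32, seq_len):8.1f} GB")
```

Passing `bytes_per_value=1` models an INT8 cache, which halves every figure.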

3. Activations

During training, activations from each layer must be stored for backpropagation. During inference, activations are smaller but still significant. Activation memory scales with batch size and sequence length.

4. Optimizer State (Training Only)

Training requires additional memory for optimizer state. The Adam optimizer stores two values per parameter (momentum and variance). In FP32, that's 8 bytes per parameter, on top of the model weights and gradients.

Training memory for a 7B model with Adam:

- Model weights (FP16): 14 GB
- Gradients (FP16): 14 GB
- Optimizer momentum (FP32): 28 GB
- Optimizer variance (FP32): 28 GB
- Total: 84 GB minimum

This is why training is much more memory-intensive than inference.
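The breakdown above amounts to a bytes-per-parameter budget. The sketch below reproduces it; the function name and the `master_weight_bytes` option are illustrative additions. The option models the FP32 master copy of the weights that mixed-precision training commonly keeps, which is an assumption beyond the list above, and activations are excluded.

```python
def training_memory_gb(num_params, weight_bytes=2, grad_bytes=2,
                       adam_state_bytes=8, master_weight_bytes=0):
    """Memory for weights, gradients, and Adam state in GB (activations excluded)."""
    per_param = weight_bytes + grad_bytes + adam_state_bytes + master_weight_bytes
    return num_params * per_param / 1e9


# Matches the 7B breakdown in the text: 12 bytes per parameter
print(training_memory_gb(7e9))
# With an additional FP32 master copy of the weights (assumed): 16 bytes per parameter
print(training_memory_gb(7e9, master_weight_bytes=4))
```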

Attention: The Memory Quadratic Nightmare

The attention mechanism is the heart of transformers but also the primary memory bottleneck. Standard multi-head attention has O(n²) memory complexity relative to sequence length.

# Standard attention memory calculation (score matrix for one layer)
def attention_memory(seq_len, num_heads, batch_size, bytes_per_element=2):
    # Q @ K^T produces attention scores of shape [batch, heads, seq_len, seq_len].
    # head_dim is not a parameter: the score matrix size does not depend on it.
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element

# Example: batch 32, 32 heads, sequence length 4096, FP16
# attention_memory(4096, 32, 32) = 32 × 32 × 4096 × 4096 × 2 bytes ≈ 34 GB
# That is just the score matrix of a single layer, before softmax and before multiplying by V.

Flash Attention revolutionized LLM training by eliminating the quadratic memory bottleneck. Instead of materializing the full attention matrix in HBM, Flash Attention splits Q, K, and V into tiles that fit in on-chip SRAM, computes the softmax incrementally across tiles inside a single fused kernel, and writes only the final output to HBM.

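Flash Attention memory reduction:

Standard attention: O(n²) memory
Flash Attention: O(n) memory

For 64K sequence length:
- Standard: impractical (the FP16 score matrix alone is about 8.6 GB per head per layer, or roughly 275 GB per layer with 32 heads at batch size 1)
- Flash Attention: no score matrix in HBM; the extra memory is limited to the output and per-row softmax statistics, which grow linearly with sequence length

Flash Attention-2 and Flash Attention-3 further optimize the tiling strategy and work partitioning, achieving reported speedups of up to 2-3x over standard attention while using only a small fraction of its memory.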
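What makes tiling possible is that softmax can be computed incrementally. The sketch below shows that idea for a single query in plain Python: keys and values are visited one block at a time while only a running maximum, a running normalizer, and a running weighted sum are kept. It illustrates the online-softmax recurrence only, not the actual Flash Attention kernel; the function names are hypothetical and the usual 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score for one query, then softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time."""
    running_max = -math.inf
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 0.0],
        [0.7, 0.7, 0.7], [0.0, -0.5, 1.5]]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [2.0, 2.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block_size=2))
```

Both calls print the same vector up to floating-point rounding, even though the tiled version never holds more than one block of scores at a time.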

Memory-Bound vs Compute-Bound: Understanding the Bottleneck

LLM inference is almost always memory-bound, not compute-bound, especially during token-by-token decoding at small batch sizes. The GPU spends most of its time waiting for data from HBM rather than performing arithmetic.

How to identify the bottleneck:

# Theoretical maximum decode throughput if memory-bound (one sequence, batch size 1)
def memory_bound_throughput(memory_bandwidth, model_size):
    # Every generated token requires streaming all weights once, so
    # time per token = model_size / memory_bandwidth (matching units, e.g. GB and GB/s)
    time_per_token = model_size / memory_bandwidth
    # Maximum tokens per second = 1 / time_per_token
    return 1 / time_per_token

# For 70B model (140 GB) at H100 bandwidth (3.35 TB/s = 3,350 GB/s):
# time_per_token = 140 GB / 3,350 GB/s ≈ 42 ms
# memory_bound_throughput(3350, 140) ≈ 24 tokens per second

# Actual observed throughput is lower due to KV cache reads and kernel overhead

Compute-bound operations occur only in specific scenarios:

  • Prompt processing (prefill), where all prompt tokens pass through the weights in a single step
  • Extremely long sequences where attention computation dominates
  • Training or serving with very large batches

Real-world inference characteristics (illustrative single-stream figures):

Model Size          Bottleneck      Tokens/sec (H100)
1B                  Memory-bound    ~200-300
7B                  Memory-bound    ~80-120
13B                 Memory-bound    ~45-60
70B                 Memory-bound    ~20-30
405B (quantized)    Memory-bound    ~5-10

Notice: tokens/sec falls roughly in inverse proportion to model size, which is classic memory-bound behavior.

Techniques for Reducing Memory Usage

The entire field of efficient LLM inference revolves around reducing memory footprint while maintaining quality.

1. Quantization

Quantization reduces the precision of model weights. INT8 uses 1 byte per parameter (2x reduction from FP16). INT4 uses 0.5 bytes (4x reduction). GPTQ and AWQ are popular quantization algorithms, and GGUF is a widely used file format (from llama.cpp) for storing quantized models.

# Weight-only INT4 quantization
Original FP16 weights: 14 GB for 7B model
INT4 quantized: 3.5 GB for 7B model
Memory reduction: 75%

# Quality impact (perplexity on WikiText-2)
FP16: 5.50
INT8: 5.52 (minimal loss)
INT4 (GPTQ): 5.58 (small loss)
INT4 (naive): 6.20 (significant loss)
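To show what weight-only INT4 quantization does mechanically, here is a minimal sketch of the naive round-to-nearest variant with one scale per group of weights. It is not GPTQ or AWQ, which choose the quantized values more carefully; the function names, the group size, and the 16-bit scale are assumptions for illustration.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric round-to-nearest INT4 with one floating-point scale per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        # Signed 4-bit codes cover the range -8..7
        codes = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize(groups):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]


def bits_per_weight(group_size, scale_bits=16, code_bits=4):
    """Effective storage cost once the per-group scale is included."""
    return code_bits + scale_bits / group_size


weights = [0.12, -0.5, 0.33, 0.9, -0.02, 0.07, -0.11, 0.05]
groups = quantize_int4(weights, group_size=4)
restored = dequantize(groups)
worst = max(abs(w - r) for w, r in zip(weights, restored))
print(f"worst-case error: {worst:.4f}")
print(f"bits per weight at group size 128: {bits_per_weight(128)}")
```

The per-group scales are why real INT4 checkpoints come out slightly larger than 0.5 bytes per parameter.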

2. PagedAttention (vLLM)

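vLLM's PagedAttention manages the KV cache like an operating system manages virtual memory. Instead of one contiguous allocation per request, the KV cache is divided into fixed-size pages (blocks) that are mapped through a per-request block table. This removes external fragmentation, limits internal fragmentation to the last partially filled page of each sequence, and allows efficient sharing of cached tokens across requests.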

Traditional KV cache: Pre-allocates max sequence length per request
- Wastes memory when requests have varying lengths
- 30-50% memory fragmentation is common

PagedAttention: Allocates pages on demand
- 5-10% fragmentation
- Supports prefix caching (common prompt reuse)
- 2-4x higher throughput in production
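The paging idea can be sketched with a toy allocator that hands out fixed-size blocks on demand and records them in a per-request block table. This is an illustration of the concept only: the class and method names are hypothetical and are not vLLM's API, and the sketch tracks block ownership rather than tensor data.

```python
class PagedKVCache:
    """Toy block allocator; tracks only block ownership, not tensor data."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.lengths = {}        # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        # Allocate a new block only when the last one is full (or none exists yet)
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots (internal fragmentation)."""
        allocated = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return allocated - sum(self.lengths.values())


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")
for _ in range(5):
    cache.append_token("b")
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), cache.wasted_slots())
cache.free("a")
print(len(cache.free_blocks))
```

Waste is bounded by one partially filled block per request, regardless of how long the request could have become.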

3. Continuous Batching

Traditional batching waits for all requests in a batch to finish generating. This wastes memory on idle requests. Continuous batching (or iteration-level batching) returns completed requests immediately and inserts new ones.

Continuous batching memory benefit:

Traditional: Batch of 32, all must finish together
- Memory reserved for all 32 until slowest finishes
- Average request waits for others

Continuous: New requests join after each iteration
- Memory used only for actively generating requests
- 2-5x higher throughput in heterogeneous workloads
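The difference between the two schedulers can be seen in a small simulation that counts decode iterations. The sketch assumes each request needs a known number of output tokens (at least one) and that every iteration produces one token for every running request; the function names are illustrative.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled at the next iteration."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; drop those that just finished
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))
print(continuous_batching_steps(lengths, batch_size=4))
```

With this mixed workload the static scheduler is held up twice by a 100-token request, while the continuous scheduler lets the second long request start as soon as a short one completes.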

4. Speculative Decoding

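Speculative decoding uses a small "draft" model (for example, 1B parameters) to generate several candidate tokens quickly. A larger "target" model (for example, 70B) then verifies all draft tokens in a single parallel forward pass and keeps the longest accepted prefix. This reduces the target-model weight reads per generated token.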

Standard autoregressive generation:
- Each token: load 70B weights (140 GB), compute 1 token
- Memory per token: 140 GB of weight reads

Speculative decoding (4 speculative tokens):
- Draft model (1B, 2 GB) generates 4 tokens, reading its weights 4 times (8 GB)
- Target model (70B, 140 GB) verifies all 4 in one forward pass
- Weight reads per token if all 4 are accepted: (140 GB + 8 GB) / 4 = 37 GB
- Up to ~3.8x reduction in memory bandwidth requirement; rejected draft tokens shrink the gain
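The benefit depends heavily on how often draft tokens are accepted. The sketch below estimates weight traffic per generated token under a simplifying assumption: each draft token is accepted independently with the same probability. It also counts the one token that the target pass always yields itself, which the simplified arithmetic above leaves out. The function names are illustrative.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_tokens):
    """Expected tokens emitted per target forward pass.

    Assumes each draft token is accepted independently with the same
    probability; the target pass itself always contributes one token.
    """
    a, k = acceptance_rate, draft_tokens
    if a >= 1.0:
        return k + 1.0
    return (1.0 - a ** (k + 1)) / (1.0 - a)


def weight_gb_read_per_token(target_gb, draft_gb, acceptance_rate, draft_tokens):
    """Weight bytes streamed per generated token, in GB."""
    read_per_round = target_gb + draft_tokens * draft_gb
    return read_per_round / expected_tokens_per_target_pass(acceptance_rate, draft_tokens)


for rate in (0.0, 0.5, 0.8, 1.0):
    gb = weight_gb_read_per_token(140, 2, rate, draft_tokens=4)
    print(f"acceptance {rate:.1f}: {gb:6.1f} GB of weight reads per token")
```

At an acceptance rate of zero the scheme reads more than plain decoding (148 GB versus 140 GB per token), so the draft model has to agree with the target often enough to pay for itself.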

5. Mixture of Experts (MoE)

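MoE models like Mixtral 8x7B have about 47B total parameters but only activate about 13B per token. Inactive experts can be offloaded to slower memory tiers, but because routing changes from token to token, most deployments keep every expert resident in HBM. What shrinks is the amount of weight data read and multiplied per token.

MoE memory advantage:

Mixtral 8x7B (47B total FP16 = 94 GB resident)
Active experts per token: 2 of 8 per layer (about 13B active parameters = 26 GB)
Gate network: small
Weights read per token: ~26 GB vs 94 GB for a dense model of the same total size

This is why MoE models achieve higher decode throughput than dense models with the same total parameter count.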
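The per-token saving erodes as the batch grows, because different tokens choose different experts. A rough estimate, assuming each token picks its experts uniformly and independently of other tokens (real routers are not uniform), is sketched below; the function name is illustrative.

```python
def expected_experts_touched(num_experts, top_k, batch_tokens):
    """Expected distinct experts used in one layer by a batch of tokens.

    Assumes uniform, independent routing: a given expert is missed by one
    token with probability 1 - top_k / num_experts.
    """
    miss_probability = (1.0 - top_k / num_experts) ** batch_tokens
    return num_experts * (1.0 - miss_probability)


for batch in (1, 4, 8, 32):
    touched = expected_experts_touched(num_experts=8, top_k=2, batch_tokens=batch)
    print(f"batch {batch:>2}: about {touched:.2f} of 8 experts read per layer")
```

Under this assumption a batch of 32 tokens already touches essentially all eight experts in every layer, so the weight traffic approaches that of reading the whole model.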

Memory Hierarchy: From HBM to CPU RAM to Disk

GPUs have a memory hierarchy. Understanding it allows strategic placement of model components.

NVIDIA H100 Memory Hierarchy (approximate figures):

Level             Capacity         Latency                Bandwidth
L1 cache          256 KB per SM    10-20 cycles           80 TB/s
L2 cache          50 MB shared     200-300 cycles         12 TB/s
HBM3              80 GB            500-800 cycles         3.35 TB/s
CPU RAM (PCIe)    Up to 2 TB       10-20 microseconds     64 GB/s
NVMe (disk)       2-8 TB           50-200 microseconds    7 GB/s

Latency ratio: L1 (1x) → HBM (50x) → PCIe (2,000x) → Disk (10,000x)

Layer-wise offloading keeps active layers in HBM and offloads others to CPU RAM. For a 70B model with 80 layers:

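# Layer-wise offloading schedule (sketch; the three helpers are placeholders)
NUM_LAYERS = 80

def load_to_hbm(layer):
    """Placeholder: copy this layer's weights from CPU RAM to HBM."""

def compute_layer(layer, batch_size, sequence_length):
    """Placeholder: run attention and feed-forward for this layer."""

def free_from_hbm(layer):
    """Placeholder: release the HBM copy; the CPU RAM copy stays in place."""

def offload_strategy(batch_size, sequence_length):
    load_to_hbm(0)
    for layer in range(NUM_LAYERS):
        # Prefetch the next layer so the PCIe copy can overlap with compute
        if layer + 1 < NUM_LAYERS:
            load_to_hbm(layer + 1)
        compute_layer(layer, batch_size, sequence_length)
        # Weights are read-only at inference, so nothing is copied back
        free_from_hbm(layer)

# HBM usage: ~2 layers (2 × 1.75 GB) + activations + KV cache = 10-15 GB
# CPU RAM usage: all 80 layers = 140 GB
# Fits a 70B model on a 24 GB GPU, but every token streams 140 GB over PCIe:
# 140 GB / 64 GB/s ≈ 2.2 s per token at batch size 1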

Unified memory on systems like NVIDIA Grace Hopper allows the CPU and GPU to share a coherent memory space over NVLink-C2C (up to 900 GB/s) instead of PCIe. This simplifies offloading and narrows the gap, but CPU memory is still several times slower than HBM.

Real-World Memory Optimization: Case Study

Consider serving Llama 3 70B (140 GB FP16) on 4×H100 (each 80 GB HBM, total 320 GB). Naive approach: split weights across 4 GPUs using tensor parallelism.

Tensor parallel (TP=4):
- Each GPU holds 35 GB of weights (140 GB / 4)
- KV cache also split (each GPU holds 1/4)
- Memory per GPU: 35 GB (weights) + ~20 GB (KV cache) = 55 GB

Result: Works comfortably. But what if you only have 2×H100 (160 GB total)? In FP16 the weights alone take 70 GB per GPU, leaving almost no room for the KV cache.

Optimizations needed:
1. Quantization to INT4: 140 GB → 35 GB total weights
2. Tensor parallel (TP=2): 17.5 GB per GPU
3. KV cache in INT8: reduces memory by 50% (80 GB total → 40 GB, or 20 GB per GPU)
4. PagedAttention to handle fragmentation

Final memory per GPU: 17.5 GB (weights) + 20 GB (KV cache) = 37.5 GB
Now fits comfortably in 80 GB H100s.
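The per-GPU budgets in this case study follow from an even split across tensor-parallel ranks. The sketch below checks the two FP16 starting points; the function names and the 10% headroom reserved for activations and allocator overhead are assumptions, and the 80 GB total KV cache is inferred from the roughly 20 GB per GPU in the 4-GPU setup.

```python
def per_gpu_memory_gb(weights_gb, kv_cache_gb, tensor_parallel):
    """Even split of weights and KV cache across tensor-parallel ranks."""
    return (weights_gb + kv_cache_gb) / tensor_parallel


def fits(weights_gb, kv_cache_gb, tensor_parallel, gpu_memory_gb=80, headroom=0.9):
    """True if the per-GPU share stays within the usable fraction of GPU memory."""
    needed = per_gpu_memory_gb(weights_gb, kv_cache_gb, tensor_parallel)
    return needed <= gpu_memory_gb * headroom


# FP16 weights (140 GB) with an 80 GB FP16 KV cache
for tp in (4, 2):
    needed = per_gpu_memory_gb(140, 80, tp)
    print(f"TP={tp}: {needed:.1f} GB per GPU, fits: {fits(140, 80, tp)}")
```

The 2-GPU FP16 configuration overshoots an 80 GB card by a wide margin, which is what motivates the quantization steps in the list above.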

The Future: Memory Innovations for LLMs

Several emerging technologies promise to break the memory wall:

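HBM4 (coming 2025-2026) is specified at roughly 2 TB/s of bandwidth and up to 64 GB per stack, so an accelerator with several stacks could go well beyond today's 3.35 TB/s and 80 GB. This directly increases inference throughput by reducing weight-loading time.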

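CXL (Compute Express Link) allows memory pooling across devices. A cluster of 8 GPUs could draw on a shared memory pool of 2 TB for capacity-hungry data such as an offloaded KV cache. CXL runs at PCIe-class bandwidth, far below HBM, so it complements tensor parallelism rather than replacing it.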

Processing-in-Memory (PIM) moves compute to the memory chips. The HBM itself performs matrix multiplication, dramatically reducing data movement. Early research shows 10-100x energy efficiency improvements.

Attention alternatives such as Mamba (a state-space model), RWKV (a recurrent architecture), and Hyena (long convolutions) aim to replace quadratic attention with linear or sub-quadratic computation. The recurrent designs also replace the O(n) KV cache with a fixed-size state.

Memory scaling by architecture (sequence length 1M, 70B-class model as above):

Transformer with Flash Attention: O(n) KV cache = ~330 GB per sequence
Transformer without Flash Attention: the same KV cache plus an O(n²) score matrix = impractical (2 TB per head per layer in FP16)
State-space model (Mamba): no KV cache; the state has a fixed size regardless of n

Recurrent architectures have O(1) memory for context.

Final Thoughts

Memory systems are the hidden determinant of what LLMs can actually do. A model's parameter count is meaningless if you cannot fit it into memory. A long context window is useless if the KV cache exhausts your GPUs. A fast inference speed is irrelevant if memory bandwidth limits you to single-digit tokens per second.

The most important metric for LLM deployment is not FLOPs or parameter count. It is memory bandwidth per parameter: how quickly you can move weights through the system. Everything else (quantization, pruning, distillation, MoE, speculative decoding) is a technique for reducing the memory footprint or hiding memory latency.

As models grow to 1 trillion parameters and context windows to millions of tokens, memory will remain the central challenge. The researchers and engineers who understand memory — not just algorithms — will build the systems that actually work at scale.

Next time you marvel at a model's capabilities, ask: How much memory does it need? What is the KV cache size at full context? What precision are the weights? The answers reveal the true engineering behind the magic.
