vLLM Deep Dive Part 2: PagedAttention - The Core Innovation

Introduction

PagedAttention is vLLM's breakthrough innovation, and it fundamentally changed how large language models are served. Before PagedAttention, efficient LLM serving was plagued by memory fragmentation and waste. This post takes a deep look at how PagedAttention works, why it matters, and how it is implemented in vLLM.

The Memory Problem in LLM Serving

Understanding the KV Cache

In transformer models, the attention layers compute key (K) and value (V) vectors for every token. During generation:

  1. Prefill phase: compute K and V for all prompt tokens
  2. Decode phase: for each new token, compute its K and V, then attend over all previous K and V

The challenge: we must store all previous K and V tensors (the "KV cache") to avoid recomputation.

Memory footprint: take Llama-3-8B as an example:

  • 32 attention layers
  • hidden dimension of 4096
  • FP16 precision (2 bytes)
  • Per token: 32 layers × 4096 dims × 2 (K and V) × 2 bytes = 512 KB

For a 2048-token sequence: the KV cache alone needs 1 GB!
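The arithmetic above can be checked in a few lines of Python. Note this assumes full multi-head attention; models that use GQA (fewer KV heads than query heads) shrink the cache proportionally:

```python
# KV cache sizing for the Llama-3-8B numbers above (full MHA assumed).
num_layers = 32
hidden_dim = 4096
bytes_per_elem = 2  # FP16

# K and V each store hidden_dim elements per layer, per token
bytes_per_token = num_layers * hidden_dim * 2 * bytes_per_elem
print(bytes_per_token // 1024, "KB per token")  # → 512 KB per token

seq_len = 2048
total_gb = bytes_per_token * seq_len / 1024**3
print(f"{total_gb:.0f} GB for a {seq_len}-token sequence")  # → 1 GB
```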

The Fragmentation Problem

Traditional serving systems pre-allocate contiguous memory for each request's maximum length:

```
Request 1 (max 2048 tokens, actual 512):
[████░░░░░░░░░░░░] 75% wasted

Request 2 (max 2048 tokens, actual 1800):
[████████████████] 12% wasted

Request 3 (max 2048 tokens, actual 200):
[██░░░░░░░░░░░░░░] 90% wasted
```

The problems

  1. Over-allocation: memory must be reserved for the worst-case length
  2. Fragmentation: one request's unused space cannot be given to another
  3. Low throughput: memory, not compute, becomes the bottleneck

Real-world impact: without PagedAttention you might serve only 10 concurrent requests; with it, the same memory can serve 50+!

PagedAttention's Solution

PagedAttention applies the idea of virtual memory to attention computation:

Core idea: instead of storing the KV cache contiguously, split it into fixed-size blocks. Each request gets a block table that maps logical positions to physical blocks.

The Virtual Memory Analogy

| Virtual memory (OS) | PagedAttention (vLLM)  |
|---------------------|------------------------|
| Page                | block (e.g., 16 tokens)|
| Page table          | block table            |
| Physical memory     | GPU memory             |
| Memory allocator    | block pool             |
| Page fault          | cache miss             |
| Copy-on-write       | prefix sharing         |

Block-based Storage

Instead of contiguous allocation:

```
Request 1 (512 tokens, 32 blocks):
Blocks: [0, 3, 7, 12, ..., 89]
Scattered in GPU memory

Request 2 (1800 tokens, 113 blocks):
Blocks: [1, 2, 4, 5, ..., 105]
Can reuse freed blocks from other requests

Request 3 (200 tokens, 13 blocks):
Blocks: [6, 8, 9, ..., 18]
Minimal waste (only last block partially filled)
```

Advantages

  1. No over-allocation: blocks are allocated on demand
  2. No fragmentation: any free block can serve any request
  3. Near-zero waste: only each request's last block may be partially filled
  4. Sharing: common prefixes can share physical blocks
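To make the difference concrete, here is a quick sketch using the three example requests from the diagram above, measuring waste in token slots and assuming a block_size of 16:

```python
# Wasted token slots: contiguous preallocation vs. paged allocation.
block_size = 16
max_len = 2048
actual_lens = [512, 1800, 200]  # the three example requests

# Contiguous: every request reserves max_len slots
contiguous_waste = sum(max_len - n for n in actual_lens)

# Paged: only the unused tail of each request's last block is wasted
paged_waste = sum((-n) % block_size for n in actual_lens)

print(contiguous_waste)  # → 3632 wasted slots
print(paged_waste)       # → 16 wasted slots
```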

Implementation Deep Dive

Block Structure

File: vllm/v1/core/kv_cache_utils.py

```python
class KVCacheBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id            # Physical block ID
        self.ref_cnt = 0                    # Reference count
        self.block_hash: int | None = None  # Hash for prefix caching
        self.is_null = False                # Special null block
```

Each block represents a fixed number of token slots (block_size, typically 16).

Physical Storage

```python
# In GPU memory, shape: [num_blocks, 2, num_heads, block_size, head_dim]
# (the dim of size 2 stores K and V together)
kv_cache = torch.empty(
    (num_blocks, 2, num_heads, block_size, head_dim),
    dtype=torch.float16,
    device='cuda'
)
```

Block Pool

File: vllm/v1/core/block_pool.py

The BlockPool manages all KV cache blocks:

```python
class BlockPool:
    def __init__(self, num_gpu_blocks: int, enable_caching: bool, ...):
        # All blocks in the pool
        self.blocks: list[KVCacheBlock] = [
            KVCacheBlock(idx) for idx in range(num_gpu_blocks)
        ]

        # Free block queue (doubly-linked list)
        self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)

        # Cache: block_hash -> KVCacheBlock
        self.cached_block_hash_to_block: BlockHashToBlockMap = {}
```

The free block queue implements an LRU eviction policy for cached blocks:

```
free_block_queue:
┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
│ 42  │◄──►│ 17  │◄──►│ 99  │◄──►│  3  │
└─────┘     └─────┘     └─────┘     └─────┘
   ▲                                   ▲
   │                                   │
LRU (evict first)           Most recently freed
```
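The real FreeKVCacheBlockQueue lives in vllm/v1/core/kv_cache_utils.py; the sketch below is a simplified stand-in (the names Node and FreeQueue are mine, not vLLM's) showing why a doubly-linked list is the right structure here: O(1) popleft for allocation, O(1) append on free, and O(1) removal from the middle when a queued cached block gets a prefix-cache hit and must be pulled back into use:

```python
# Minimal doubly-linked free queue sketch (not vLLM's actual implementation).
class Node:
    __slots__ = ("block_id", "prev", "next")
    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None

class FreeQueue:
    def __init__(self):
        self.head = None  # LRU end: allocate/evict from here
        self.tail = None  # most recently freed
        self._len = 0

    def __len__(self):
        return self._len

    def append(self, node):
        # O(1): push onto the tail (most recently freed)
        node.prev, node.next = self.tail, None
        if self.tail:
            self.tail.next = node
        else:
            self.head = node
        self.tail = node
        self._len += 1

    def popleft(self):
        # O(1): pop the LRU block from the head
        node = self.head
        self.remove(node)
        return node

    def remove(self, node):
        # O(1): unlink from anywhere in the list
        if node.prev: node.prev.next = node.next
        else: self.head = node.next
        if node.next: node.next.prev = node.prev
        else: self.tail = node.prev
        node.prev = node.next = None
        self._len -= 1

q = FreeQueue()
nodes = [Node(i) for i in (42, 17, 99, 3)]
for n in nodes:
    q.append(n)
q.remove(nodes[2])             # prefix-cache hit on block 99: pull it mid-queue
print(q.popleft().block_id)    # → 42 (the LRU end)
```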

Allocation

```python
def allocate(self) -> KVCacheBlock:
    """Allocate a block from the free queue."""
    if len(self.free_block_queue) == 0:
        # No free blocks - need to evict a cached block
        block = self._evict_cached_block()
    else:
        # Pop from the free queue
        block = self.free_block_queue.popleft()

    block.ref_cnt += 1
    return block
```

Freeing

```python
def free(self, block: KVCacheBlock) -> None:
    """Free a block back to the pool."""
    block.ref_cnt -= 1

    if block.ref_cnt == 0:
        if block.block_hash is not None:
            # Cached block - add to the eviction queue
            self.free_block_queue.append(block)
        else:
            # Uncached block - immediately available
            self.free_block_queue.appendleft(block)
```

The Block Table

Each request maintains a block table that maps logical blocks to physical blocks:

```python
# Example: request with 50 tokens, block_size=16
# Needs 4 blocks: [0-15], [16-31], [32-47], [48-49]

block_table = [
    [7],    # Group 0: Logical block 0 → Physical block 7
    [23],   # Group 0: Logical block 1 → Physical block 23
    [102],  # Group 0: Logical block 2 → Physical block 102
    [45]    # Group 0: Logical block 3 → Physical block 45 (partial)
]
```

Lookup During Attention

```python
def get_physical_block(logical_block_idx: int) -> int:
    return block_table[logical_block_idx][0]  # [0] for the first group

# Token 37 is in logical block 2 (37 // 16)
# Physical block: block_table[2][0] = 102
# Position in block: 37 % 16 = 5
```

KVCacheManager

File: vllm/v1/core/kv_cache_manager.py

The KVCacheManager coordinates block allocation for requests:

```python
class KVCacheManager:
    def __init__(self, kv_cache_config, max_model_len, ...):
        self.block_pool = BlockPool(...)
        self.enable_caching = enable_caching

    def allocate_slots(
        self,
        request: Request,
        num_tokens: int,
        num_computed_tokens: int = 0
    ) -> tuple[KVCacheBlocks, int]:
        """Allocate KV cache blocks for a request."""

        # 1. Check the prefix cache for hits
        cached_blocks, num_cached_tokens = \
            self.get_computed_blocks(request)

        # 2. Calculate how many new blocks are needed
        num_new_tokens = num_tokens - num_cached_tokens
        num_new_blocks = (num_new_tokens + block_size - 1) // block_size

        # 3. Allocate the new blocks
        new_blocks = [
            self.block_pool.allocate()
            for _ in range(num_new_blocks)
        ]

        # 4. Combine cached + new blocks
        total_blocks = cached_blocks + new_blocks

        return total_blocks, num_cached_tokens
```

Prefix Caching: Sharing Across Requests

One of PagedAttention's most powerful features is prefix caching: sharing KV cache blocks between requests that have a common prefix.

Block Hashing

Each full block gets a hash derived from its token content:

```python
def compute_block_hash(
    token_ids: list[int],
    start: int,
    block_size: int,
    extra_keys: tuple = ()
) -> int:
    """Hash a block of tokens."""
    block_tokens = tuple(token_ids[start:start + block_size])

    # extra_keys carries multimodal features, the cache salt, etc.
    return hash((block_tokens, extra_keys))
```

Example

```
Prompt A: "Translate to French: Hello world"
Tokens:  [1, 4521, 284, 1515, 25, 23748, 995]
Block 0: hash([1, 4521, 284, 1515, 25, 23748, 995, ...])

Prompt B: "Translate to French: How are you"
Tokens:  [1, 4521, 284, 1515, 25, 1374, 389, 345]
Block 0: hash([1, 4521, 284, 1515, 25, 1374, 389, ...])

# Different hashes - no sharing
```

But with a shared system prompt:

```
System: "You are a helpful assistant."
Tokens: [1, 2, 3, ..., 50]  # 50 tokens

Request A: System + "What is AI?"
Request B: System + "What is ML?"

# Both share blocks 0-2 (first 48 tokens of the system prompt)
# Only block 3 onward differs
```
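A small illustration of this (a hypothetical helper, not vLLM code): only complete blocks whose tokens match exactly are eligible for sharing, which is why a 50-token common prefix yields 3 shared blocks at block_size=16:

```python
# How many full blocks can two requests share? Only complete,
# exactly-matching prefix blocks count; a partial block never does.
def shared_full_blocks(tokens_a, tokens_b, block_size=16):
    common = 0
    for x, y in zip(tokens_a, tokens_b):
        if x != y:
            break
        common += 1
    return common // block_size  # the partial last block is never shared

system = list(range(50))                 # 50 shared system-prompt tokens
req_a = system + [100, 101, 102]         # "What is AI?"
req_b = system + [200, 201]              # "What is ML?"
print(shared_full_blocks(req_a, req_b))  # → 3 (blocks 0-2, first 48 tokens)
```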

Cache Lookup

When a new request arrives:

```python
def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
    """Find cached blocks matching the request's prefix."""

    cached_blocks = []
    num_cached_tokens = 0

    for block_idx, block_hash in enumerate(request.block_hashes):
        # Try to find a cached block with this hash
        cached_block = self.block_pool.get_cached_block(block_hash)

        if cached_block is None:
            # Cache miss - stop searching
            break

        # Cache hit!
        cached_blocks.append(cached_block)
        cached_block.ref_cnt += 1  # Increment the reference count
        num_cached_tokens += block_size

    return cached_blocks, num_cached_tokens
```

Reference Counting

Blocks use reference counting for safe sharing:

```python
# Request A uses blocks [7, 23, 102]
block_7.ref_cnt = 1
block_23.ref_cnt = 1
block_102.ref_cnt = 1

# Request B shares block 7 (same prefix)
block_7.ref_cnt = 2    # Now shared!

# Request A finishes - decrement refs
block_7.ref_cnt = 1    # Still in use by B
block_23.ref_cnt = 0   # Can be freed
block_102.ref_cnt = 0  # Can be freed

# Request B finishes
block_7.ref_cnt = 0    # Now can be freed
```

Safety: a block is freed only when its ref_cnt reaches 0.

Cache Eviction

When the cache is full:

```python
def evict_cached_block(self) -> KVCacheBlock:
    """Evict the LRU cached block with ref_cnt == 0."""

    # Scan the free queue from the left (LRU end)
    for block in self.free_block_queue:
        if block.ref_cnt == 0:
            # Found an eviction candidate
            self.free_block_queue.remove(block)

            # Remove it from the cache lookup
            del self.cached_block_hash_to_block[block.block_hash]
            block.block_hash = None

            return block

    raise OutOfMemoryError("No evictable blocks")
```

Attention Computation with Blocks

Prefill Phase

During prefill, we compute attention over the entire prompt:

```python
def paged_attention_prefill(
    query: Tensor,         # [batch, seq_len, num_heads, head_dim]
    kv_cache: Tensor,      # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor,  # [batch, max_num_blocks]
    context_lens: Tensor   # [batch]
) -> Tensor:
    """
    Compute prefill attention using the paged KV cache.
    """
    batch_size, seq_len, num_heads, head_dim = query.shape
    output = torch.empty_like(query)

    for b in range(batch_size):
        context_len = int(context_lens[b])
        num_blocks = (context_len + block_size - 1) // block_size

        # Gather K, V from the blocks
        k_cache, v_cache = [], []
        for block_idx in range(num_blocks):
            physical_block = block_tables[b, block_idx]
            k, v = kv_cache[physical_block]  # each [num_heads, block_size, head_dim]
            k_cache.append(k)
            v_cache.append(v)

        # Concatenate blocks along the token dim, trimming the partial last block
        k = torch.cat(k_cache, dim=1)[:, :context_len]  # [num_heads, context_len, head_dim]
        v = torch.cat(v_cache, dim=1)[:, :context_len]

        # Standard attention (head-major layout for the batched matmul)
        q = query[b].transpose(0, 1)                     # [num_heads, seq_len, head_dim]
        scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
        attn = scores.softmax(dim=-1)
        output[b] = (attn @ v).transpose(0, 1)           # [seq_len, num_heads, head_dim]

    return output
```

In practice, vLLM uses optimized kernels (FlashAttention, FlashInfer) that operate directly on the block table.

Decode Phase

During decode, each request generates one token:

```python
def paged_attention_decode(
    query: Tensor,         # [batch, 1, num_heads, head_dim]
    kv_cache: Tensor,      # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor,  # [batch, max_num_blocks]
    context_lens: Tensor   # [batch]
) -> Tensor:
    """
    Optimized decode attention - each query attends to the full context.
    """

    # Use a fused kernel that indexes directly into the blocks
    return flash_attn_decode(
        q=query,
        k_cache=kv_cache[:, 0],  # K cache
        v_cache=kv_cache[:, 1],  # V cache
        block_table=block_tables,
        seq_lens=context_lens,
        block_size=block_size
    )
```

The decode kernels are heavily optimized:

  • Fused block gathering and attention computation
  • Warp-level parallelism
  • Shared-memory optimizations

Multiple KV Cache Groups

vLLM supports models whose layers have different cache requirements (MQA, GQA, Mamba):

```python
kv_cache_config = KVCacheConfig(
    kv_cache_groups=[
        KVCacheGroup(  # Group 0: attention layers
            kv_cache_spec=AttentionSpec(
                num_heads=32,
                head_dim=128,
                block_size=16
            )
        ),
        KVCacheGroup(  # Group 1: Mamba layers
            kv_cache_spec=MambaSpec(
                state_size=128,
                block_size=16
            )
        )
    ]
)
```

Each request gets blocks from every group:

```python
# Request with 50 tokens
blocks = [
    [7, 23, 102, 45],  # Group 0: attention cache
    [8, 24, 103, 46]   # Group 1: Mamba cache
]
```

Performance Impact

Memory Efficiency

Before PagedAttention (contiguous allocation):

```
10 requests × 2048 max tokens × 512 KB/token = 10.5 GB
Actual usage: 10 requests × 500 avg tokens × 512 KB/token ≈ 2.5 GB
Efficiency: 24%
```

After PagedAttention (block allocation):

```
10 requests × 500 avg tokens = 5,000 tokens → ~313 blocks (block_size = 16)
Waste: at most 10 partially filled last blocks (~150 token slots)
Efficiency: ~98%
```

Result: 4x more concurrent requests!

Throughput Gains

Real-world benchmark (Llama-3-8B on an H100):

| Metric              | Without PagedAttention | With PagedAttention | Improvement |
|---------------------|------------------------|---------------------|-------------|
| Concurrent requests | 12                     | 64                  | 5.3x        |
| Throughput (tok/s)  | 1,500                  | 8,000               | 5.3x        |
| Memory usage        | 60 GB                  | 60 GB               | same        |
| Latency (TTFT)      | 45 ms                  | 42 ms               | -7%         |

Prefix Caching Benefits

With a shared 500-token system prompt:

| Request | Without caching | With caching | Speedup |
|---------|-----------------|--------------|---------|
| 1st     | 10 ms (prefill) | 10 ms        | 1x      |
| 2nd     | 10 ms           | 1 ms         | 10x     |
| 100th   | 10 ms           | 1 ms         | 10x     |

Cache hit rate: typically 60-80% for chatbots with a system prompt.
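Combining the hit rate with the illustrative 10 ms / 1 ms figures from the table above, the expected prefill cost is a simple weighted average (hypothetical helper, for intuition only):

```python
# Expected prefill latency under a given prefix-cache hit rate.
def expected_prefill_ms(hit_rate, full_ms=10.0, cached_ms=1.0):
    return hit_rate * cached_ms + (1 - hit_rate) * full_ms

for rate in (0.0, 0.6, 0.8):
    print(f"{rate:.0%} hit rate: {expected_prefill_ms(rate):.1f} ms")
```

At a 60-80% hit rate this already cuts average prefill time by roughly 2-3.5x, which is why shared system prompts benefit so much.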

Advanced Topics

Sliding Window Attention

For models like Mistral that use a sliding window:

```python
# Only keep the last 4096 tokens in the cache
if num_tokens > sliding_window_size:
    # Free the old blocks that fall outside the window
    tokens_to_free = num_tokens - sliding_window_size
    old_blocks = blocks[:tokens_to_free // block_size]
    for block in old_blocks:
        self.block_pool.free(block)
```

Copy-on-Write for Speculative Decoding

With speculative decoding:

```python
# The draft model proposes N tokens
draft_tokens = draft_model.generate(n=5)

# Share blocks with the draft (copy-on-write)
draft_blocks = request.blocks    # Same physical blocks!
draft_blocks[-1].ref_cnt += 1    # Increment the last block

# Verify with the main model
verified = main_model.verify(draft_tokens)

if not all_verified:
    # Rollback - free the draft blocks
    for block in draft_blocks[verified_idx:]:
        block.ref_cnt -= 1
```

Implementation Caveats

Hash Collisions

Block hashes can collide:

```python
# Different token sequences can (rarely!) hash to the same value:
#   hash([1, 2, 3, ..., 16]) == 0x123456
#   hash([4, 5, 6, ..., 19]) == 0x123456

# Solution: map each hash to a list of candidate blocks
cached_blocks[block_hash] = [block_a, block_b]
```

Partial Blocks

The last block is usually partially filled:

```python
# 50 tokens with block_size=16
# Block 3 holds only 2 tokens (50 % 16 = 2)

# Must track: num_tokens_in_last_block
# For attention: mask out the unused positions
```
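A minimal sketch of that masking, under the usual convention that padded positions get a score of -inf before the softmax so they receive zero attention weight:

```python
# Mask the unused tail of a partial last block.
block_size = 16
num_tokens = 50
num_blocks = -(-num_tokens // block_size)  # ceil division → 4 blocks
padded_len = num_blocks * block_size       # 64 token slots in total

# 0.0 for real positions, -inf for the padded tail of the last block;
# adding this to the attention scores zeroes out padded slots after softmax.
mask = [0.0 if i < num_tokens else float("-inf") for i in range(padded_len)]
print(mask.count(float("-inf")))  # → 14 masked slots
```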

Thread Safety

Block allocation must be thread-safe:

```python
class BlockPool:
    def allocate(self):
        with self.lock:  # Critical section
            return self._allocate_unsafe()
```

Key Takeaways

  1. PagedAttention splits the KV cache into fixed-size blocks, eliminating fragmentation and over-allocation

  2. The block table maps logical blocks to physical blocks, much like a virtual memory page table

  3. Prefix caching shares blocks across requests, dramatically reducing redundant computation

  4. Reference counting makes sharing safe and prevents premature frees

  5. Near-zero memory waste translates into 4-5x throughput gains in practice

  6. The attention kernels are optimized to work directly on block-indexed storage

What's Next

In Part 3, we'll explore the Scheduler: vLLM's core scheduling component, which decides which requests to process, when to preempt, and how to maximize GPU utilization through continuous batching.

PagedAttention is the foundation of vLLM's outstanding performance. In the next post, we'll see how the Scheduler orchestrates request execution on top of it.