vLLM Deep Dive Part 2: PagedAttention - The Core Innovation

Introduction

PagedAttention is vLLM's breakthrough innovation, and it fundamentally changed how large language models are served. Before PagedAttention, efficient LLM serving was plagued by memory fragmentation and waste. This post takes a deep look at how PagedAttention works, why it matters, and how it is implemented in vLLM.

The Memory Problem in LLM Serving

Understanding the KV Cache

In transformer models, the attention layers compute key (K) and value (V) vectors for every token. During generation:

  1. Prefill phase: compute K and V for all prompt tokens
  2. Decode phase: for each new token, compute its K and V, then attend over all previous K and V

The challenge: we must store all previous K and V tensors (the "KV cache") to avoid recomputation.

Memory footprint: take Llama-3-8B as an example:

  • 32 attention layers
  • hidden dimension of 4096
  • FP16 precision (2 bytes)
  • Per token: 32 layers × 4096 dims × 2 (K and V) × 2 bytes = 512 KB

For a 2048-token sequence: the KV cache alone needs 1 GB!
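The arithmetic above can be checked in a few lines of Python. Note this assumes full multi-head attention; models that use GQA (fewer KV heads than query heads) shrink the cache proportionally:

```python
# KV cache sizing for the Llama-3-8B numbers above (full MHA assumed).
num_layers = 32
hidden_dim = 4096
bytes_per_elem = 2  # FP16

# K and V each store hidden_dim elements per layer, per token
bytes_per_token = num_layers * hidden_dim * 2 * bytes_per_elem
print(bytes_per_token // 1024, "KB per token")  # → 512 KB per token

seq_len = 2048
total_gb = bytes_per_token * seq_len / 1024**3
print(f"{total_gb:.0f} GB for a {seq_len}-token sequence")  # → 1 GB
```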

The Fragmentation Problem

Traditional serving systems pre-allocate contiguous memory for each request's maximum length:

```
Request 1 (max 2048 tokens, actual 512):
[████░░░░░░░░░░░░] 75% wasted

Request 2 (max 2048 tokens, actual 1800):
[████████████████] 12% wasted

Request 3 (max 2048 tokens, actual 200):
[██░░░░░░░░░░░░░░] 90% wasted
```

The problems

  1. Over-allocation: memory must be reserved for the worst-case length
  2. Fragmentation: one request's unused space cannot be given to another
  3. Low throughput: memory, not compute, becomes the bottleneck

Real-world impact: without PagedAttention you might serve only 10 concurrent requests; with it, the same memory can serve 50+!

PagedAttention's Solution

PagedAttention applies the idea of virtual memory to attention computation:

Core idea: instead of storing the KV cache contiguously, split it into fixed-size blocks. Each request gets a block table that maps logical positions to physical blocks.

The Virtual Memory Analogy

| Virtual memory (OS) | PagedAttention (vLLM)  |
|---------------------|------------------------|
| Page                | block (e.g., 16 tokens)|
| Page table          | block table            |
| Physical memory     | GPU memory             |
| Memory allocator    | block pool             |
| Page fault          | cache miss             |
| Copy-on-write       | prefix sharing         |

Block-based Storage

Instead of contiguous allocation:

```
Request 1 (512 tokens, 32 blocks):
Blocks: [0, 3, 7, 12, ..., 89]
Scattered in GPU memory

Request 2 (1800 tokens, 113 blocks):
Blocks: [1, 2, 4, 5, ..., 105]
Can reuse freed blocks from other requests

Request 3 (200 tokens, 13 blocks):
Blocks: [6, 8, 9, ..., 18]
Minimal waste (only last block partially filled)
```

Advantages

  1. No over-allocation: blocks are allocated on demand
  2. No fragmentation: any free block can serve any request
  3. Near-zero waste: only each request's last block may be partially filled
  4. Sharing: common prefixes can share physical blocks
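To make the difference concrete, here is a quick sketch using the three example requests from the diagram above, measuring waste in token slots and assuming a block_size of 16:

```python
# Wasted token slots: contiguous preallocation vs. paged allocation.
block_size = 16
max_len = 2048
actual_lens = [512, 1800, 200]  # the three example requests

# Contiguous: every request reserves max_len slots
contiguous_waste = sum(max_len - n for n in actual_lens)

# Paged: only the unused tail of each request's last block is wasted
paged_waste = sum((-n) % block_size for n in actual_lens)

print(contiguous_waste)  # → 3632 wasted slots
print(paged_waste)       # → 16 wasted slots
```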

Implementation Deep Dive

Block Structure

File: vllm/v1/core/kv_cache_utils.py

```python
class KVCacheBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id            # Physical block ID
        self.ref_cnt = 0                    # Reference count
        self.block_hash: int | None = None  # Hash for prefix caching
        self.is_null = False                # Special null block
```

Each block represents a fixed number of token slots (block_size, typically 16).

Physical Storage

```python
# In GPU memory, shape: [num_blocks, 2, num_heads, block_size, head_dim]
# (the dim of size 2 stores K and V together)
kv_cache = torch.empty(
    (num_blocks, 2, num_heads, block_size, head_dim),
    dtype=torch.float16,
    device='cuda'
)
```

Block Pool

File: vllm/v1/core/block_pool.py

The BlockPool manages all KV cache blocks:

```python
class BlockPool:
    def __init__(self, num_gpu_blocks: int, enable_caching: bool, ...):
        # All blocks in the pool
        self.blocks: list[KVCacheBlock] = [
            KVCacheBlock(idx) for idx in range(num_gpu_blocks)
        ]

        # Free block queue (doubly-linked list)
        self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)

        # Cache: block_hash -> KVCacheBlock
        self.cached_block_hash_to_block: BlockHashToBlockMap = {}
```

The free block queue implements an LRU eviction policy for cached blocks:

```
free_block_queue:
┌─────┐     ┌─────┐     ┌─────┐     ┌─────┐
│ 42  │◄──►│ 17  │◄──►│ 99  │◄──►│  3  │
└─────┘     └─────┘     └─────┘     └─────┘
   ▲                                   ▲
   │                                   │
LRU (evict first)           Most recently freed
```
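The real FreeKVCacheBlockQueue lives in vllm/v1/core/kv_cache_utils.py; the sketch below is a simplified stand-in (the names Node and FreeQueue are mine, not vLLM's) showing why a doubly-linked list is the right structure here: O(1) popleft for allocation, O(1) append on free, and O(1) removal from the middle when a queued cached block gets a prefix-cache hit and must be pulled back into use:

```python
# Minimal doubly-linked free queue sketch (not vLLM's actual implementation).
class Node:
    __slots__ = ("block_id", "prev", "next")
    def __init__(self, block_id):
        self.block_id = block_id
        self.prev = None
        self.next = None

class FreeQueue:
    def __init__(self):
        self.head = None  # LRU end: allocate/evict from here
        self.tail = None  # most recently freed
        self._len = 0

    def __len__(self):
        return self._len

    def append(self, node):
        # O(1): push onto the tail (most recently freed)
        node.prev, node.next = self.tail, None
        if self.tail:
            self.tail.next = node
        else:
            self.head = node
        self.tail = node
        self._len += 1

    def popleft(self):
        # O(1): pop the LRU block from the head
        node = self.head
        self.remove(node)
        return node

    def remove(self, node):
        # O(1): unlink from anywhere in the list
        if node.prev: node.prev.next = node.next
        else: self.head = node.next
        if node.next: node.next.prev = node.prev
        else: self.tail = node.prev
        node.prev = node.next = None
        self._len -= 1

q = FreeQueue()
nodes = [Node(i) for i in (42, 17, 99, 3)]
for n in nodes:
    q.append(n)
q.remove(nodes[2])             # prefix-cache hit on block 99: pull it mid-queue
print(q.popleft().block_id)    # → 42 (the LRU end)
```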

Allocation

```python
def allocate(self) -> KVCacheBlock:
    """Allocate a block from the free queue."""
    if len(self.free_block_queue) == 0:
        # No free blocks - need to evict a cached block
        block = self._evict_cached_block()
    else:
        # Pop from the free queue
        block = self.free_block_queue.popleft()

    block.ref_cnt += 1
    return block
```

Freeing

```python
def free(self, block: KVCacheBlock) -> None:
    """Free a block back to the pool."""
    block.ref_cnt -= 1

    if block.ref_cnt == 0:
        if block.block_hash is not None:
            # Cached block - add to the eviction queue
            self.free_block_queue.append(block)
        else:
            # Uncached block - immediately available
            self.free_block_queue.appendleft(block)
```

The Block Table

Each request maintains a block table that maps logical blocks to physical blocks:

```python
# Example: request with 50 tokens, block_size=16
# Needs 4 blocks: [0-15], [16-31], [32-47], [48-49]

block_table = [
    [7],    # Group 0: Logical block 0 → Physical block 7
    [23],   # Group 0: Logical block 1 → Physical block 23
    [102],  # Group 0: Logical block 2 → Physical block 102
    [45]    # Group 0: Logical block 3 → Physical block 45 (partial)
]
```

Lookup During Attention

```python
def get_physical_block(logical_block_idx: int) -> int:
    return block_table[logical_block_idx][0]  # [0] for the first group

# Token 37 is in logical block 2 (37 // 16)
# Physical block: block_table[2][0] = 102
# Position in block: 37 % 16 = 5
```

KVCacheManager

File: vllm/v1/core/kv_cache_manager.py

The KVCacheManager coordinates block allocation for requests:

```python
class KVCacheManager:
    def __init__(self, kv_cache_config, max_model_len, ...):
        self.block_pool = BlockPool(...)
        self.enable_caching = enable_caching

    def allocate_slots(
        self,
        request: Request,
        num_tokens: int,
        num_computed_tokens: int = 0
    ) -> tuple[KVCacheBlocks, int]:
        """Allocate KV cache blocks for a request."""

        # 1. Check the prefix cache for hits
        cached_blocks, num_cached_tokens = \
            self.get_computed_blocks(request)

        # 2. Calculate how many new blocks are needed
        num_new_tokens = num_tokens - num_cached_tokens
        num_new_blocks = (num_new_tokens + block_size - 1) // block_size

        # 3. Allocate the new blocks
        new_blocks = [
            self.block_pool.allocate()
            for _ in range(num_new_blocks)
        ]

        # 4. Combine cached + new blocks
        total_blocks = cached_blocks + new_blocks

        return total_blocks, num_cached_tokens
```

Prefix Caching: Sharing Across Requests

One of PagedAttention's most powerful features is prefix caching: sharing KV cache blocks between requests that have a common prefix.

Block Hashing

Each full block gets a hash derived from its token content:

```python
def compute_block_hash(
    token_ids: list[int],
    start: int,
    block_size: int,
    extra_keys: tuple = ()
) -> int:
    """Hash a block of tokens."""
    block_tokens = tuple(token_ids[start:start + block_size])

    # extra_keys carries multimodal features, the cache salt, etc.
    return hash((block_tokens, extra_keys))
```

Example

```
Prompt A: "Translate to French: Hello world"
Tokens:  [1, 4521, 284, 1515, 25, 23748, 995]
Block 0: hash([1, 4521, 284, 1515, 25, 23748, 995, ...])

Prompt B: "Translate to French: How are you"
Tokens:  [1, 4521, 284, 1515, 25, 1374, 389, 345]
Block 0: hash([1, 4521, 284, 1515, 25, 1374, 389, ...])

# Different hashes - no sharing
```

But with a shared system prompt:

```
System: "You are a helpful assistant."
Tokens: [1, 2, 3, ..., 50]  # 50 tokens

Request A: System + "What is AI?"
Request B: System + "What is ML?"

# Both share blocks 0-2 (first 48 tokens of the system prompt)
# Only block 3 onward differs
```
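A small illustration of this (a hypothetical helper, not vLLM code): only complete blocks whose tokens match exactly are eligible for sharing, which is why a 50-token common prefix yields 3 shared blocks at block_size=16:

```python
# How many full blocks can two requests share? Only complete,
# exactly-matching prefix blocks count; a partial block never does.
def shared_full_blocks(tokens_a, tokens_b, block_size=16):
    common = 0
    for x, y in zip(tokens_a, tokens_b):
        if x != y:
            break
        common += 1
    return common // block_size  # the partial last block is never shared

system = list(range(50))                 # 50 shared system-prompt tokens
req_a = system + [100, 101, 102]         # "What is AI?"
req_b = system + [200, 201]              # "What is ML?"
print(shared_full_blocks(req_a, req_b))  # → 3 (blocks 0-2, first 48 tokens)
```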

Cache Lookup

When a new request arrives:

```python
def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
    """Find cached blocks matching the request's prefix."""

    cached_blocks = []
    num_cached_tokens = 0

    for block_idx, block_hash in enumerate(request.block_hashes):
        # Try to find a cached block with this hash
        cached_block = self.block_pool.get_cached_block(block_hash)

        if cached_block is None:
            # Cache miss - stop searching
            break

        # Cache hit!
        cached_blocks.append(cached_block)
        cached_block.ref_cnt += 1  # Increment the reference count
        num_cached_tokens += block_size

    return cached_blocks, num_cached_tokens
```

Reference Counting

Blocks use reference counting for safe sharing:

```python
# Request A uses blocks [7, 23, 102]
block_7.ref_cnt = 1
block_23.ref_cnt = 1
block_102.ref_cnt = 1

# Request B shares block 7 (same prefix)
block_7.ref_cnt = 2    # Now shared!

# Request A finishes - decrement refs
block_7.ref_cnt = 1    # Still in use by B
block_23.ref_cnt = 0   # Can be freed
block_102.ref_cnt = 0  # Can be freed

# Request B finishes
block_7.ref_cnt = 0    # Now can be freed
```

Safety: a block is freed only when its ref_cnt reaches 0.

Cache Eviction

When the cache is full:

```python
def evict_cached_block(self) -> KVCacheBlock:
    """Evict the LRU cached block with ref_cnt == 0."""

    # Scan the free queue from the left (LRU end)
    for block in self.free_block_queue:
        if block.ref_cnt == 0:
            # Found an eviction candidate
            self.free_block_queue.remove(block)

            # Remove it from the cache lookup
            del self.cached_block_hash_to_block[block.block_hash]
            block.block_hash = None

            return block

    raise OutOfMemoryError("No evictable blocks")
```

Attention Computation with Blocks

Prefill Phase

During prefill, we compute attention over the entire prompt:

```python
def paged_attention_prefill(
    query: Tensor,         # [batch, seq_len, num_heads, head_dim]
    kv_cache: Tensor,      # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor,  # [batch, max_num_blocks]
    context_lens: Tensor   # [batch]
) -> Tensor:
    """
    Compute prefill attention using the paged KV cache.
    """
    batch_size, seq_len, num_heads, head_dim = query.shape
    output = torch.empty_like(query)

    for b in range(batch_size):
        context_len = int(context_lens[b])
        num_blocks = (context_len + block_size - 1) // block_size

        # Gather K, V from the blocks
        k_cache, v_cache = [], []
        for block_idx in range(num_blocks):
            physical_block = block_tables[b, block_idx]
            k, v = kv_cache[physical_block]  # each [num_heads, block_size, head_dim]
            k_cache.append(k)
            v_cache.append(v)

        # Concatenate blocks along the token dim, trimming the partial last block
        k = torch.cat(k_cache, dim=1)[:, :context_len]  # [num_heads, context_len, head_dim]
        v = torch.cat(v_cache, dim=1)[:, :context_len]

        # Standard attention (head-major layout for the batched matmul)
        q = query[b].transpose(0, 1)                     # [num_heads, seq_len, head_dim]
        scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
        attn = scores.softmax(dim=-1)
        output[b] = (attn @ v).transpose(0, 1)           # [seq_len, num_heads, head_dim]

    return output
```

In practice, vLLM uses optimized kernels (FlashAttention, FlashInfer) that operate directly on the block table.

Decode Phase

During decode, each request generates one token:

```python
def paged_attention_decode(
    query: Tensor,         # [batch, 1, num_heads, head_dim]
    kv_cache: Tensor,      # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor,  # [batch, max_num_blocks]
    context_lens: Tensor   # [batch]
) -> Tensor:
    """
    Optimized decode attention - each query attends to the full context.
    """

    # Use a fused kernel that indexes directly into the blocks
    return flash_attn_decode(
        q=query,
        k_cache=kv_cache[:, 0],  # K cache
        v_cache=kv_cache[:, 1],  # V cache
        block_table=block_tables,
        seq_lens=context_lens,
        block_size=block_size
    )
```

The decode kernels are heavily optimized:

  • Fused block gathering and attention computation
  • Warp-level parallelism
  • Shared-memory optimizations

Multiple KV Cache Groups

vLLM supports models whose layers have different cache requirements (MQA, GQA, Mamba):

```python
kv_cache_config = KVCacheConfig(
    kv_cache_groups=[
        KVCacheGroup(  # Group 0: attention layers
            kv_cache_spec=AttentionSpec(
                num_heads=32,
                head_dim=128,
                block_size=16
            )
        ),
        KVCacheGroup(  # Group 1: Mamba layers
            kv_cache_spec=MambaSpec(
                state_size=128,
                block_size=16
            )
        )
    ]
)
```

Each request gets blocks from every group:

```python
# Request with 50 tokens
blocks = [
    [7, 23, 102, 45],  # Group 0: attention cache
    [8, 24, 103, 46]   # Group 1: Mamba cache
]
```

Performance Impact

Memory Efficiency

Before PagedAttention (contiguous allocation):

```
10 requests × 2048 max tokens × 512 KB/token = 10.5 GB
Actual usage: 10 requests × 500 avg tokens × 512 KB/token ≈ 2.5 GB
Efficiency: 24%
```

After PagedAttention (block allocation):

```
10 requests × 500 avg tokens = 5,000 tokens → ~313 blocks (block_size = 16)
Waste: at most 10 partially filled last blocks (~150 token slots)
Efficiency: ~98%
```

Result: 4x more concurrent requests!

Throughput Gains

Real-world benchmark (Llama-3-8B on an H100):

| Metric              | Without PagedAttention | With PagedAttention | Improvement |
|---------------------|------------------------|---------------------|-------------|
| Concurrent requests | 12                     | 64                  | 5.3x        |
| Throughput (tok/s)  | 1,500                  | 8,000               | 5.3x        |
| Memory usage        | 60 GB                  | 60 GB               | same        |
| Latency (TTFT)      | 45 ms                  | 42 ms               | -7%         |

Prefix Caching Benefits

With a shared 500-token system prompt:

| Request | Without caching | With caching | Speedup |
|---------|-----------------|--------------|---------|
| 1st     | 10 ms (prefill) | 10 ms        | 1x      |
| 2nd     | 10 ms           | 1 ms         | 10x     |
| 100th   | 10 ms           | 1 ms         | 10x     |

Cache hit rate: typically 60-80% for chatbots with a system prompt.
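Combining the hit rate with the illustrative 10 ms / 1 ms figures from the table above, the expected prefill cost is a simple weighted average (hypothetical helper, for intuition only):

```python
# Expected prefill latency under a given prefix-cache hit rate.
def expected_prefill_ms(hit_rate, full_ms=10.0, cached_ms=1.0):
    return hit_rate * cached_ms + (1 - hit_rate) * full_ms

for rate in (0.0, 0.6, 0.8):
    print(f"{rate:.0%} hit rate: {expected_prefill_ms(rate):.1f} ms")
```

At a 60-80% hit rate this already cuts average prefill time by roughly 2-3.5x, which is why shared system prompts benefit so much.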

Advanced Topics

Sliding Window Attention

For models like Mistral that use a sliding window:

```python
# Only keep the last 4096 tokens in the cache
if num_tokens > sliding_window_size:
    # Free the old blocks that fall outside the window
    tokens_to_free = num_tokens - sliding_window_size
    old_blocks = blocks[:tokens_to_free // block_size]
    for block in old_blocks:
        self.block_pool.free(block)
```

Copy-on-Write for Speculative Decoding

With speculative decoding:

```python
# The draft model proposes N tokens
draft_tokens = draft_model.generate(n=5)

# Share blocks with the draft (copy-on-write)
draft_blocks = request.blocks    # Same physical blocks!
draft_blocks[-1].ref_cnt += 1    # Increment the last block

# Verify with the main model
verified = main_model.verify(draft_tokens)

if not all_verified:
    # Rollback - free the draft blocks
    for block in draft_blocks[verified_idx:]:
        block.ref_cnt -= 1
```

Implementation Caveats

Hash Collisions

Block hashes can collide:

```python
# Different token sequences can (rarely!) hash to the same value:
#   hash([1, 2, 3, ..., 16]) == 0x123456
#   hash([4, 5, 6, ..., 19]) == 0x123456

# Solution: map each hash to a list of candidate blocks
cached_blocks[block_hash] = [block_a, block_b]
```

Partial Blocks

The last block is usually partially filled:

```python
# 50 tokens with block_size=16
# Block 3 holds only 2 tokens (50 % 16 = 2)

# Must track: num_tokens_in_last_block
# For attention: mask out the unused positions
```
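A minimal sketch of that masking, under the usual convention that padded positions get a score of -inf before the softmax so they receive zero attention weight:

```python
# Mask the unused tail of a partial last block.
block_size = 16
num_tokens = 50
num_blocks = -(-num_tokens // block_size)  # ceil division → 4 blocks
padded_len = num_blocks * block_size       # 64 token slots in total

# 0.0 for real positions, -inf for the padded tail of the last block;
# adding this to the attention scores zeroes out padded slots after softmax.
mask = [0.0 if i < num_tokens else float("-inf") for i in range(padded_len)]
print(mask.count(float("-inf")))  # → 14 masked slots
```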

Thread Safety

Block allocation must be thread-safe:

```python
class BlockPool:
    def allocate(self):
        with self.lock:  # Critical section
            return self._allocate_unsafe()
```

Key Takeaways

  1. PagedAttention splits the KV cache into fixed-size blocks, eliminating fragmentation and over-allocation

  2. The block table maps logical blocks to physical blocks, much like a virtual memory page table

  3. Prefix caching shares blocks across requests, dramatically reducing redundant computation

  4. Reference counting makes sharing safe and prevents premature frees

  5. Near-zero memory waste translates into 4-5x throughput gains in practice

  6. The attention kernels are optimized to work directly on block-indexed storage

What's Next

In Part 3, we'll explore the Scheduler: vLLM's core scheduling component, which decides which requests to process, when to preempt, and how to maximize GPU utilization through continuous batching.

PagedAttention is the foundation of vLLM's outstanding performance. In the next post, we'll see how the Scheduler orchestrates request execution on top of it.