Part 3: The Scheduler - Brain of vLLM

Introduction

The scheduler is vLLM’s central orchestrator. At every iteration, it makes critical decisions: which requests to process, how many tokens to compute, when to preempt requests, and how to maximize GPU utilization. This post dives deep into the scheduler’s algorithms and implementation.

The Scheduling Challenge

The Problem Space

At any moment, the scheduler must handle:

  • Waiting requests: New prompts needing their first tokens
  • Running requests: Ongoing generations in decode phase
  • Resource constraints: Limited GPU memory, compute budget
  • Competing objectives: Minimize latency vs. maximize throughput
  • Dynamic workload: Requests arrive and finish asynchronously

No Simple Solution

Unlike traditional batch processing, LLM serving faces unique challenges:

  1. Variable-length requests: Can’t predict completion time
  2. Memory grows with sequence length: Not just compute-bound
  3. Prefill vs decode asymmetry: A prefill step processes the whole prompt at once and takes far longer than a single decode step
  4. Shared KV cache: Memory decisions affect all requests

Continuous Batching

vLLM’s key innovation: continuous batching (also called iteration-level batching).

Traditional Static Batching

# Static batching - wait for batch to fill
batch = []
while len(batch) < batch_size:
    batch.append(wait_for_request())

# Process entire batch
outputs = model.forward(batch)

# Wait for ALL requests to finish
while any_request_incomplete(batch):
    outputs = model.forward(batch)

# All done - start next batch

Problems:

  • Head-of-line blocking: Fast requests wait for slow ones
  • Low GPU utilization: Batch shrinks as requests finish
  • High latency: Must wait for batch to fill

Continuous Batching

# Continuous batching - add/remove every iteration
running = []

while True:
    # Remove finished requests
    running = [r for r in running if not r.finished]

    # Add new requests if there's capacity
    while can_fit_new_request() and waiting_requests:
        running.append(waiting_requests.pop())

    # Process current batch
    outputs = model.forward(running)

    # Immediately continue - no waiting!

Benefits:

  • No head-of-line blocking: Requests leave when done
  • Constant GPU utilization: Always max batch size
  • Lower latency: New requests start immediately
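The gap between the two approaches can be seen in a toy, self-contained simulation (not vLLM code): count how many iterations each strategy needs to finish a set of requests, assuming each request needs `length` decode steps and one step runs per iteration.

```python
def static_batching_iters(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its slowest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_iters(lengths: list[int], batch_size: int) -> int:
    """Finished requests are replaced immediately from the waiting queue."""
    waiting = sorted(lengths)   # pop() takes the longest request first
    running: list[int] = []
    iters = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop())
        # One decode step for everyone; drop requests that just finished
        running = [r - 1 for r in running if r > 1]
        iters += 1
    return iters
```

With one slow request and three fast ones, continuous batching finishes sooner because the fast requests never wait for the slow one's batch to drain.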

Scheduler Architecture

Location: vllm/v1/core/sched/scheduler.py

Core Data Structures

class Scheduler:
    def __init__(self, vllm_config, ...):
        # Request storage
        self.requests: dict[str, Request] = {}

        # Queues
        self.waiting: RequestQueue = create_request_queue(policy)
        self.running: list[Request] = []
        self.skipped_waiting: RequestQueue = create_request_queue(policy)

        # Resource managers
        self.kv_cache_manager = KVCacheManager(...)
        self.encoder_cache_manager = EncoderCacheManager(...)

        # Constraints
        self.max_num_running_reqs = scheduler_config.max_num_seqs
        self.max_num_scheduled_tokens = scheduler_config.max_num_batched_tokens
        self.max_model_len = model_config.max_model_len

Request States

Requests flow through states:

WAITING → RUNNING → FINISHED

With possible transitions:

  • WAITING → RUNNING: Scheduled for first time
  • RUNNING → RUNNING: Continues in batch
  • RUNNING → PREEMPTED: Temporarily removed (if memory tight)
  • PREEMPTED → WAITING: Returns to queue
  • RUNNING → FINISHED: Generation complete
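The transition list above can be written as a tiny validator. This is an illustrative sketch using the state names from the post, not vLLM's actual request-status enum.

```python
VALID_TRANSITIONS = {
    ("WAITING", "RUNNING"),    # scheduled for the first time
    ("RUNNING", "RUNNING"),    # continues in batch
    ("RUNNING", "PREEMPTED"),  # temporarily removed
    ("PREEMPTED", "WAITING"),  # returns to queue
    ("RUNNING", "FINISHED"),   # generation complete
}

def is_valid_lifecycle(states: list[str]) -> bool:
    """Check that every consecutive pair is an allowed transition."""
    return all((a, b) in VALID_TRANSITIONS for a, b in zip(states, states[1:]))
```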

Request Queues

Waiting Queue: New requests waiting to start

class RequestQueue:
    def __init__(self, policy: SchedulingPolicy):
        self.policy = policy
        if policy == SchedulingPolicy.FCFS:
            # First-Come-First-Serve
            self.queue = deque()
        elif policy == SchedulingPolicy.PRIORITY:
            # Priority-based (higher priority first)
            self.queue = []  # heapq

    def push(self, request: Request):
        if self.policy == SchedulingPolicy.FCFS:
            self.queue.append(request)
        else:
            # Negated so the highest priority pops first; in practice a
            # tie-breaker is also stored so requests are never compared
            heapq.heappush(self.queue, (-request.priority, request))

Running List: Requests currently being processed

# Simply a list - all are processed each iteration
self.running: list[Request] = [req1, req2, ..., req_n]

The Scheduling Algorithm

High-Level Flow

def schedule(self) -> SchedulerOutput:
    """Main scheduling function called every iteration."""

    # 1. Allocate resources to requests
    scheduled_new_reqs = []
    scheduled_running_reqs = []
    preempted_reqs = []

    # Token budget for this iteration
    token_budget = self.max_num_scheduled_tokens

    # 2. Schedule running requests (priority!)
    for req in self.running:
        num_tokens_to_compute = min(
            req.num_tokens - req.num_computed_tokens,
            token_budget
        )

        if num_tokens_to_compute > 0:
            # Allocate KV cache blocks
            blocks = self.kv_cache_manager.allocate_slots(
                req, num_tokens_to_compute
            )

            if blocks is not None:
                scheduled_running_reqs.append(req)
                token_budget -= num_tokens_to_compute
            else:
                # Out of memory - preempt
                preempted_reqs.append(req)

    # 3. Schedule waiting requests (if budget remains)
    while token_budget > 0 and self.waiting:
        req = self.waiting.peek()

        # Can it fit in the remaining budget?
        if not self.can_fit_request(req, token_budget):
            break  # Can't fit more

        req = self.waiting.pop()
        scheduled_new_reqs.append(req)
        token_budget -= req.num_prompt_tokens

    # 4. Build output
    return SchedulerOutput(
        scheduled_new_reqs=scheduled_new_reqs,
        scheduled_running_reqs=scheduled_running_reqs,
        preempted_reqs=preempted_reqs,
        num_scheduled_tokens=self.max_num_scheduled_tokens - token_budget
    )

Step 1: Token Budget

Each iteration has a token budget (typically 2048-8192):

token_budget = self.max_num_scheduled_tokens  # e.g., 4096

# Why limit tokens?
# 1. GPU memory for activations
# 2. Computation time (TTFT)
# 3. CUDA kernel efficiency (not too large)

Trade-off:

  • Larger budget: Higher throughput, higher latency
  • Smaller budget: Lower latency, lower throughput

Step 2: Schedule Running Requests

Running requests get priority:

for request in self.running:
    # How many tokens does this request need?
    num_remaining = request.num_tokens - request.num_computed_tokens

    if num_remaining == 0:
        # Request is finished
        self.finish_request(request)
        continue

    # In decode phase: compute 1 token per iteration
    # In prefill phase: may compute multiple tokens
    num_to_schedule = min(num_remaining, token_budget)

    # Try to allocate KV cache blocks
    blocks, num_cached = self.kv_cache_manager.allocate_slots(
        request,
        num_tokens=request.num_computed_tokens + num_to_schedule,
        num_computed_tokens=request.num_computed_tokens
    )

    if blocks is None:
        # Out of KV cache memory - decide whether to preempt
        if self.should_preempt(request):
            preempted_reqs.append(request)
            self.preempt_request(request)
        continue

    # Successfully scheduled
    scheduled_running_reqs.append(request)
    request.num_scheduled_tokens = num_to_schedule
    token_budget -= num_to_schedule

Key insight: Running requests are usually in decode mode (1 token/iter), so they consume minimal token budget.
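A back-of-the-envelope check of that insight, with illustrative numbers: even a large decode batch consumes only a sliver of the token budget.

```python
token_budget = 4096
num_decoding_requests = 256            # each needs exactly 1 token this iteration
decode_tokens = num_decoding_requests * 1
budget_left_for_prefill = token_budget - decode_tokens
# 3840 tokens remain for admitting new prefills
```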

Step 3: Schedule Waiting Requests

If token budget remains, admit new requests:

while token_budget > 0 and len(self.waiting) > 0:
    request = self.waiting.peek()  # Don't pop yet

    # Check if request can fit
    if not self.can_admit_request(request, token_budget):
        break  # Can't fit any more

    # Pop and schedule
    request = self.waiting.pop()

    # Find prefix cache hits
    cached_blocks, num_cached = \
        self.kv_cache_manager.get_computed_blocks(request)

    num_to_compute = request.num_prompt_tokens - num_cached

    # Allocate blocks for new tokens
    blocks = self.kv_cache_manager.allocate_slots(
        request,
        num_tokens=num_to_compute,
        cached_blocks=cached_blocks
    )

    if blocks is None:
        # Out of memory - back to waiting queue
        self.waiting.push_front(request)
        break

    # Scheduled!
    scheduled_new_reqs.append(request)
    token_budget -= num_to_compute

    # Move to running list
    self.running.append(request)

Step 4: Handle Preemptions

When a request is preempted:

def preempt_request(self, request: Request):
    """Preempt a running request to free memory."""

    # Free all KV cache blocks
    for block in request.blocks:
        self.kv_cache_manager.free(block)

    # Clear blocks
    request.blocks = []
    request.num_computed_tokens = 0  # Must recompute!

    # Move back to waiting queue
    self.running.remove(request)
    self.waiting.push(request)

    # Increment preemption counter (for logging)
    request.num_preemptions += 1

Preemption is expensive: All computation is lost!

Strategy: Preempt requests with least progress to minimize waste.

Advanced Scheduling Strategies

Chunked Prefill

Long prompts are split into chunks:

# Prompt with 10,000 tokens, chunk_size=2048

# Iteration 1: Compute tokens 0-2047
schedule(request, num_tokens=2048)

# Iteration 2: Compute tokens 2048-4095
schedule(request, num_tokens=2048)

# ... continue until all prefill done

# Then switch to decode mode (1 token/iter)

Benefits:

  • Lower TTFT: First chunk processes faster
  • Fair scheduling: Don’t block all other requests
  • Memory efficient: Spread computation across iterations

Implementation:

if request.num_computed_tokens < request.num_prompt_tokens:
    # Still in prefill phase
    max_prefill_tokens = min(
        self.max_num_scheduled_tokens,
        self.chunk_size  # e.g., 2048
    )
    num_to_schedule = min(
        request.num_prompt_tokens - request.num_computed_tokens,
        max_prefill_tokens
    )
else:
    # Decode phase - 1 token per iteration
    num_to_schedule = 1

Priority Scheduling

Requests can have priorities:

request_high_priority = Request(
    request_id="urgent",
    priority=100  # Higher = more important
)

request_low_priority = Request(
    request_id="background",
    priority=1
)

# Scheduler processes high priority first

Use cases:

  • Paid tiers: Premium users get priority
  • Interactive vs batch: Interactive gets priority
  • Deadlines: Time-sensitive requests boosted
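The heap ordering behind priority scheduling can be demonstrated in a few lines: priorities are negated so the highest one pops first from Python's min-heap (the request IDs here are made-up examples).

```python
import heapq

waiting = []
for req_id, priority in [("background", 1), ("urgent", 100), ("normal", 10)]:
    heapq.heappush(waiting, (-priority, req_id))

# Pop in scheduling order: highest priority first
order = [heapq.heappop(waiting)[1] for _ in range(3)]
# order == ["urgent", "normal", "background"]
```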

Speculative Decoding Aware

For speculative decoding, schedule draft and target models:

# Draft model proposes k=5 tokens
draft_tokens = draft_model.generate(k=5)

# Main model verifies all k tokens in ONE iteration
verified = main_model.verify(draft_tokens)  # Returns 0-5

if verified == 5:
    # All accepted! 5x speedup
    request.num_tokens += 5
else:
    # Partial accept - roll back the rest
    request.num_tokens += verified

Scheduler must:

  • Allocate KV cache for draft tokens
  • Handle rollbacks when verification fails

Request Reordering

Dynamically reorder to optimize cache hits:

def reorder_for_cache_hits(waiting: list[Request]) -> list[Request]:
    """Reorder waiting requests to maximize prefix cache hits."""

    # Score each request by expected cache hit rate
    scored = [
        (estimate_cache_hit_rate(req), req)
        for req in waiting
    ]

    # Sort by score, descending (key avoids comparing Request objects on ties)
    scored.sort(key=lambda pair: pair[0], reverse=True)

    return [req for _, req in scored]

Resource Management

KV Cache Allocation

The scheduler must ensure KV cache fits:

def can_admit_request(self, request: Request, token_budget: int) -> bool:
    """Check if a request can be admitted."""

    # 1. Check token budget
    num_tokens_needed = request.num_prompt_tokens
    if num_tokens_needed > token_budget:
        return False

    # 2. Check if KV cache has enough free blocks
    num_blocks_needed = (num_tokens_needed + block_size - 1) // block_size
    num_free_blocks = self.kv_cache_manager.get_num_free_blocks()

    if num_blocks_needed > num_free_blocks:
        return False

    # 3. Check max running requests
    if len(self.running) >= self.max_num_running_reqs:
        return False

    return True

Memory Pressure Handling

When memory is tight:

if self.kv_cache_manager.get_num_free_blocks() < MIN_FREE_BLOCKS:
    # High memory pressure!

    # Option 1: Preempt running requests with the least progress
    preempt_candidates = sorted(
        self.running,
        key=lambda r: r.num_computed_tokens
    )

    for req in preempt_candidates[:NUM_TO_PREEMPT]:
        self.preempt_request(req)

    # Option 2: Evict cached blocks
    self.kv_cache_manager.evict_cached_blocks(NUM_BLOCKS_TO_EVICT)

Encoder Cache Management

For multi-modal models (images, audio):

def can_fit_encoder_inputs(self, request: Request) -> bool:
    """Check if the encoder cache has capacity."""

    num_encoder_tokens = len(request.encoder_input_ids)
    available = self.encoder_cache_manager.get_available_capacity()

    return num_encoder_tokens <= available

Performance Metrics

Scheduler Stats

The scheduler tracks performance:

@dataclass
class SchedulerStats:
    num_running: int
    num_waiting: int
    num_preempted: int

    num_scheduled_tokens: int
    token_budget: int
    utilization: float  # scheduled / budget

    kv_cache_usage: float  # 0.0 - 1.0
    num_cache_hits: int
    num_cache_misses: int

Optimization Objectives

The scheduler balances multiple objectives:

  1. Maximize throughput: Schedule as many tokens as possible
  2. Minimize latency: Prioritize running requests, use chunked prefill
  3. Maximize cache hits: Reorder requests with common prefixes
  4. Fairness: Don’t starve low-priority requests
  5. Memory efficiency: Minimize preemptions

Example: Scheduling Timeline

Let’s trace a complete scheduling scenario:

Initial State

Waiting: [A (100 tokens), B (50 tokens)]
Running: []
Token budget: 200
Free KV blocks: 1000

Iteration 1

# Schedule A (new)
A.num_scheduled_tokens = 100
token_budget = 200 - 100 = 100

# Schedule B (new)
B.num_scheduled_tokens = 50
token_budget = 100 - 50 = 50

# Running: [A, B]
# Waiting: []

GPU computes: A prefill (100 tokens) + B prefill (50 tokens)

Iteration 2

# Both in decode mode now
A.num_scheduled_tokens = 1
B.num_scheduled_tokens = 1
token_budget = 200 - 2 = 198

# New request C arrives (2000 tokens)
# Can fit with chunked prefill
C.num_scheduled_tokens = 198

# Running: [A, B, C]

GPU computes: A decode (1 token) + B decode (1 token) + C prefill (198 tokens)

Iteration 3

# A finishes (generated 50 tokens)
# Free A's KV blocks

# B continues
B.num_scheduled_tokens = 1

# C continues prefill (B's decode token leaves 199 of the 200 budget)
C.num_scheduled_tokens = 199

# Running: [B, C]

Iteration 10

# C finishes prefill, now in decode
# New request D (500 tokens) with high priority

# Memory pressure! Preempt B
self.preempt_request(B)

# Schedule D with chunked prefill (C's decode token leaves 199 of the budget)
D.num_scheduled_tokens = 199

# Running: [C, D]
# Waiting: [B]

Scheduling Policies

vLLM supports different policies:

FCFS (First-Come-First-Serve)

# Simple queue
waiting.append(new_request)
next_request = waiting.pop(0)

Pros: Simple, fair
Cons: Head-of-line blocking

Priority-based

# Priority heap (negated priority so the highest priority pops first)
heapq.heappush(waiting, (-priority, request))
neg_priority, next_request = heapq.heappop(waiting)

Pros: Important requests first
Cons: Starvation of low-priority

Shortest-Job-First

# Sort by estimated completion time
waiting.sort(key=lambda r: r.max_tokens)
next_request = waiting.pop(0)

Pros: Minimizes average latency
Cons: Long requests starve

Debugging Scheduler Issues

Common problems and solutions:

High Preemption Rate

# Symptom: Many requests restarting
if stats.num_preempted > THRESHOLD:
# Solutions:
# 1. Increase KV cache size
# 2. Decrease max_num_seqs
# 3. Enable chunked prefill

Low GPU Utilization

# Symptom: token_budget not fully used
if stats.utilization < 0.8:
# Solutions:
# 1. Increase max_num_seqs
# 2. Reduce chunk_size (more parallelism)

High Latency

# Symptom: TTFT too high
if avg_ttft > TARGET_TTFT:
# Solutions:
# 1. Enable chunked prefill
# 2. Prioritize interactive requests
# 3. Reduce max_num_scheduled_tokens

Key Takeaways

  1. Continuous batching eliminates head-of-line blocking by adding/removing requests every iteration

  2. Token budget controls batch size and latency/throughput trade-off

  3. Running requests get priority to minimize decode latency

  4. Preemption is a last resort when memory is tight, discarding all progress

  5. Chunked prefill balances long prompts with interactive requests

  6. Prefix caching integration requires scheduler awareness for efficient admission

What’s Next

In Part 4, we’ll explore Request Processing - how requests flow through the system from tokenization to streaming output, and how state is managed across iterations.

The scheduler is vLLM’s brain, making thousands of split-second decisions to maximize GPU utilization while meeting latency SLAs. Next, we’ll see how requests actually flow through the system end-to-end.

Part 2: PagedAttention - The Core Innovation

Introduction

PagedAttention is vLLM’s breakthrough innovation that revolutionized LLM serving. Before PagedAttention, serving LLMs efficiently was plagued by memory fragmentation and waste. This post dives deep into how PagedAttention works, why it matters, and how it’s implemented in vLLM.

The Memory Problem in LLM Serving

Understanding KV Cache

In transformer models, attention layers compute key (K) and value (V) vectors for each token. During generation:

  1. Prefill phase: Compute K, V for all prompt tokens
  2. Decode phase: For each new token, compute its K, V and attend to all previous K, V

The challenge: We must store all previous K, V tensors (the “KV cache”) to avoid recomputation.

Memory size: For Llama-3-8B:

  • 32 attention layers
  • 4096 hidden dimensions
  • FP16 precision (2 bytes)
  • Per token: 32 layers × 4096 dim × 2 (K,V) × 2 bytes = 512 KB

For a 2048-token sequence: 1 GB just for KV cache!
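The arithmetic above can be captured in a small helper (bytes per token = layers × hidden dim × 2 for K and V × bytes per element):

```python
def kv_bytes_per_token(num_layers: int, hidden_dim: int, dtype_bytes: int = 2) -> int:
    """KV cache bytes stored per token."""
    return num_layers * hidden_dim * 2 * dtype_bytes

# Llama-3-8B-style numbers from above: 512 KB per token, 1 GB at 2048 tokens
per_token = kv_bytes_per_token(num_layers=32, hidden_dim=4096)
```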

The Fragmentation Problem

Traditional serving systems pre-allocate contiguous memory for each request’s maximum length:

Request 1 (max 2048 tokens, actual 512):
[████░░░░░░░░░░░░] 75% wasted

Request 2 (max 2048 tokens, actual 1800):
[████████████████] 12% wasted

Request 3 (max 2048 tokens, actual 200):
[██░░░░░░░░░░░░░░] 90% wasted

Problems:

  1. Over-allocation: Must allocate for worst-case length
  2. Fragmentation: Can’t use unused space from one request for another
  3. Low Throughput: Memory is the bottleneck, not compute

Real impact: Without PagedAttention, you might serve 10 concurrent requests. With PagedAttention, you can serve 50+ requests with the same memory!

The PagedAttention Solution

PagedAttention applies virtual memory concepts to attention computation:

Key idea: Instead of storing KV cache contiguously, break it into fixed-size blocks. Each request gets a block table mapping logical positions to physical blocks.

Virtual Memory Analogy

Virtual Memory (OS)    PagedAttention (vLLM)
Page                   Block (e.g., 16 tokens)
Page Table             Block Table
Physical Memory        GPU Memory
Memory Allocator       Block Pool
Page Fault             Cache Miss
Copy-on-Write          Prefix Sharing

Block-Based Storage

Instead of contiguous allocation:

Request 1 (512 tokens, 32 blocks):
Blocks: [0, 3, 7, 12, ..., 89]
Scattered in GPU memory

Request 2 (1800 tokens, 113 blocks):
Blocks: [1, 2, 4, 5, ..., 105]
Can reuse freed blocks from other requests

Request 3 (200 tokens, 13 blocks):
Blocks: [6, 8, 9, ..., 18]
Minimal waste (only last block partially filled)

Advantages:

  1. No over-allocation: Allocate blocks as needed
  2. No fragmentation: Any free block can go to any request
  3. Near-zero waste: Only last block of each request may be partially filled
  4. Sharing: Common prefixes share physical blocks
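The block accounting behind "near-zero waste" is a pair of one-liners; only the partially filled last block wastes slots:

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Ceiling division: number of blocks for a sequence."""
    return (num_tokens + block_size - 1) // block_size

def wasted_slots(num_tokens: int, block_size: int = 16) -> int:
    """Unused token positions in the last block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens
```

These reproduce the figures above: 512 tokens → 32 blocks, 1800 → 113, 200 → 13 with only 8 wasted slots.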

Implementation Deep Dive

Block Structure

Location: vllm/v1/core/kv_cache_utils.py

class KVCacheBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id  # Physical block ID
        self.ref_cnt = 0  # Reference count
        self.block_hash: int | None = None  # Hash for prefix caching
        self.is_null = False  # Special null block

Each block represents a fixed number of token positions (block_size, typically 16).

Physical storage:

# In GPU memory, shape: [num_blocks, 2 (K and V), num_heads, block_size, head_dim]
kv_cache = torch.empty(
    (num_blocks, 2, num_heads, block_size, head_dim),
    dtype=torch.float16,
    device='cuda'
)

Block Pool

Location: vllm/v1/core/block_pool.py

The BlockPool manages all KV cache blocks:

class BlockPool:
    def __init__(self, num_gpu_blocks: int, enable_caching: bool, ...):
        # All blocks in the pool
        self.blocks: list[KVCacheBlock] = [
            KVCacheBlock(idx) for idx in range(num_gpu_blocks)
        ]

        # Free block queue (doubly-linked list)
        self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)

        # Cache: block_hash -> KVCacheBlock
        self.cached_block_hash_to_block: BlockHashToBlockMap = {}

Free Block Queue: Implements LRU eviction for cached blocks:

free_block_queue:
┌─────┐      ┌─────┐      ┌─────┐      ┌─────┐
│ 42  │◄───►│ 17  │◄───►│ 99  │◄───►│  3  │
└─────┘      └─────┘      └─────┘      └─────┘
   ▲                                     ▲
   │                                     │
LRU (evict first)            Most recently freed

Allocation:

def allocate(self) -> KVCacheBlock:
    """Allocate a block from the free queue."""
    if len(self.free_block_queue) == 0:
        # No free blocks - need to evict a cached block
        block = self._evict_cached_block()
    else:
        # Pop from free queue
        block = self.free_block_queue.popleft()

    block.ref_cnt += 1
    return block

Deallocation:

def free(self, block: KVCacheBlock) -> None:
    """Free a block back to the pool."""
    block.ref_cnt -= 1

    if block.ref_cnt == 0:
        if block.block_hash is not None:
            # Cached block - add to eviction queue
            self.free_block_queue.append(block)
        else:
            # Uncached block - immediately available
            self.free_block_queue.appendleft(block)

Block Table

Each request maintains a block table mapping logical to physical blocks:

# Example: Request with 50 tokens, block_size=16
# Needs 4 blocks: [0-15], [16-31], [32-47], [48-49]
# (each inner list holds one entry per KV cache group; group 0 here)

block_table = [
    [7],    # Logical block 0 → Physical block 7
    [23],   # Logical block 1 → Physical block 23
    [102],  # Logical block 2 → Physical block 102
    [45]    # Logical block 3 → Physical block 45 (partial)
]

Lookup during attention:

def get_physical_block(logical_block_idx: int) -> int:
    return block_table[logical_block_idx][0]  # [0] = first KV cache group

# Token 37 is in logical block 2 (37 // 16)
# Physical block: block_table[2][0] = 102
# Position in block: 37 % 16 = 5

KVCacheManager

Location: vllm/v1/core/kv_cache_manager.py

The KVCacheManager orchestrates block allocation for requests:

class KVCacheManager:
    def __init__(self, kv_cache_config, max_model_len, ...):
        self.block_pool = BlockPool(...)
        self.enable_caching = enable_caching

    def allocate_slots(
        self,
        request: Request,
        num_tokens: int,
        num_computed_tokens: int = 0
    ) -> tuple[KVCacheBlocks, int]:
        """Allocate KV cache blocks for a request."""

        # 1. Check prefix cache for hits
        cached_blocks, num_cached_tokens = \
            self.get_computed_blocks(request)

        # 2. Calculate how many new blocks are needed
        num_new_tokens = num_tokens - num_cached_tokens
        num_new_blocks = (num_new_tokens + block_size - 1) // block_size

        # 3. Allocate new blocks
        new_blocks = [
            self.block_pool.allocate()
            for _ in range(num_new_blocks)
        ]

        # 4. Combine cached + new blocks
        total_blocks = cached_blocks + new_blocks

        return total_blocks, num_cached_tokens

Prefix Caching: Sharing Across Requests

One of PagedAttention’s killer features is prefix caching - sharing KV cache blocks across requests with common prefixes.

Block Hashing

Each full block gets a hash based on its token content:

def compute_block_hash(
    token_ids: list[int],
    start: int,
    block_size: int
) -> int:
    """Hash a block of tokens."""
    block_tokens = tuple(token_ids[start:start + block_size])

    # extra_keys mixes in multimodal features, cache salt, etc.
    return hash((block_tokens, extra_keys))

Example:

Prompt A: "Translate to French: Hello world"
Tokens: [1, 4521, 284, 1515, 25, 23748, 995]
Block 0: hash([1, 4521, 284, 1515, 25, 23748, 995, ...])

Prompt B: "Translate to French: How are you"
Tokens: [1, 4521, 284, 1515, 25, 1374, 389, 345]
Block 0: hash([1, 4521, 284, 1515, 25, 1374, 389, ...])

# Different hashes - no sharing

But with a common system prompt:

System: "You are a helpful assistant."
Tokens: [1, 2, 3, ..., 50] # 50 tokens

Request A: System + "What is AI?"
Request B: System + "What is ML?"

# Both share blocks 0-2 (first 48 tokens of system prompt)
# Only block 3+ differs
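Since sharing only applies to *full* blocks whose tokens match exactly, the number of shareable blocks can be counted with a small sketch (not vLLM's hashing code):

```python
def shared_full_blocks(a: list[int], b: list[int], block_size: int = 16) -> int:
    """Count leading full blocks with identical token contents."""
    shared = 0
    for start in range(0, min(len(a), len(b)), block_size):
        blk_a = a[start:start + block_size]
        if len(blk_a) == block_size and blk_a == b[start:start + block_size]:
            shared += 1
        else:
            break
    return shared

system_prompt = list(range(50))           # 50 shared system-prompt tokens
req_a = system_prompt + [100, 101, 102]
req_b = system_prompt + [200, 201]
# Blocks 0-2 (48 tokens) are shareable; the partial block 3 is not
```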

Cache Lookup

When a new request arrives:

def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
    """Find cached blocks matching the request's prefix."""

    cached_blocks = []
    num_cached_tokens = 0

    for block_idx, block_hash in enumerate(request.block_hashes):
        # Try to find a cached block with this hash
        cached_block = self.block_pool.get_cached_block(block_hash)

        if cached_block is None:
            # Cache miss - stop searching
            break

        # Cache hit!
        cached_blocks.append(cached_block)
        cached_block.ref_cnt += 1  # Increment reference
        num_cached_tokens += block_size

    return cached_blocks, num_cached_tokens

Reference Counting

Blocks use reference counting for safe sharing:

# Request A uses blocks [7, 23, 102]
block_7.ref_cnt = 1
block_23.ref_cnt = 1
block_102.ref_cnt = 1

# Request B shares block 7 (same prefix)
block_7.ref_cnt = 2 # Now shared!

# Request A finishes - decrement refs
block_7.ref_cnt = 1 # Still in use by B
block_23.ref_cnt = 0 # Can be freed
block_102.ref_cnt = 0 # Can be freed

# Request B finishes
block_7.ref_cnt = 0 # Now can be freed

Safety: A block is only freed when ref_cnt == 0.
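That safety rule in miniature: a shared block survives until the last holder releases it. This is an illustrative sketch, not vLLM's `KVCacheBlock`.

```python
class SharedBlock:
    def __init__(self, block_id: int):
        self.block_id = block_id
        self.ref_cnt = 0

    def acquire(self) -> None:
        self.ref_cnt += 1

    def release(self) -> bool:
        """Return True once the block is free to reuse."""
        self.ref_cnt -= 1
        return self.ref_cnt == 0

b = SharedBlock(7)
b.acquire()               # request A maps the block
b.acquire()               # request B shares the same prefix block
first_free = b.release()  # A finishes: still held by B
second_free = b.release() # B finishes: block can be freed
```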

Cache Eviction

When the cache is full:

def evict_cached_block(self) -> KVCacheBlock:
    """Evict the LRU cached block with ref_cnt == 0."""

    # Scan free queue from the left (LRU end)
    for block in self.free_block_queue:
        if block.ref_cnt == 0:
            # Found an eviction candidate
            self.free_block_queue.remove(block)

            # Remove from cache lookup
            del self.cached_block_hash_to_block[block.block_hash]
            block.block_hash = None

            return block

    raise OutOfMemoryError("No evictable blocks")

Attention Computation with Blocks

Prefill Phase

During prefill, we compute attention for the entire prompt:

def paged_attention_prefill(
    query: Tensor,        # [batch, seq_len, num_heads, head_dim]
    kv_cache: Tensor,     # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor, # [batch, max_num_blocks]
    context_lens: Tensor  # [batch]
) -> Tensor:
    """
    Compute prefill attention using the paged KV cache.
    """

    batch_size, seq_len, num_heads, head_dim = query.shape
    output = torch.empty_like(query)

    for b in range(batch_size):
        context_len = context_lens[b]
        num_blocks = (context_len + block_size - 1) // block_size

        # Gather K, V from blocks
        k_cache, v_cache = [], []
        for block_idx in range(num_blocks):
            physical_block = block_tables[b, block_idx]
            k, v = kv_cache[physical_block]  # each: [num_heads, block_size, head_dim]
            k_cache.append(k)
            v_cache.append(v)

        # Concatenate blocks along the token dimension
        k = torch.cat(k_cache, dim=1)  # [num_heads, context_len, head_dim]
        v = torch.cat(v_cache, dim=1)

        # Standard attention
        scores = query[b] @ k.transpose(-2, -1) / sqrt(head_dim)
        attn = softmax(scores, dim=-1)
        output[b] = attn @ v

    return output

In practice, vLLM uses optimized kernels (FlashAttention, FlashInfer) that work directly with block tables.

Decode Phase

During decode, each request generates one token:

def paged_attention_decode(
    query: Tensor,        # [batch, 1, num_heads, head_dim]
    kv_cache: Tensor,     # [num_blocks, 2, num_heads, block_size, head_dim]
    block_tables: Tensor, # [batch, max_num_blocks]
    context_lens: Tensor  # [batch]
) -> Tensor:
    """
    Optimized decode attention - each query attends to the full context.
    """

    # Use a fused kernel that directly indexes into blocks
    return flash_attn_decode(
        q=query,
        k_cache=kv_cache[:, 0],  # K cache
        v_cache=kv_cache[:, 1],  # V cache
        block_table=block_tables,
        seq_lens=context_lens,
        block_size=block_size
    )

The decode kernel is highly optimized:

  • Fused block gathering and attention
  • Warp-level parallelism
  • Shared memory optimization

Multi-Group KV Cache

vLLM supports models with different cache requirements per layer (MQA, GQA, Mamba):

kv_cache_config = KVCacheConfig(
    kv_cache_groups=[
        KVCacheGroup(  # Group 0: Attention layers
            kv_cache_spec=AttentionSpec(
                num_heads=32,
                head_dim=128,
                block_size=16
            )
        ),
        KVCacheGroup(  # Group 1: Mamba layers
            kv_cache_spec=MambaSpec(
                state_size=128,
                block_size=16
            )
        )
    ]
)

Each request gets blocks from each group:

# Request with 50 tokens
blocks = [
    [7, 23, 102, 45],  # Group 0: Attention cache
    [8, 24, 103, 46]   # Group 1: Mamba cache
]

Performance Impact

Memory Efficiency

Before PagedAttention (contiguous allocation):

10 requests × 2048 max length × 512 KB/token = 10 GB
Actual usage: 10 requests × 500 avg tokens × 512 KB/token ≈ 2.5 GB
Efficiency: ~25%

With PagedAttention (block allocation):

500 blocks allocated across all requests
Waste: ~10 blocks (partial last blocks)
Efficiency: 98%

Result: 4x more concurrent requests!

Throughput Gains

Real benchmarks (Llama-3-8B on H100):

Metric                 Without PagedAttention   With PagedAttention   Improvement
Concurrent Requests    12                       64                    5.3x
Throughput (tok/s)     1,500                    8,000                 5.3x
Memory Usage           60 GB                    60 GB                 Same
Latency (TTFT)         45 ms                    42 ms                 -7%

Prefix Caching Benefits

With a common system prompt (500 tokens):

Request   Without Caching    With Caching   Speedup
1st       10 ms (prefill)    10 ms          1x
2nd       10 ms              1 ms           10x
100th     10 ms              1 ms           10x

Cache hit rate: Typically 60-80% for chatbots with system prompts.

Advanced Topics

Sliding Window Attention

For models like Mistral with sliding windows:

# Only keep the last 4096 tokens in cache
if num_tokens > sliding_window_size:
    # Free old blocks that fall entirely outside the window
    tokens_outside_window = num_tokens - sliding_window_size
    old_blocks = blocks[:tokens_outside_window // block_size]
    for block in old_blocks:
        self.block_pool.free(block)

Copy-on-Write for Speculative Decoding

When using speculative decoding:

# Draft model proposes N tokens
draft_tokens = draft_model.generate(n=5)

# Share blocks with the draft (copy-on-write)
draft_blocks = request.blocks  # Same physical blocks!
draft_blocks[-1].ref_cnt += 1  # Increment last block

# Verify with main model
verified = main_model.verify(draft_tokens)

if not all_verified:
    # Rollback - free draft blocks
    for block in draft_blocks[verified_idx:]:
        block.ref_cnt -= 1

Implementation Gotchas

Hash Collisions

Block hashes can collide:

# Different token sequences can (rarely!) hash to the same value
block_a_hash = hash([1, 2, 3, ..., 16])   # e.g., 0x123456
block_b_hash = hash([4, 5, 6, ..., 19])   # e.g., 0x123456 - collision!

# Solution: map each hash to a list of candidate blocks
cached_blocks[block_hash] = [block_a, block_b]

Partial Blocks

Last block is often partial:

# 50 tokens with block_size=16
# Block 3 only has 2 tokens (50 % 16 = 2)

# Must track: num_tokens_in_last_block
# For attention: mask out unused positions
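The tracking mentioned above is simple modular arithmetic, sketched here:

```python
def tokens_in_last_block(num_tokens: int, block_size: int = 16) -> int:
    """How many positions of the last block are actually used."""
    rem = num_tokens % block_size
    return rem if rem else block_size
```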

Thread Safety

Block allocation must be thread-safe:

class BlockPool:
    def allocate(self):
        with self.lock:  # Critical section
            return self._allocate_unsafe()

Key Takeaways

  1. PagedAttention breaks KV cache into fixed-size blocks, eliminating fragmentation and over-allocation

  2. Block tables map logical to physical blocks, similar to virtual memory page tables

  3. Prefix caching shares blocks across requests, dramatically reducing redundant computation

  4. Reference counting ensures safe sharing, preventing premature deallocation

  5. Near-zero memory waste enables 4-5x higher throughput in practice

  6. Attention kernels are optimized to work directly with block-indexed storage

What’s Next

In Part 3, we’ll explore the Scheduler - vLLM’s brain that decides which requests to process, when to preempt, and how to maximize GPU utilization using continuous batching.


PagedAttention is the foundation that makes vLLM’s exceptional performance possible. In the next post, we’ll see how the Scheduler builds on top of this to orchestrate request execution.

Part 1: vLLM Architecture Overview

Introduction

Before diving into specific components, we need to understand vLLM’s overall architecture. This post maps out the major components and how they interact, providing the foundation for deeper exploration in subsequent parts.

The 30,000-Foot View

vLLM is designed around a few key principles:

  1. Separation of Concerns: Different aspects (scheduling, execution, serving) are handled by distinct components
  2. Process Isolation: V1 architecture uses multiple processes for robustness and CPU efficiency
  3. Asynchronous Processing: Requests flow through pipelines without blocking
  4. Distributed by Default: Built from the ground up to support multi-GPU execution

High-Level Data Flow

Here’s what happens when you send a request to vLLM:

User Request (HTTP/gRPC)
        ↓
API Server Process
        ↓ (ZMQ Socket)
Engine Core Process
        ↓
Scheduler
        ↓
GPU Worker Processes
        ↓ (NCCL/Collective Ops)
Model Execution
        ↓
Output Processing
        ↓ (ZMQ Socket)
API Server Process
        ↓
Streaming Response to User

V1 Multi-Process Architecture

The V1 architecture (introduced in late 2024) uses a multi-process design for better CPU utilization and isolation. Let’s understand each process type:

1. API Server Process

Purpose: Handle HTTP/gRPC requests and I/O operations

Responsibilities:

  • Receive and validate HTTP requests
  • Tokenize input text
  • Load multi-modal data (images, audio)
  • Stream outputs back to clients
  • Handle API authentication and rate limiting

Key Implementation: vllm/entrypoints/openai/api_server.py

The API server is stateless with respect to model execution. It doesn’t know about GPU memory, KV caches, or model weights. It simply:

  1. Converts user requests to EngineCoreRequest objects
  2. Sends them to engine cores via ZMQ
  3. Receives EngineCoreOutput objects back
  4. Converts them to API responses

Process Count: By default, 1 API server. With data parallelism (--data-parallel-size N), automatically scales to N API servers.

CPU Threads: Uses VLLM_MEDIA_LOADING_THREAD_COUNT threads (default 8) for parallel media loading.

2. Engine Core Process

Purpose: Schedule requests and coordinate model execution

Responsibilities:

  • Maintain the request queue
  • Run the scheduler to decide what to compute
  • Manage KV cache allocation
  • Coordinate GPU workers
  • Handle request preemption and swapping

Key Implementation: vllm/v1/engine/core.py

The engine core runs a tight loop:

while True:
    # 1. Receive new requests from API servers
    new_requests = receive_requests()

    # 2. Schedule which requests to process
    scheduler_output = scheduler.schedule()

    # 3. Dispatch work to GPU workers
    dispatch_to_workers(scheduler_output)

    # 4. Collect outputs and send back to API servers
    outputs = collect_outputs()
    send_outputs(outputs)

Process Count: One per data parallel rank. For --data-parallel-size 4, you get 4 engine cores.

CPU Usage: Runs a busy loop for low-latency scheduling decisions.

3. GPU Worker Processes

Purpose: Execute model forward passes on GPUs

Responsibilities:

  • Load model weights onto GPU
  • Execute forward passes
  • Manage GPU memory
  • Run CUDA kernels (attention, FFN, etc.)
  • Participate in collective operations (for tensor/pipeline parallelism)

Key Implementation: vllm/v1/worker/gpu_worker.py

Each GPU gets its own worker process. The worker:

  1. Loads its shard of model weights
  2. Receives execution requests from its engine core
  3. Runs the model forward pass
  4. Returns outputs (logits, KV cache updates)

Process Count: One per GPU. For 8 GPUs with --tensor-parallel-size 4 --data-parallel-size 2:

  • 8 GPU workers total
  • 2 groups of 4 (tensor parallel groups)
  • 2 engine cores (one per data parallel rank)

4. DP Coordinator Process (Conditional)

Purpose: Coordinate data parallel engines

Responsibilities:

  • Load balance across data parallel ranks
  • Synchronize MoE model execution
  • Route requests to least-loaded engine

Key Implementation: vllm/v1/engine/coordinator.py

Process Count: 1 if --data-parallel-size > 1, otherwise 0.

Process Count Examples

Let’s see some concrete examples:

Example 1: Single GPU

vllm serve meta-llama/Llama-3-8B

Processes:

  • 1 API Server
  • 1 Engine Core
  • 1 GPU Worker
  • Total: 3 processes

Example 2: Tensor Parallelism (4 GPUs)

vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4

Processes:

  • 1 API Server
  • 1 Engine Core
  • 4 GPU Workers (one per GPU)
  • Total: 6 processes

Example 3: Data Parallelism (4 GPUs)

vllm serve meta-llama/Llama-3-8B --data-parallel-size 4

Processes:

  • 4 API Servers (scales with DP)
  • 4 Engine Cores (one per DP rank)
  • 4 GPU Workers (one per GPU)
  • 1 DP Coordinator
  • Total: 13 processes

Example 4: Mixed Parallelism (8 GPUs)

vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 2 --data-parallel-size 4

Processes:

  • 4 API Servers
  • 4 Engine Cores (one per DP rank)
  • 8 GPU Workers (2 per DP rank, one per GPU)
  • 1 DP Coordinator
  • Total: 17 processes
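The per-example arithmetic follows directly from the rules above; a small helper (illustrative only, not part of vLLM) reproduces each total:

```python
def process_counts(tp: int = 1, dp: int = 1) -> dict:
    """Process breakdown for a vLLM deployment with the given TP/DP sizes."""
    counts = {
        "api_servers": dp,                     # scales with data parallelism
        "engine_cores": dp,                    # one per DP rank
        "gpu_workers": tp * dp,                # one per GPU
        "dp_coordinator": 1 if dp > 1 else 0,  # only when DP is enabled
    }
    counts["total"] = sum(counts.values())
    return counts

print(process_counts()["total"])            # → 3  (single GPU)
print(process_counts(tp=4)["total"])        # → 6  (TP=4)
print(process_counts(dp=4)["total"])        # → 13 (DP=4)
print(process_counts(tp=2, dp=4)["total"])  # → 17 (TP=2, DP=4)
```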

Core Components Deep Dive

LLMEngine

The LLMEngine class is the main entry point for offline inference (using the Python API directly).

Location: vllm/v1/engine/llm_engine.py

Key responsibilities:

  • Create and manage the engine core
  • Process inputs via InputProcessor
  • Convert outputs via OutputProcessor
  • Handle request lifecycle

Usage example:

from vllm import LLM, SamplingParams

# Initialize engine
llm = LLM(model="meta-llama/Llama-3-8B")

# Generate
outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, top_p=0.95)
)

Under the hood, LLM creates an LLMEngine, which creates an EngineCore, which coordinates the workers.

Scheduler

The scheduler is the brain of vLLM. It decides:

  • Which requests to process in each iteration
  • How many tokens to compute for each request
  • When to preempt requests (and which ones)
  • How to allocate KV cache blocks

Location: vllm/v1/core/sched/scheduler.py

The scheduler maintains three queues:

  1. Waiting: New requests waiting to start
  2. Running: Requests being actively processed
  3. Skipped Waiting: Requests temporarily skipped due to dependencies

Scheduling algorithm (simplified):

def schedule(self) -> SchedulerOutput:
    scheduled_requests = []
    token_budget = max_num_scheduled_tokens

    # First, schedule running requests (priority)
    for req in running:
        if token_budget > 0:
            num_tokens = min(req.remaining_tokens, token_budget)
            scheduled_requests.append((req, num_tokens))
            token_budget -= num_tokens

    # Then, schedule waiting requests
    for req in waiting:
        if token_budget >= req.num_prompt_tokens:
            scheduled_requests.append((req, req.num_prompt_tokens))
            token_budget -= req.num_prompt_tokens
        else:
            break  # Not enough budget

    return SchedulerOutput(scheduled_requests)

We’ll explore the scheduler in depth in Part 3.

KV Cache Manager

Manages memory for attention key-value caches using PagedAttention.

Location: vllm/v1/core/kv_cache_manager.py

Key concepts:

  • Block: Fixed-size chunk of KV cache (e.g., 16 tokens)
  • Block Pool: Pre-allocated pool of blocks
  • Block Table: Mapping from logical positions to physical blocks
  • Prefix Cache: Shared blocks for common prompt prefixes

Example: If block size is 16 and we have 1000 blocks, we can serve:

  • 1 request with 16,000 tokens, or
  • 16 requests with 1,000 tokens each, or
  • Any combination that fits in 1000 blocks

We’ll dive deep into PagedAttention in Part 2.

Worker and ModelRunner

The worker process loads the model and executes forward passes.

Worker (vllm/v1/worker/gpu_worker.py):

  • Manages GPU device
  • Initializes model weights
  • Coordinates with other workers (for TP/PP)

ModelRunner (varies by backend):

  • Prepares input tensors
  • Executes model forward pass
  • Applies CUDA graph optimization
  • Returns output logits

Request Lifecycle

Let’s trace a complete request through the system:

1. Request Arrives

POST /v1/completions
{
  "model": "meta-llama/Llama-3-8B",
  "prompt": "The capital of France is",
  "max_tokens": 50
}

2. API Server Processing

  • Validate request
  • Tokenize prompt: [791, 3139, 315, 9822, 374]
  • Create EngineCoreRequest object
  • Send to engine core via ZMQ

3. Engine Core Receives

  • Add request to scheduler’s waiting queue
  • Assign request ID: "req_abc123"
  • Initialize request state

4. First Scheduling Iteration

  • Scheduler sees new request in waiting queue
  • Check KV cache: 5 tokens need computation
  • Allocate KV cache blocks (1 block for 16-token block size)
  • Schedule for prefill: compute all 5 tokens

5. Worker Execution

  • Receive schedule from engine core
  • Prepare input tensors
  • Run model forward pass
  • Generate first token (e.g., token ID 330 = “Paris”)
  • Return logits to engine core

6. Output Processing

  • Engine core receives logits
  • Sample next token: 330 (“Paris”)
  • Update request state: add token to output
  • Send output to API server via ZMQ

7. API Server Streaming

  • Receive token from engine core
  • Detokenize: “Paris”
  • Stream to client: data: {"text": "Paris", "finish_reason": null}

8. Subsequent Iterations

  • Request moves to running queue
  • Scheduler continues allocating tokens
  • Each iteration: compute 1 new token (decode phase)
  • Stream each token to client

9. Request Completion

  • Either max_tokens reached or EOS generated
  • Free KV cache blocks (or keep in prefix cache)
  • Send final completion to client
  • Remove request from scheduler

Configuration and Initialization

vLLM’s configuration is centralized in VllmConfig:

from vllm.config import VllmConfig

vllm_config = VllmConfig(
    model_config=ModelConfig(...),
    cache_config=CacheConfig(...),
    scheduler_config=SchedulerConfig(...),
    parallel_config=ParallelConfig(...),
    # ... more configs
)

All components receive the same VllmConfig object, ensuring consistency.

Key configs:

  • ModelConfig: Model name, dtype, tokenizer
  • CacheConfig: KV cache size, block size, prefix caching
  • SchedulerConfig: Max batch size, scheduling policy
  • ParallelConfig: TP/PP/DP sizes, distributed backend

Class Hierarchy

The class hierarchy follows a consistent pattern:

LLMEngine
├── InputProcessor (tokenization, preprocessing)
├── EngineCore
│   ├── Scheduler
│   │   ├── KVCacheManager
│   │   ├── EncoderCacheManager
│   │   └── StructuredOutputManager
│   └── Executor
│       └── Workers (one per GPU)
│           └── ModelRunner
│               └── Model (nn.Module)
└── OutputProcessor (detokenization, streaming)

Every class accepts VllmConfig in its constructor, providing access to all configuration options.

Inter-Process Communication

ZMQ Sockets

API servers and engine cores communicate via ZMQ:

# API Server side
zmq_socket.send(pickle.dumps(request))

# Engine Core side
request = pickle.loads(zmq_socket.recv())

Why ZMQ?

  • High performance (microsecond latency)
  • Flexible patterns (req-rep, pub-sub, push-pull)
  • Built-in queue management
  • Language agnostic

NCCL for GPU Communication

GPU workers use NCCL for collective operations:

# Tensor parallelism: all-reduce across GPUs
torch.distributed.all_reduce(
    tensor,
    op=torch.distributed.ReduceOp.SUM,
    group=tensor_parallel_group
)

Communication patterns:

  • Tensor Parallel: All-reduce after attention/FFN
  • Pipeline Parallel: Send/recv between stages
  • Data Parallel: No communication during forward pass

Memory Layout

Understanding memory is crucial for vLLM:

GPU Memory Breakdown

Total GPU Memory: 80GB (H100)
├── Model Weights: 16GB (Llama-3-8B in FP16)
├── KV Cache: 60GB (dynamically allocated blocks)
├── Activation Memory: 2GB (for batch processing)
└── Framework Overhead: 2GB (PyTorch, CUDA)

KV Cache Memory

For a 16-token block size with 8B model:

  • Each block stores K and V for 32 attention layers
  • Per block: 16 tokens × 32 layers × 2 (K,V) × 4096 dim × 2 bytes = 8.4 MB
  • With 60GB for KV cache: ~7,100 blocks
  • Total capacity: ~113,600 tokens across all requests
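The arithmetic above can be checked directly (the 4096 here is Llama-3-8B's hidden size, and FP16 gives 2 bytes per element):

```python
def kv_block_bytes(block_size=16, num_layers=32, hidden_dim=4096, dtype_bytes=2):
    # tokens × layers × 2 (K and V) × hidden dim × bytes per element
    return block_size * num_layers * 2 * hidden_dim * dtype_bytes

per_block = kv_block_bytes()          # 8,388,608 bytes ≈ 8.4 MB
num_blocks = 60 * 10**9 // per_block  # ≈ 7,100 blocks fit in 60 GB
capacity = num_blocks * 16            # ≈ 114,000 tokens of total capacity

print(per_block, num_blocks, capacity)
```

The exact results (8,388,608 bytes per block, 7,152 blocks) round to the figures quoted above.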

Performance Characteristics

Typical performance on H100 GPU:

Metric               Value
Throughput           ~8,000 tokens/sec (Llama-3-8B)
Latency (TTFT)       ~20-50ms
Latency (TPOT)       ~10-15ms
Max Batch Size       ~256 concurrent requests
Memory Efficiency    ~95% (vs ~60% without PagedAttention)

What’s Next

Now that we understand the overall architecture, we can dive deeper into specific components:

  • Part 2: PagedAttention and KV cache management
  • Part 3: The scheduler’s decision-making process
  • Part 4: Request processing and state management
  • Part 5: Distributed execution and parallelism

Key Takeaways

  1. V1 uses multi-process architecture for better CPU utilization and fault isolation
  2. Scheduler is the central coordinator, deciding what to compute each iteration
  3. KV cache management (PagedAttention) is key to memory efficiency
  4. Process count scales with parallelism: typically API servers + Engine cores + GPU workers + (optional DP coordinator)
  5. ZMQ enables efficient inter-process communication
  6. All components share a unified VllmConfig for consistency


In the next post, we’ll explore PagedAttention in detail, understanding how vLLM achieves near-zero memory waste through clever memory management.

vLLM Deep Dive Series: Understanding Modern LLM Serving

Series Overview

This series provides a comprehensive technical deep-dive into vLLM, one of the most important open-source projects for LLM inference and serving. Originally developed at UC Berkeley’s Sky Computing Lab, vLLM has become the de facto standard for high-performance LLM serving in production environments.


Introduction

The prow-images repository is a core component of a sophisticated CI/CD infrastructure built on Kubernetes Prow. It serves as a centralized collection of purpose-built container images that power every aspect of the continuous integration and delivery pipelines. This post takes a deep look at the architecture, components, and workflows of the prow-images ecosystem, with particular attention to how it relates to the prow-configs repository and the manual-trigger service.

What is prow-images?

The prow-images repository is a monorepo containing more than 35 specialized container images, each designed to handle a specific task within Prow jobs. These images range from basic utilities (such as Git operations) to sophisticated tools such as E2E test frameworks, Kubernetes cluster provisioning, and automated security PR generation.

Repository Structure

Each component in the repository follows a consistent structure:

  • A Dockerfile for building the container image
  • A VERSION file tracking the current release (e.g. v0.0.1)
  • An entrypoint directory containing the Go-based main application
  • Component-specific README documentation

A root-level Makefile orchestrates building all images and pushing them to the central registry at hub.tess.io/prowimages/.

Core Components

Let's take a closer look at some of the key components that make up this ecosystem:

1. CI Generator - The Configuration Automation Engine

The CI Generator is one of the most critical components in the ecosystem. It automatically generates Prow job specifications from simplified manifest files.

Key features:

  • Reads .manifest files from the prow-configs repository
  • Supports multiple job generation types: Build, UnitTest
  • Automatically generates presubmit and postsubmit job configurations
  • Handles complex build scenarios integrating with Kaniko

How it works:

  1. Developers create a ci.manifest file in the repository's job directory in prow-configs
  2. The CI Generator reads these manifests and produces complete Prow job YAML specifications
  3. Generated files are stamped with a header: "This file is auto-generated by the ci generator; do not edit manually"
  4. Jobs can be configured with different triggers: on PR (onPr) or on tag (onTag), with regex pattern support

Example manifest snippet:

tess/maintenance:
- jobGenType: Build
  name: build-maintenance-controller
  branch: master
  dockerFile: Dockerfile
  versionFile: VERSION
  buildTime: kaniko
  targets:
  - onPr:
      imageTags:
      - hub.tess.io/maintenance/maintenance-controller:pr-${PULL_NUMBER}
  - onTag:
      tagRegex:
      - ^v(\d)+\.(\d)+\.(\d)+
      imageTags:
      - hub.tess.io/maintenance/maintenance-controller:${PULL_BASE_REF}

2. Kaniko - Secure Container Builds

The Kaniko image wrapper provides a way to build container images securely inside Kubernetes Pods, with no access to a Docker daemon required.

Capabilities:

  • Git repository cloning with configurable depth
  • Support for git-crypt encrypted repositories
  • Multiple destination registries
  • Build arguments and labels
  • Registry mirror support
  • TLS verification options
  • Automatic build-result comments on GitHub PRs
  • Post-build command execution

Usage pattern in a Prow job:

spec:
  containers:
  - image: hub.tess.io/prowimages/kaniko:latest
    args:
    - --dockerfile=Dockerfile
    - --context=/workspace/repo
    - --destination=hub.tess.io/myapp:${PULL_NUMBER}
    - --build-arg=VERSION=${VERSION}

3. Kind - Kubernetes-in-Docker Test Environments

The Kind image makes it possible to spin up ephemeral Kubernetes clusters inside CI pipelines for E2E testing.

Features:

  • Creates isolated Kubernetes clusters
  • Supports multiple Kubernetes versions (1.20, 1.32, 1.34)
  • Integrates with upstream Kubernetes patches
  • Scriptable cluster configuration
  • Automatic cleanup

4. Auto Security PR - Automated RBAC Management

This specialized tool automates the creation of pull requests for security-related RBAC resources across multiple clusters.

Workflow:

  1. Define RBAC resources (ClusterRoles, ServiceAccounts, ClusterRoleBindings) in YAML
  2. Specify a scope: fcp, cluster, tessAppsAZ, tessNetAZ, or tessMasterAZ
  3. Target specific clusters, or use all: true to target every cluster in the scope
  4. The tool generates and submits a PR to the sig-security repository

Usage example:

clusterRoles:
- metadata:
    name: cluster-role-1
  rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
scope:
  cluster:
    all: true

5. Auto Approval/Validation - PR Automation

These components handle automatic approval and validation of pull requests:

  • Auto Approval: automatically approves PRs when specific files contain certain key-value pairs
  • Auto Validation: validates the immutability of PR changes and the correctness of Kubernetes objects

6. E2E Test Suites

Several E2E test images provide comprehensive testing capabilities:

  • e2e: the base image for general E2E testing
  • fd-e2e: dedicated to feature-development E2E testing
  • tessci: cluster acquisition and management tooling for E2E tests

7. Build Tool Collection

A variety of specialized build tools:

  • Bazel: multiple versions (3.4.1, 4.2.2, 7.3.1, 7.7.1) for Bazel builds
  • Go: a standardized Golang build environment
  • Ko: a container image builder for Go
  • Buildctl: a BuildKit CLI wrapper
  • Python: Python runtime environments (3.8.15, 3.14.0)

8. Git Operations

Git-related utilities:

  • git: a wrapper for core Git operations
  • git-sync-k8s-patches: syncs Kubernetes patches with upstream
  • make-commit: automated commit creation

9. Development Tools

  • helm-bot: automated Helm chart management and PR creation
  • autotag: automated version tagging
  • clone-and-do: clones a repository and runs a command (Bash, Make)
  • gotestcover: Go test coverage analysis and reporting

10. Specialized Tools

  • canirun: image vulnerability scanning
  • prow-config-validator: validates Prow configuration files
  • prow-image-builder: builds the images defined in this repository
  • create-release: automated release creation
  • release-notes: release note generation

The Three-Repository Ecosystem

Repository 1: prow-images

Purpose: container image definitions and build logic

Location: /Users/tashen/prow-images

Contents:

  • 35+ specialized container image definitions
  • Dockerfiles and VERSION files
  • Go-based entrypoint applications
  • Build orchestration via the Makefile
  • Shared Go libraries (git utilities, manifest processing)

Build process:

# Build all images
make image

# Push all images to the registry
make push

# Build a specific image
make image-kaniko

Registry: all images are pushed to hub.tess.io/prowimages/

Repository 2: prow-configs

Purpose: Prow job configurations and CI/CD pipeline definitions

Location: /Users/tashen/prow-configs

Structure:

prow-configs/
├── jobs/
│   ├── ebaytess/        # org-specific jobs
│   ├── ebayistio/
│   ├── ESTOOLS/
│   └── ...
├── prow-configs/        # Prow configuration files
├── hack/                # helper scripts
└── Makefile

Key files:

  • ci.manifest: simplified job definitions processed by the CI Generator
  • *.yaml: auto-generated Prow job specifications
  • Configuration for presubmit, postsubmit, and periodic jobs

Workflow:

  1. A developer creates or modifies a ci.manifest file
  2. Running make jobgen generates the job specifications
  3. CI validates the generated configuration
  4. After merge, Prow loads the new configuration
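To make the manifest-to-job translation concrete, here is a toy generator in Python (the field names follow the ci.manifest example earlier; the output shape is our own simplification, not the real CI Generator's format):

```python
def generate_presubmit_jobs(manifest: dict) -> list:
    """Toy sketch: expand Build entries that have an onPr target into presubmit jobs."""
    jobs = []
    for repo, entries in manifest.items():
        for entry in entries:
            for target in entry.get("targets", []):
                if "onPr" in target:
                    jobs.append({
                        "name": entry["name"],
                        "repo": repo,
                        "branch": entry["branch"],
                        "image_tags": target["onPr"]["imageTags"],
                    })
    return jobs

manifest = {
    "myorg/myrepo": [{
        "jobGenType": "Build",
        "name": "build-myapp",
        "branch": "main",
        "targets": [{"onPr": {"imageTags": ["hub.tess.io/myorg/myapp:pr-${PULL_NUMBER}"]}}],
    }]
}
print(generate_presubmit_jobs(manifest)[0]["name"])  # → build-myapp
```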

Repository 3: test-infra/prow/cmd/manual-trigger

Purpose: manual job triggering service

Location: /Users/tashen/test-infra/prow/cmd/manual-trigger

Function: an HTTP service that allows Prow jobs to be triggered without a GitHub event

End-to-End Workflows

Scenario 1: Adding a New Build Job

  1. Developer action (prow-configs)

# In prow-configs/jobs/myorg/myrepo/ci.manifest
myorg/myrepo:
- jobGenType: Build
  name: build-myapp
  branch: main
  dockerFile: Dockerfile
  versionFile: VERSION
  buildTime: kaniko
  targets:
  - onPr:
      imageTags:
      - hub.tess.io/myorg/myapp:pr-${PULL_NUMBER}
  2. CI generation

    • The CI Generator (from prow-images) reads the manifest
    • It generates the complete Prow job YAML specification
    • It creates a presubmit job that runs on PRs
    • It configures the Kaniko image from prow-images as the job container
  3. Job execution

    • When a PR is created, Prow triggers the presubmit job
    • The Kaniko image clones the repository
    • It builds the container using the specified Dockerfile
    • It pushes the image to the registry tagged pr-${PULL_NUMBER}
    • It posts the build status back to the GitHub PR

Scenario 2: Manually Running E2E Tests

  1. A developer needs to test a specific commit

curl -X POST "http://manual-trigger.tessprow/manual-trigger" \
  -H "Content-Type: application/json" \
  -d '{
    "org": "tess",
    "repo": "tessops",
    "base_ref": "feature-branch",
    "prowtype": "postsubmit",
    "prowjob": "sddz-e2e-k8s-1.32",
    "user": "developer-name"
  }'
  2. The manual-trigger service

    • Validates the request parameters
    • Looks up the job in the Prow configuration from prow-configs
    • Creates a ProwJob custom resource in Kubernetes
    • Sets the AUTHOR=developer-name environment variable
  3. Job execution

    • The Prow scheduler picks up the ProwJob
    • It uses the e2e image from prow-images
    • It provisions a Kind cluster (using the kind image from prow-images)
    • It runs the E2E tests
    • It reports the results (but does not post to GitHub, since the run was manually triggered)

Scenario 3: Auto-Generated Security PRs

  1. Security team action

    • Define RBAC resources in YAML files
    • Specify the target clusters (cluster 11, 22, or all clusters)
  2. Automated workflow

    • A Prow periodic job triggers the auto-security-pr image
    • The image scans the cluster configurations
    • It generates the appropriate RBAC manifests for each cluster
    • It creates a PR to the sig-security repository
    • The auto-validation job validates the PR
    • If the conditions are met, the auto-approval job approves it

Scenario 4: Building and Updating prow-images Itself

  1. A developer updates the Kaniko image

    • Modify /Users/tashen/prow-images/kaniko/entrypoint/main.go
    • Bump the version to v0.0.2 in /Users/tashen/prow-images/kaniko/VERSION
  2. Local build

cd /Users/tashen/prow-images
make image-kaniko
make push-kaniko
  3. CI integration

    • A PR to prow-images triggers presubmit jobs
    • The prow-image-builder job builds all modified images
    • Tests validate the new images
    • After merge, postsubmit jobs build and push to the registry
  4. prow-configs update

    • Jobs in prow-configs that reference the Kaniko image can now use the new version
    • Update the image tag in the manifest files, or use the latest tag for automatic updates

Key Integration Points

1. The Registry as the Central Hub

All components communicate through the central registry at hub.tess.io:

  • prow-images builds and pushes to hub.tess.io/prowimages/
  • prow-configs references images from this registry
  • Application images are built and pushed to org-specific namespaces

2. Manifest-Driven Configuration

The CI Generator bridges simple manifests and full Prow configuration:

  • Developers write simple .manifest files
  • The CI Generator (from prow-images) converts them into complete job specifications
  • Prow (configured by prow-configs) executes those jobs
  • The jobs run images from prow-images

3. Git as the Source of Truth

All three repositories use Git for versioning and triggering:

  • Changes to prow-images trigger image rebuilds
  • Changes to prow-configs trigger configuration validation
  • Changes to application repositories trigger the jobs defined in prow-configs
  • Manual-trigger provides out-of-band triggering when needed

4. Kubernetes-Native Architecture

Everything runs on Kubernetes:

  • Prow components run as Kubernetes services
  • ProwJobs are Kubernetes custom resources
  • All job execution happens in Kubernetes Pods
  • Images from prow-images provide the Pod containers

Advanced Features

Version Management

Every image in prow-images maintains a VERSION file:

v0.0.1

This enables:

  • Semantic versioning of the tools
  • Reproducible builds
  • Rollback capability
  • Image tag generation in CI jobs

Multi-Trigger Support

Jobs can be configured to run on different triggers:

  • onPr: runs on pull requests
  • onTag: runs when a matching tag pattern is pushed
  • Manual: triggered via the manual-trigger service
  • Periodic: executed on a schedule

Security and Authentication

  • Git operations use token-based authentication
  • Images support git-crypt for encrypted repositories
  • Registry authentication via Kubernetes Secrets
  • RBAC governs Prow job execution permissions

Observability

  • Prometheus metrics exposed on port 9090
  • Detailed logging in all images
  • GitHub status reporting for PR jobs
  • ProwJob status visible in the Prow UI (deck)

Best Practices

prow-images Development

  1. Version bumps: always update the VERSION file when making changes
  2. Testing: test images locally before pushing
  3. Documentation: update the component README for significant changes
  4. Backward compatibility: consider existing users when changing interfaces

prow-configs Management

  1. Use manifests: prefer .manifest files over hand-written job YAML
  2. Validate locally: run make jobgen before submitting a PR
  3. Test jobs: use manual-trigger to exercise new jobs before merging
  4. Avoid manual edits: never edit auto-generated files directly

Manual Triggering

  1. Use the right type: choose presubmit for PR testing, postsubmit for branch testing
  2. Set the user: always supply the user parameter for audit trails
  3. Monitor jobs: check job status in the Prow UI or via kubectl
  4. Clean up: failed jobs should be investigated and cleaned up

Common Workflow Summary

Adding a New Tool to prow-images

1. Create a directory: prow-images/mytool/
2. Add a Dockerfile
3. Add a VERSION file
4. Implement entrypoint/main.go
5. Update the Makefile with build targets
6. Build: make image-mytool
7. Push: make push-mytool
8. Reference it in prow-configs job definitions

Adding a New Job to prow-configs

1. Create or edit the ci.manifest file
2. Define the job with jobGenType, name, and triggers
3. Run: make jobgen
4. Verify the generated YAML
5. Open a PR against prow-configs
6. CI validates the configuration
7. Merge → Prow loads the new job

Manually Triggering a Job

1. Find the job name in prow-configs
2. Identify the org, repo, and branch
3. POST to the manual-trigger service
4. Receive the job_name in the response
5. Monitor: kubectl get prowjobs -n tessprow | grep <job_name>
6. View logs in the Prow UI

Architecture Diagram (Conceptual)

┌─────────────────────────────────────────────────────────┐
│                     GitHub Events                       │
│              (PR created, commit pushed)                │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│                   Prow Controllers                      │
│        (read config from the prow-configs repo)         │
└─────────┬──────────────────────────────────────┬────────┘
          │                                      │
          │           trigger ProwJob            │
          ▼                                      ▼
┌──────────────────────┐            ┌──────────────────────┐
│   Manual Trigger     │            │    Periodic Jobs     │
│   (HTTP service)     │            │                      │
└─────────┬────────────┘            └──────────┬───────────┘
          │                                    │
          │           create ProwJob           │
          ▼                                    ▼
┌─────────────────────────────────────────────────────────┐
│             Kubernetes ProwJob Resources                │
└─────────┬───────────────────────────────────────────────┘
          │
          │ spawn Pods
          ▼
┌─────────────────────────────────────────────────────────┐
│                  Container Execution                    │
│        (using images from the prow-images repo)         │
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │   Kaniko    │  │    Kind     │  │     E2E     │      │
│  │   image     │  │    image    │  │    image    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
│                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │     Git     │  │   CI-Gen    │  │  Auto-Sec   │      │
│  │    image    │  │    image    │  │    image    │      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────┬───────────────────────────────────────────────┘
          │
          │ results
          ▼
┌─────────────────────────────────────────────────────────┐
│                GitHub Status / Prow UI                  │
└─────────────────────────────────────────────────────────┘

Troubleshooting Guide

Image Build Failures

Problem: make image-kaniko fails

  • Check the Dockerfile syntax
  • Verify that the base image is available
  • Ensure dependencies are vendored
  • Check that the Docker daemon is running

Jobs Not Running

Problem: a Prow job does not trigger on a PR

  • Verify that the trigger plugin is enabled in prow-configs
  • Check that the job name matches the configuration
  • Ensure the prow-configs release object status is Succeeded
  • Verify that the manifest file was processed correctly

Manual Trigger Errors

Problem: "job X not found in configuration"

  • Verify the exact job name (it is case-sensitive)
  • Check that the org/repo combination is correct
  • Ensure the latest prow-configs has been deployed
  • Verify that the job type matches (presubmit/postsubmit)

Image Version Mismatches

Problem: a job runs an old image version

  • Check the image tag in the job specification
  • Verify that the new image was pushed to the registry
  • Force a fresh pull by avoiding the :latest tag
  • Check that imagePullPolicy is set correctly

Conclusion

The prow-images ecosystem is a comprehensive, production-grade CI/CD infrastructure built on Kubernetes and Prow. The three-repository architecture gives a clean separation of concerns:

  • prow-images: tools and utilities (the "how")
  • prow-configs: job definitions and pipelines (the "what")
  • manual-trigger: an out-of-band control plane (the "when")

Together, these components deliver:

  • Automated builds and testing
  • Secure container image creation
  • Ephemeral test environments
  • Automated security management
  • Flexible job triggering
  • CI/CD that scales to enterprise workloads

The manifest-driven approach of the CI Generator dramatically reduces the complexity of Prow configuration, keeping the system approachable for developers while preserving the full power of Kubernetes-native CI/CD.

Whether you are building new tools, adding test jobs, or manually triggering deployments, understanding how these three repositories interact is the key to using this CI/CD platform effectively.

Last updated: April 8, 2026

Linux Storage and Filesystem Deep Dive Series (Part 1): Overview and Architecture

About This Series

Welcome to the Linux Storage and Filesystem Deep Dive series! This series digs into the Linux kernel's storage subsystem and filesystem implementations at the source-code level, based on the Linux 6.4-rc1 kernel.

Series Outline

  1. Overview and architecture (this post)
  2. The VFS (virtual filesystem) layer in depth
  3. The block layer in detail
  4. The page cache and buffer cache
  5. Ext4 filesystem source analysis
  6. Btrfs filesystem core principles
  7. XFS filesystem implementation details
  8. I/O schedulers in depth
  9. Direct I/O and asynchronous I/O implementation
  10. Filesystem performance tuning and debugging

The Linux Storage Stack Architecture

The Linux storage stack is a layered architecture running from user space down to physical hardware, with the following main layers:

┌─────────────────────────────────────────────┐
│           User-space applications           │
│    (read, write, open, close, mmap...)      │
└─────────────────┬───────────────────────────┘
                  │ System Call Interface
┌─────────────────▼───────────────────────────┐
│      VFS (Virtual Filesystem Switch)        │
│   - struct inode, dentry, file, super       │
│   - generic file-operation interface        │
└─────────────────┬───────────────────────────┘
                  │
      ┌───────────┼───────────┐
      │           │           │
┌─────▼────┐ ┌───▼────┐ ┌───▼────┐
│   Ext4   │ │ Btrfs  │ │  XFS   │ ... (concrete filesystems)
└─────┬────┘ └───┬────┘ └───┬────┘
      │          │          │
      └──────────┼──────────┘
                 │
┌────────────────▼────────────────────────────┐
│        Page Cache / Buffer Cache            │
│   (struct address_space, struct page)       │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│               Block Layer                   │
│   - struct bio, struct request              │
│   - generic block-device operations         │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│               IO Scheduler                  │
│   - mq-deadline, BFQ, Kyber                 │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│          Block Device Drivers               │
│   - SCSI, NVMe, virtio-blk, ...             │
└────────────────┬────────────────────────────┘
                 │
┌────────────────▼────────────────────────────┐
│            Physical Storage                 │
│   - HDD, SSD, NVMe, RAID, ...               │
└─────────────────────────────────────────────┘

Core Data Structures Overview

1. VFS Layer Core Structures

The Linux kernel's VFS is built around the following core data structures:

// include/linux/fs.h

// inode: represents an object in the filesystem (file, directory, etc.)
struct inode {
    umode_t                 i_mode;       // file type and permissions
    unsigned short          i_opflags;
    kuid_t                  i_uid;        // owner UID
    kgid_t                  i_gid;        // owner GID
    unsigned int            i_flags;

    const struct inode_operations *i_op;  // inode operation methods
    struct super_block      *i_sb;        // owning superblock
    struct address_space    *i_mapping;   // page cache mapping

    loff_t                  i_size;       // file size in bytes
    struct timespec64       i_atime;      // access time
    struct timespec64       i_mtime;      // modification time
    struct timespec64       i_ctime;      // status change time

    unsigned long           i_ino;        // inode number
    dev_t                   i_rdev;       // device number (device files)

    // ... more fields
};

// dentry: directory entry, linking a path name to an inode
struct dentry {
    unsigned int            d_flags;
    seqcount_spinlock_t     d_seq;
    struct hlist_bl_node    d_hash;
    struct dentry           *d_parent;    // parent directory entry
    struct qstr             d_name;       // file name
    struct inode            *d_inode;     // associated inode

    const struct dentry_operations *d_op; // dentry operation methods
    struct super_block      *d_sb;        // owning superblock

    // ... more fields
};

// file: represents an open file
struct file {
    struct path             f_path;       // holds the dentry and vfsmount
    struct inode            *f_inode;     // cached inode pointer
    const struct file_operations *f_op;   // file operation methods

    spinlock_t              f_lock;
    atomic_long_t           f_count;      // reference count
    unsigned int            f_flags;      // open flags
    fmode_t                 f_mode;       // file mode
    loff_t                  f_pos;        // file position pointer

    struct address_space    *f_mapping;   // page cache mapping

    // ... more fields
};

// super_block: the superblock, representing a mounted filesystem
struct super_block {
    struct list_head        s_list;
    dev_t                   s_dev;        // device number
    unsigned char           s_blocksize_bits;
    unsigned long           s_blocksize;  // block size
    loff_t                  s_maxbytes;   // maximum file size
    struct file_system_type *s_type;      // filesystem type
    const struct super_operations *s_op;  // superblock operation methods

    unsigned long           s_flags;
    unsigned long           s_magic;      // magic number
    struct dentry           *s_root;      // root dentry

    // ... more fields
};

2. Block Layer Core Structures

// include/linux/blk_types.h

// bio: Block I/O, describing one block I/O operation
struct bio {
    struct bio              *bi_next;     // request queue list
    struct block_device     *bi_bdev;     // target block device
    unsigned int            bi_opf;       // operation flags (read/write, etc.)

    unsigned short          bi_vcnt;      // size of the bio_vec array
    unsigned short          bi_max_vecs;  // maximum number of bio_vecs

    atomic_t                __bi_cnt;     // reference count
    struct bio_vec          *bi_io_vec;   // pointer to the page vector array

    bio_end_io_t            *bi_end_io;   // I/O completion callback
    void                    *bi_private;  // private data

    // ... more fields
};

// request: a request composed of one or more bios
struct request {
    struct request_queue    *q;           // owning request queue
    struct blk_mq_ctx       *mq_ctx;      // multi-queue context
    struct blk_mq_hw_ctx    *mq_hctx;     // hardware queue context

    unsigned int            cmd_flags;    // command flags
    enum rq_qos_id          rq_qos_id;    // QoS ID

    sector_t                __sector;     // starting sector
    unsigned int            __data_len;   // data length

    struct bio              *bio;         // head of the bio list
    struct bio              *biotail;     // tail of the bio list

    // ... more fields
};

Key Source File Locations

Based on the Linux 6.4-rc1 kernel, the main source files live here:

VFS Layer

  • fs/ - core filesystem code
    • fs/namei.c - path lookup and resolution
    • fs/open.c - file open operations
    • fs/read_write.c - read and write operations
    • fs/inode.c - inode management
    • fs/dcache.c - the dentry cache
    • fs/super.c - superblock management

Block Layer

  • block/ - block layer code
    • block/blk-core.c - core functionality
    • block/blk-mq.c - the multi-queue implementation
    • block/bio.c - bio handling
    • block/blk-merge.c - request merging
    • block/blk-settings.c - block device settings

IO Schedulers

  • block/mq-deadline.c - the deadline scheduler
  • block/bfq-iosched.c - the BFQ (Budget Fair Queueing) scheduler
  • block/kyber-iosched.c - the Kyber scheduler

Concrete Filesystems

  • fs/ext4/ - the Ext4 filesystem
  • fs/btrfs/ - the Btrfs filesystem
  • fs/xfs/ - the XFS filesystem

Page Cache

  • mm/filemap.c - file mapping and the page cache
  • mm/readahead.c - the readahead mechanism
  • fs/buffer.c - the buffer cache

The Complete Path of a File Read

Let's follow a simple read() system call and watch the data flow from disk to user space:

User program: read(fd, buffer, size)
    |
    v
Syscall entry: sys_read() (fs/read_write.c)
    |
    v
VFS layer: vfs_read()
    | - checks permissions and file state
    | - calls file->f_op->read_iter()
    v
Filesystem layer: ext4_file_read_iter() (using Ext4 as the example)
    | - calls generic_file_read_iter()
    v
Page cache: filemap_read() (mm/filemap.c)
    | - looks up the data in the page cache
    | - on a cache hit, returns immediately
    | - on a cache miss, continues...
    v
Filesystem readpage: ext4_readpage()
    | - prepares to read the data from disk
    | - builds a bio structure
    v
Block layer: submit_bio() (block/bio.c)
    | - submits the bio to the block layer
    | - requests may be merged
    v
IO scheduler: the scheduler processes the request
    | - mq-deadline / BFQ / Kyber
    | - optimizes I/O ordering
    v
Block driver: SCSI/NVMe/... driver
    | - issues commands to the hardware
    v
Hardware: the physical storage device performs the read
    |
    v (interrupt return)
Block driver: interrupt handler
    |
    v
Block layer: bio_endio()
    | - invokes the bio->bi_end_io() callback
    v
Page cache: the data is marked up to date
    |
    v
VFS: copies the data into the user-space buffer
    |
    v
User program: read() returns

Core Concepts

1. The Page Cache

The page cache is one of the core components of Linux memory management. It keeps file contents cached in memory, avoiding repeated disk accesses.

Key properties:

  • Caches file data in page-sized units (typically 4KB)
  • Uses struct address_space to map files to physical pages
  • Supports readahead to speed up sequential reads
  • Supports writeback to defer writes to disk

Code location: mm/filemap.c
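The hit/miss behavior plus readahead can be mimicked with a toy model (our own construction, vastly simpler than the kernel's): pages are cached by (file, page index), and a miss faults in the next few pages as well.

```python
# Illustrative page-cache model with 4 KB pages and simple sequential
# readahead. All names here are ours, not the kernel's.
PAGE_SIZE = 4096

class PageCache:
    def __init__(self, backing: dict):
        self.backing = backing   # path -> bytes, standing in for the disk
        self.pages = {}          # (path, page_idx) -> bytes
        self.hits = self.misses = 0

    def read_page(self, path, idx, readahead=2):
        key = (path, idx)
        if key in self.pages:
            self.hits += 1
            return self.pages[key]
        self.misses += 1
        # Cache miss: fault in this page plus `readahead` following pages.
        for i in range(idx, idx + 1 + readahead):
            data = self.backing[path][i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
            if data:
                self.pages[(path, i)] = data
        return self.pages[key]
```

Sequential reads after the first fault mostly hit the cache, which is exactly why readahead pays off for streaming workloads.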

2. 块 I/O 层(Block Layer)

块设备层是连接文件系统和设备驱动的中间层。

核心功能:

  • 将文件系统的 I/O 请求转换为块设备请求
  • I/O 请求的合并和重排序
  • 支持多队列(blk-mq)架构
  • 提供 I/O 统计和 QoS 控制

代码位置: block/

3. VFS (Virtual Filesystem Switch)

VFS is the abstraction layer over all filesystems, providing a unified interface.

Design philosophy:

  • Presents a unified system-call interface upward
  • Defines standard filesystem operation interfaces downward
  • Uses an object-oriented design (polymorphism via function pointers)

Code location: fs/

Code Example: A Minimal Filesystem

To better understand VFS, here is a minimal filesystem example (modeled on ramfs):

// Superblock operations
static const struct super_operations simple_super_ops = {
	.statfs		= simple_statfs,
	.drop_inode	= generic_delete_inode,
};

// inode operations
static const struct inode_operations simple_dir_inode_operations = {
	.create	= simple_create,
	.lookup	= simple_lookup,
	.link	= simple_link,
	.unlink	= simple_unlink,
	.mkdir	= simple_mkdir,
	.rmdir	= simple_rmdir,
	.mknod	= simple_mknod,
	.rename	= simple_rename,
};

// File operations
static const struct file_operations simple_file_operations = {
	.read_iter	= generic_file_read_iter,
	.write_iter	= generic_file_write_iter,
	.mmap		= generic_file_mmap,
	.fsync		= noop_fsync,
	.llseek		= generic_file_llseek,
};

// Fill in the superblock
static int simple_fill_super(struct super_block *sb, void *data, int silent)
{
	struct inode *inode;

	sb->s_blocksize = PAGE_SIZE;
	sb->s_blocksize_bits = PAGE_SHIFT;
	sb->s_magic = 0x12345678;
	sb->s_op = &simple_super_ops;

	// Create the root inode
	inode = new_inode(sb);
	if (!inode)
		return -ENOMEM;

	inode->i_ino = 1;
	inode->i_mode = S_IFDIR | 0755;
	inode->i_op = &simple_dir_inode_operations;
	inode->i_fop = &simple_file_operations;

	// Create the root dentry
	sb->s_root = d_make_root(inode);
	if (!sb->s_root)
		return -ENOMEM;

	return 0;
}

Performance-Critical Paths

In the Linux storage stack, these are the key performance paths:

1. Fast Paths

  • Page cache hits
  • Zero-copy operations (sendfile, splice)
  • Direct I/O (bypassing the page cache)

2. Slow Paths

  • Disk I/O caused by page cache misses
  • Synchronous writes and fsync operations
  • Metadata operations (creating and deleting files)

Debugging Tools

The following tools are invaluable when studying and debugging the storage system:

System Tools

# Block device I/O statistics
iostat -x 1

# Trace block I/O
blktrace -d /dev/sda -o - | blkparse -i -

# Filesystem cache statistics
cat /proc/meminfo | grep -i cache

# VFS statistics
cat /proc/sys/fs/dentry-state
cat /proc/sys/fs/inode-state

Kernel Tools

# Dynamic tracing (requires CONFIG_DYNAMIC_FTRACE)
trace-cmd record -e block:* -e vfs:* command

# BPF tools
biolatency   # I/O latency distribution
biosnoop     # trace block I/O events

Coming Up Next

In the next article, we will dig into the VFS layer, covering:

  • Lifecycle management of inodes, dentries, and files
  • Implementation details of the path lookup algorithm
  • Design and optimization of the dcache (dentry cache)
  • Source-level analysis of how VFS handles various file operations

References

  1. Linux kernel source (v6.4-rc1): https://git.kernel.org/
  2. Linux kernel documentation: Documentation/filesystems/
  3. Professional Linux Kernel Architecture
  4. Linux Kernel Development

Author's note: All source analysis in this series is based on Linux kernel v6.4-rc1. Implementation details may change as the kernel evolves, but the core design ideas remain relatively stable.

Understanding CIMaster: Intelligent CI Cluster Coordination at Scale

In modern cloud-native development, continuous integration (CI) pipelines are the backbone of software delivery. At scale, managing shared test infrastructure becomes a critical challenge. This is where CIMaster comes in—a sophisticated cluster management service designed to coordinate access to shared CI test clusters, ensuring efficient resource utilization and preventing test conflicts.

The Problem: Shared CI Infrastructure at Scale

In large organizations running hundreds or thousands of CI jobs daily, test clusters are expensive resources that need to be shared efficiently. Key challenges include:

  1. Resource Contention: Multiple CI jobs competing for limited test clusters
  2. Cluster State Management: Tracking which clusters are available, occupied, or held for debugging
  3. Manual Intervention: Developers needing to hold clusters for investigation without blocking others
  4. Dynamic Provisioning: Creating new clusters on-demand when capacity is insufficient
  5. Lifecycle Management: Automatically releasing clusters after use or expiration

CIMaster addresses all these challenges through a centralized coordination service.

Architecture Overview

CIMaster is a Kubernetes-native service written in Go that provides a REST API for cluster lifecycle management. It consists of several key components:

Core Components

┌────────────────────────────────────────────────────────┐
│                    CIMaster Service                    │
│                                                        │
│  ┌──────────────────┐       ┌────────────────────────┐│
│  │ HTTP API Server  │       │    Cluster Manager     ││
│  │   (Port 8080)    │◄─────►│ - State Management     ││
│  │                  │       │ - Allocation Logic     ││
│  │ /getvacant       │       │ - Hold Expiration      ││
│  │ /holdcluster     │       │                        ││
│  │ /releasecluster  │       └───────────┬────────────┘│
│  │ /createcluster   │                   │             │
│  │ ...              │                   │             │
│  └──────────────────┘                   │             │
│                                         │             │
│  ┌──────────────────┐       ┌───────────▼────────────┐│
│  │  Metrics Server  │       │ Kubernetes ConfigMap   ││
│  │   (Port 8090)    │       │ - cluster.json         ││
│  │                  │       │ - Optimistic Locking   ││
│  └──────────────────┘       └────────────────────────┘│
│                                                        │
└────────────────┬───────────────────────────────────────┘
                 │ HTTP POST
                 ▼
      ┌───────────────────────┐
      │ Prow Manual Trigger   │
      │   /manual-trigger     │
      └───────────────────────┘

1. Cluster Manager (cluster-manager.go)

The heart of the system, responsible for:

  • Cluster Allocation: Finding and assigning vacant clusters to CI jobs
  • Hold Management: Allowing developers to reserve clusters for debugging (with 6-hour expiration)
  • Automatic Cleanup: Periodically releasing expired holds
  • Integration with Prow: Triggering cluster creation through Prow’s manual-trigger endpoint

2. Cluster Operations (cluster-ops.go)

Implements the ClusterInterface with operations like:

  • OccupyVacantCluster: Atomically allocate an available cluster
  • FinishOccupiedCluster: Return a cluster to the available pool
  • HoldCluster/ReleaseCluster: Manual hold management
  • AddCluster/DeleteCluster: Cluster inventory management

3. State Persistence

All cluster state is stored in a Kubernetes ConfigMap (clusters in the ci namespace):

[
  {
    "name": "cluster-01",
    "region": "us-west",
    "status": "testing",
    "lastJob": "e2e-conformance",
    "lastBuild": "12345",
    "lastTriggerName": "john",
    "hold": false,
    "disabled": false,
    "purpose": "tess-ci",
    "osimage": "centos-atomic-7.6.1810-qcow2"
  }
]

Optimistic Locking prevents race conditions during concurrent updates using Kubernetes ResourceVersion.

Integration with Prow’s Manual Trigger

One of CIMaster’s powerful features is its integration with Prow through the manual-trigger component. This enables dynamic cluster provisioning when existing capacity is insufficient.

What is Prow Manual Trigger?

Prow is Kubernetes' CI/CD system. The manual-trigger component (prow/cmd/manual-trigger in the test-infra repository) is an HTTP server that allows programmatic creation of ProwJobs outside the normal GitHub webhook flow.

Key Capabilities:

  • Accepts HTTP POST requests with job specifications
  • Creates ProwJob custom resources in Kubernetes
  • Supports presubmit, postsubmit, and periodic job types
  • Injects environment variables (like AUTHOR) into jobs

How CIMaster Uses Manual Trigger

When a user calls CIMaster's /createcluster endpoint:

curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"

CIMaster performs the following flow:

// 1. Construct Prow request
prowRequest := types.ProwManualTriggerRequest{
	Org:      "tess",
	Repo:     "tessops",
	BaseRef:  "master",       // from branch parameter
	ProwType: "postsubmit",
	ProwJob:  "e2e-k8s-1.32", // cluster creation job
	User:     "john",         // sets AUTHOR env var
}

// 2. Send to Prow manual-trigger endpoint
resp, err := http.Post(
	"https://prow.tess.io/manual-trigger",
	"application/json",
	requestBody,
)

// 3. Return status to caller
On the Prow side, the manual-trigger service:

// 1. Receives the request
func (s *server) handleManualTrigger(w http.ResponseWriter, r *http.Request) {
	var req triggerRequest
	json.NewDecoder(r.Body).Decode(&req)

	// 2. Looks up the job definition from config
	postsubmits := cfg.PostsubmitsStatic[req.Org+"/"+req.Repo]
	for _, p := range postsubmits {
		if p.Name == req.ProwJob {
			prowJob = createProwJobFromPostsubmit(p, req)
			break
		}
	}

	// 3. Injects AUTHOR environment variable
	if req.User != "" {
		addAuthorEnvToProwJob(prowJob, req.User)
	}

	// 4. Creates ProwJob in Kubernetes
	prowJobClient.Create(ctx, prowJob, metav1.CreateOptions{})

	// 5. Waits for BuildID and returns status link
	statusLink := fmt.Sprintf("https://prow.tess.io/prowjob?prowjob=%s", prowJob.Name)
	logLink := fmt.Sprintf("https://prow.tess.io/log?job=%s&id=%s", req.ProwJob, buildID)
}

Request-Response Flow

┌──────────┐         ┌──────────┐         ┌──────────────┐         ┌────────────┐
│   User   │         │ CIMaster │         │ Prow Manual  │         │ Kubernetes │
│          │         │          │         │   Trigger    │         │            │
└────┬─────┘         └────┬─────┘         └──────┬───────┘         └─────┬──────┘
     │                    │                      │                       │
     │ POST /createcluster│                      │                       │
     │ user=john          │                      │                       │
     ├───────────────────►│                      │                       │
     │                    │                      │                       │
     │                    │ POST /manual-trigger │                       │
     │                    │ {org, repo, prowjob} │                       │
     │                    ├─────────────────────►│                       │
     │                    │                      │                       │
     │                    │                      │ Create ProwJob        │
     │                    │                      │ with AUTHOR=john      │
     │                    │                      ├──────────────────────►│
     │                    │                      │                       │
     │                    │                      │◄──────────────────────┤
     │                    │                      │  ProwJob Created      │
     │                    │                      │                       │
     │                    │◄─────────────────────┤                       │
     │                    │ {success, job_name,  │                       │
     │◄───────────────────┤  status_link}        │                       │
     │ cluster creation   │                      │                       │
     │ triggered          │                      │                       │
     │                    │                      │                       │

The triggered ProwJob typically runs infrastructure-as-code (like Terraform or Ansible) to provision a new Kubernetes cluster, which is then added to CIMaster’s pool once ready.

Cluster Lifecycle State Machine

Clusters transition through several states:

      ┌───────────┐
      │ finished  │ ◄───────────────────┐
      │ (vacant)  │                     │
      └─────┬─────┘                     │
            │                           │
            │ /getvacant                │
            │                           │
            ▼                           │
      ┌──────────┐                      │
      │ testing  │                      │
      │(occupied)│                      │
      └─────┬────┘                      │
            │                           │
            │ /finishtest               │
            │                           │
            └───────────────────────────┘

Hold State (overlay):
      ┌─────────┐
      │ hold=   │
      │ false   │◄──── /releasecluster ────┐
      └────┬────┘                          │
           │                               │
           │ /holdcluster                  │
           │                               │
           ▼                               │
      ┌─────────┐                          │
      │ hold=   │                          │
      │ true    │──────────────────────────┘
      └─────────┘   (auto-expires in 6h)

Key Features and Implementation Details

1. Intelligent Allocation with Retry Logic

CIMaster implements exponential backoff with jitter to handle concurrent allocation:

type RandomBackoff struct {
	MinBackoff time.Duration
	MaxBackoff time.Duration
	rng        *rand.Rand
}

func (rb *RandomBackoff) GetRetryInterval() time.Duration {
	delta := rb.MaxBackoff - rb.MinBackoff
	return rb.MinBackoff + time.Duration(rb.rng.Int63n(int64(delta)+1))
}

Each operation retries up to 3 times with random 50-200ms backoff to avoid thundering herd problems.

2. Automatic Hold Expiration

A background goroutine continuously checks for expired holds:

func (cm *ClusterManager) runCronReleaseHeldEnvs() {
	for {
		durationUntilNextExpire, err := cm.clearExpiredHolds()
		timer.Reset(durationUntilNextExpire)
		<-timer.C
	}
}

This ensures clusters don’t remain locked indefinitely if developers forget to release them.

3. Multi-Purpose Cluster Support

CIMaster supports different cluster types:

  • tess-ci: Standard CI test clusters
  • tnet-ci: Network-specific test clusters with OS image selection

Allocation respects purpose and OS image requirements:

if cluster.Purpose != purpose {
	continue // skip incompatible clusters
}
if cluster.Purpose == TnetCI && cluster.OSImage != osimage {
	continue // skip wrong OS image
}

4. Admin Authorization

Protected endpoints use a simple file-based authorization:

func checkUser(h http.HandlerFunc, users []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		userName := r.URL.Query().Get("name")
		if !contains(users, userName) {
			fmt.Fprintf(w, "user %s is not authorized", userName)
			return
		}
		h(w, r)
	}
}

Admin users are loaded from /botadmin/users file (semicolon-separated).

5. Observability

  • Prometheus Metrics: Exposed on port 8090 (/metrics)
  • Structured Logging: All operations logged with correlation IDs
  • Graceful Shutdown: 120-second grace period to handle in-flight requests

API Examples

Allocating a Cluster for CI

# Get a vacant cluster for build #123
CLUSTER=$(curl -s "http://cimaster:8080/getvacant?build=123&job=e2e-test&email=ci-bot@ebay.com")
echo "Using cluster: $CLUSTER"

# Run tests...

# Return cluster to pool
curl "http://cimaster:8080/finishtest?cluster=$CLUSTER"

Debugging Workflow

# Hold cluster for investigation
curl "http://cimaster:8080/holdcluster?cluster=cluster-05&name=alice&desc=debugging+network+issue"

# Investigate...
kubectl get pods -n test-namespace

# Release when done
curl "http://cimaster:8080/releasecluster?cluster=cluster-05&name=alice"

Creating a New Cluster

# Trigger cluster creation via Prow
curl "http://cimaster:8080/createcluster?user=alice&branch=master&job=e2e-k8s-1.32"
# Response: cluster creation triggered successfully: {...}

# Monitor Prow job
# https://prow.tess.io/prowjob?prowjob=<job-name>

# Once ready, admin adds it to the pool
curl "http://cimaster:8080/addcluster?cluster=cluster-20&region=eu-central&name=admin"

JSON API Support

For programmatic access:

curl -H "Accept: application/json" "http://cimaster:8080/getvacant?build=123&job=test"

Response:

{
  "name": "cluster-07",
  "region": "us-west",
  "status": "testing",
  "lastBuild": "123",
  "lastJob": "test",
  "lastTriggerName": "john",
  "purpose": "tess-ci"
}

Deployment

CIMaster runs as a Kubernetes Deployment with 3 replicas for high availability:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cimaster
  namespace: ci
spec:
  replicas: 3
  minReadySeconds: 90
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      containers:
      - name: cimaster
        image: hub.tess.io/tess/cimaster:v0.0.37
        command:
        - cimaster
        - --manageCluster=true
        - --cluster-config-map=clusters
        - --botAdminDir=/botadmin
        - --prow-url=https://prow.tess.io/manual-trigger
        - --default-prow-job=e2e-k8s-1.32
        ports:
        - containerPort: 8080  # API
        - containerPort: 8090  # Metrics

Performance Characteristics

  • Allocation Latency: ~100-300ms (includes ConfigMap read-write cycle)
  • Retry Overhead: 50-200ms per retry (max 3 attempts)
  • Hold Expiration Check: Every 10 minutes (default)
  • Concurrency: Safe for multiple replicas via optimistic locking

Real-World Impact

At eBay’s TESS platform, CIMaster manages:

  • 20+ shared test clusters across multiple regions
  • Hundreds of CI jobs daily from various teams
  • 6-hour automatic hold expiration preventing resource lock-ups
  • Sub-second allocation for most requests
  • Dynamic scaling through Prow integration

Future Enhancements

Potential improvements being considered:

  1. Priority Queues: Allow critical jobs to jump the allocation queue
  2. Cluster Health Checks: Automatic disabling of unhealthy clusters
  3. Usage Analytics: Track allocation patterns and optimize capacity
  4. Webhook Notifications: Slack/email alerts for hold expirations
  5. Multi-Cluster Federation: Coordinate across multiple Kubernetes clusters

Conclusion

CIMaster demonstrates how a relatively simple coordination service can solve complex resource management challenges in CI/CD infrastructure. By combining:

  • Stateful cluster tracking in Kubernetes ConfigMaps
  • Optimistic locking for safe concurrent access
  • Automatic expiration for abandoned holds
  • Prow integration for dynamic provisioning
  • REST API for easy integration

…it provides a robust foundation for shared test infrastructure at scale.

The integration with Prow’s manual-trigger component is particularly elegant—CIMaster doesn’t need to know how to create clusters, only when to request them. This separation of concerns allows infrastructure teams to evolve cluster provisioning strategies independently.

Whether you’re building CI infrastructure for a large organization or looking to optimize resource utilization in your Kubernetes platform, the patterns demonstrated by CIMaster offer valuable insights into distributed system coordination.


This article explores the internal architecture of CIMaster, a production cluster coordination service. All code examples are from the actual implementation.
