Kubernetes Core Components Learning Series — Complete Guide and Learning Roadmap

A navigation hub for the Kubernetes core-components deep-dive series, offering a systematic learning path and an interview preparation guide.
The scheduler is vLLM’s central orchestrator. Every microsecond, it makes critical decisions: which requests to process, how many tokens to compute, when to preempt requests, and how to maximize GPU utilization. This post dives deep into the scheduler’s algorithms and implementation.
At any moment, the scheduler must juggle many in-flight requests with different lengths, phases, and arrival times. Unlike traditional batch processing, LLM serving faces unique challenges: requests arrive continuously, output lengths are unknown in advance, and prefill and decode steps have very different compute costs.
vLLM’s key innovation: continuous batching (also called iteration-level batching).
With static batching, the server waits for a batch to fill before running it:

```python
# Static batching - wait for batch to fill
```

Problems: new arrivals must wait for the next batch to form, and because requests finish at different times, slots sit idle until the slowest request in the batch completes.

```python
# Continuous batching - add/remove every iteration
```

Benefits: requests join and leave the batch at every iteration, so new requests start immediately, finished requests free their slots right away, and the GPU stays busy.
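The iteration-level idea above can be sketched in a few lines. This is a toy model (the `Request` fields and batch-size limit are illustrative, not vLLM's actual types): the batch is rebuilt every iteration, so a finished request's slot is handed to a waiting request immediately.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    id: str
    num_tokens: int   # tokens this request will generate
    generated: int = 0

def continuous_batching(requests, max_batch_size=2):
    """Toy sketch of iteration-level batching: the batch is rebuilt every
    iteration, so finished requests leave and waiting ones join at once."""
    waiting = deque(requests)
    running: list[Request] = []
    trace = []                          # batch composition per iteration
    while waiting or running:
        # Admit new requests at every iteration, not once per batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        trace.append([r.id for r in running])
        for r in list(running):         # one decode step per request
            r.generated += 1
            if r.generated == r.num_tokens:
                running.remove(r)       # slot frees up immediately
    return trace
```

Running `continuous_batching([Request("A", 1), Request("B", 3), Request("C", 2)])` shows C being admitted the moment A finishes, with no batch-boundary wait.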
Location: vllm/v1/core/sched/scheduler.py
```python
class Scheduler:
```
Requests flow through states:
```text
WAITING
```

With possible transitions:

- WAITING → RUNNING: scheduled for the first time
- RUNNING → RUNNING: continues in the batch
- RUNNING → PREEMPTED: temporarily removed (if memory is tight)
- PREEMPTED → WAITING: returns to the queue
- RUNNING → FINISHED: generation complete

Waiting Queue: new requests waiting to start
```python
class RequestQueue:
```

Running List: requests currently being processed — simply a list; every request in it is processed each iteration.
```python
def schedule(self) -> SchedulerOutput:
```
Each iteration has a token budget (typically 2048-8192):
```python
token_budget = self.max_num_scheduled_tokens  # e.g., 4096
```
Trade-off:
Running requests get priority:
```python
for request in self.running:
```
Key insight: Running requests are usually in decode mode (1 token/iter), so they consume minimal token budget.
If token budget remains, admit new requests:
```python
while token_budget > 0 and len(self.waiting) > 0:
```
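Putting the last two steps together, one scheduling iteration can be sketched as follows. This is a simplification of the real scheduler (the `Req` type and costs are illustrative): running requests are charged against the budget first, then waiting requests are admitted with whatever budget remains, chunking their prefill if needed.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Req:
    id: str
    remaining_prompt: int   # prompt tokens not yet prefilled (0 = decode mode)

def schedule_step(running, waiting, token_budget=4096):
    """Sketch of one scheduling iteration: returns {request id: tokens
    scheduled this step}. Running requests first, then admissions."""
    scheduled = {}
    # Running requests first: a decode-mode request costs 1 token.
    for r in running:
        cost = r.remaining_prompt or 1
        if cost > token_budget:
            break
        scheduled[r.id] = cost
        token_budget -= cost
    # Admit waiting requests (prefill) into the remaining budget.
    while waiting and token_budget > 0:
        r = waiting.popleft()
        cost = min(r.remaining_prompt, token_budget)  # chunked prefill
        r.remaining_prompt -= cost
        running.append(r)
        scheduled[r.id] = cost
        token_budget -= cost
    return scheduled
```

With two decode-mode requests running and a 5,000-token prompt waiting, the decodes cost 2 tokens and the newcomer gets the remaining 4,094 tokens of budget as its first prefill chunk.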
When a request is preempted:
```python
def preempt_request(self, request: Request):
```
Preemption is expensive: All computation is lost!
Strategy: Preempt requests with least progress to minimize waste.
Long prompts are split into chunks — for example, a 10,000-token prompt with chunk_size=2048 is processed over five iterations (four full 2048-token chunks plus a final 1808-token chunk).
Benefits:
Implementation:
```python
if request.num_computed_tokens < request.num_prompt_tokens:
```
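The chunking arithmetic above can be made concrete with a small helper (an illustrative sketch, not vLLM's implementation):

```python
def prefill_chunks(num_prompt_tokens, chunk_size=2048):
    """Split a long prompt into per-iteration chunk sizes, as chunked
    prefill does; the last chunk may be partial."""
    chunks = []
    computed = 0
    while computed < num_prompt_tokens:
        step = min(chunk_size, num_prompt_tokens - computed)
        chunks.append(step)
        computed += step
    return chunks
```

For the 10,000-token example, this yields four 2048-token chunks followed by a final 1808-token chunk.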
Requests can have priorities:
```python
request_high_priority = Request(
```
Use cases:
For speculative decoding, the scheduler coordinates the draft and target models: the draft model proposes a small number of tokens (e.g., k=5), which the target model then verifies.
Scheduler must:
Dynamically reorder to optimize cache hits:
```python
def reorder_for_cache_hits(waiting: list[Request]) -> list[Request]:
```
The scheduler must ensure KV cache fits:
```python
def can_admit_request(self, request: Request, token_budget: int) -> bool:
```
When memory is tight:
```python
if self.kv_cache_manager.get_num_free_blocks() < MIN_FREE_BLOCKS:
```
For multi-modal models (images, audio):
```python
def can_fit_encoder_inputs(self, request: Request) -> bool:
```
The scheduler tracks performance:
```python
@dataclass
```
The scheduler balances multiple objectives:
Let’s trace a complete scheduling scenario:
```text
Waiting: [A (100 tokens), B (50 tokens)]
```

Iteration 1 — schedule A and B (both new).

GPU computes: A prefill (100 tokens) + B prefill (50 tokens)

Iteration 2 — both are now in decode mode; a new request C (198-token prompt) is admitted.

GPU computes: A decode (1 token) + B decode (1 token) + C prefill (198 tokens)

Later iterations — A finishes (having generated 50 tokens) and leaves the batch; C finishes its prefill and switches to decode.
vLLM supports different scheduling policies:

- FCFS (simple queue) — Pros: simple and fair. Cons: head-of-line blocking.
- Priority (priority heap) — Pros: important requests go first. Cons: low-priority requests can starve.
- Shortest-job-first (sort by estimated completion time) — Pros: minimizes average latency. Cons: long requests starve.
Common problems and their symptoms:

- Excessive preemption — many requests restarting
- Underfilled batches — token_budget not fully used
- Slow first token — TTFT too high
- Continuous batching eliminates head-of-line blocking by adding/removing requests every iteration
- Token budget controls batch size and the latency/throughput trade-off
- Running requests get priority to minimize decode latency
- Preemption is a last resort when memory is tight, discarding all progress
- Chunked prefill balances long prompts with interactive requests
- Prefix caching integration requires scheduler awareness for efficient admission
In Part 4, we’ll explore Request Processing - how requests flow through the system from tokenization to streaming output, and how state is managed across iterations.
The scheduler is vLLM’s brain, making thousands of split-second decisions to maximize GPU utilization while meeting latency SLAs. Next, we’ll see how requests actually flow through the system end-to-end.
PagedAttention is vLLM’s breakthrough innovation that revolutionized LLM serving. Before PagedAttention, serving LLMs efficiently was plagued by memory fragmentation and waste. This post dives deep into how PagedAttention works, why it matters, and how it’s implemented in vLLM.
In transformer models, attention layers compute key (K) and value (V) vectors for each token. During generation:
The challenge: We must store all previous K, V tensors (the “KV cache”) to avoid recomputation.
Memory size: for Llama-3-8B, the KV cache costs roughly 512 KB per token (2 tensors × 32 layers × 32 heads × 128 head_dim × 2 bytes for fp16).

For a 2048-token sequence: 1 GB just for KV cache!
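The arithmetic behind that figure, with the model dimensions used throughout this post (512 KB/token), can be checked directly:

```python
def kv_bytes_per_token(num_layers=32, num_heads=32, head_dim=128,
                       dtype_bytes=2):
    """KV-cache size per token: one K and one V vector per layer and
    head. Defaults match the 512 KB/token figure used in this post."""
    return 2 * num_layers * num_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()     # 524,288 bytes = 512 KB
sequence = 2048 * per_token          # 1 GiB for a 2048-token sequence
```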
Traditional serving systems pre-allocate contiguous memory for each request’s maximum length:
```text
Request 1 (max 2048 tokens, actual 512):
```

Problems: most of the reserved memory is never used (internal fragmentation), and because every request reserves its worst-case length, concurrency is severely limited.
Real impact: Without PagedAttention, you might serve 10 concurrent requests. With PagedAttention, you can serve 50+ requests with the same memory!
PagedAttention applies virtual memory concepts to attention computation:
Key idea: Instead of storing KV cache contiguously, break it into fixed-size blocks. Each request gets a block table mapping logical positions to physical blocks.
| Virtual Memory (OS) | PagedAttention (vLLM) |
|---|---|
| Page | Block (e.g., 16 tokens) |
| Page Table | Block Table |
| Physical Memory | GPU Memory |
| Memory Allocator | Block Pool |
| Page Fault | Cache Miss |
| Copy-on-Write | Prefix Sharing |
Instead of contiguous allocation:
```text
Request 1 (512 tokens, 32 blocks):
```
Advantages:
Location: vllm/v1/core/kv_cache_utils.py
```python
class KVCacheBlock:
```
Each block represents a fixed number of token positions (block_size, typically 16).
Physical storage:
```python
# In GPU memory, shape: [num_blocks, num_heads, block_size, head_dim]
```
Location: vllm/v1/core/block_pool.py
The BlockPool manages all KV cache blocks:
```python
class BlockPool:
```

Free Block Queue: implements LRU eviction for cached blocks:

```text
free_block_queue:
```
Allocation:
```python
def allocate(self) -> KVCacheBlock:
```

Deallocation:

```python
def free(self, block: KVCacheBlock) -> None:
```
Each request maintains a block table mapping logical to physical blocks:
```python
# Example: Request with 50 tokens, block_size=16
```
Lookup during attention:
```python
def get_physical_block(logical_block_idx: int) -> int:
```
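The translation step can be sketched end to end — this is an illustrative standalone version (the `table` contents are made up), analogous to a page-table walk but at token granularity:

```python
BLOCK_SIZE = 16

def lookup(block_table, token_position):
    """Translate a logical token position into (physical block, offset)
    using the request's block table."""
    logical_block = token_position // BLOCK_SIZE
    offset = token_position % BLOCK_SIZE
    return block_table[logical_block], offset

# A hypothetical request whose 50 tokens live in physical blocks 7, 23, 102, 5:
table = [7, 23, 102, 5]
```

Token 0 resolves to (block 7, offset 0); token 49 resolves to (block 5, offset 1) — the partially filled last block.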
Location: vllm/v1/core/kv_cache_manager.py
The KVCacheManager orchestrates block allocation for requests:
```python
class KVCacheManager:
```
One of PagedAttention’s killer features is prefix caching - sharing KV cache blocks across requests with common prefixes.
Each full block gets a hash based on its token content:
```python
def compute_block_hash(
```
Example:
```text
Prompt A: "Translate to French: Hello world"
```
But with a common system prompt:
```text
System: "You are a helpful assistant."
```
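A key detail of prefix-cache hashing is that each full block's hash is chained with the previous block's hash, so a hash identifies the entire prefix up to that block, not just that block's 16 tokens. A self-contained sketch (the hashing scheme here is illustrative, not vLLM's exact one):

```python
import hashlib

def chained_block_hashes(token_ids, block_size=16):
    """Hash each *full* block together with the previous block's hash;
    the trailing partial block gets no hash and is never shared."""
    hashes = []
    prev = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256(prev + str(block).encode()).hexdigest()
        hashes.append(h)
        prev = h.encode()
    return hashes
```

Two prompts sharing their first 16 tokens produce an identical first hash (a cache hit), while every hash after the point of divergence differs.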
When a new request arrives:
```python
def get_computed_blocks(self, request: Request) -> tuple[KVCacheBlocks, int]:
```
Blocks use reference counting for safe sharing:
```python
# Request A uses blocks [7, 23, 102]
```
Safety: A block is only freed when ref_cnt == 0.
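The rule can be captured in a few lines — a minimal refcounting sketch (names are illustrative): a block returns to the free pool only when its last user releases it.

```python
class Block:
    """Minimal refcounted KV-cache block."""
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_cnt = 0

free_pool = []

def acquire(block):
    """A request starts sharing this block."""
    block.ref_cnt += 1
    return block

def release(block):
    """A request stops using this block; free it only at ref_cnt == 0."""
    block.ref_cnt -= 1
    if block.ref_cnt == 0:          # last reference gone → reusable
        free_pool.append(block)
```

If requests A and B both hold block 7, A finishing leaves the block alive for B; only B's release returns it to the pool.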
When the cache is full:
```python
def evict_cached_block(self) -> KVCacheBlock:
```
During prefill, we compute attention for the entire prompt:
```python
def paged_attention_prefill(
```
In practice, vLLM uses optimized kernels (FlashAttention, FlashInfer) that work directly with block tables.
During decode, each request generates one token:
```python
def paged_attention_decode(
```
The decode kernel is highly optimized:
vLLM supports models with different cache requirements per layer (MQA, GQA, Mamba):
```python
kv_cache_config = KVCacheConfig(
```
Each request gets blocks from each group:
```python
# Request with 50 tokens
```
Before PagedAttention (contiguous allocation):
```text
10 requests × 2048 max length × 512 KB/token = 10.5 GB
```

With PagedAttention (block allocation):

```text
500 blocks allocated across all requests
```
Result: 4x more concurrent requests!
Real benchmarks (Llama-3-8B on H100):
| Metric | Without PagedAttention | With PagedAttention | Improvement |
|---|---|---|---|
| Concurrent Requests | 12 | 64 | 5.3x |
| Throughput (tok/s) | 1,500 | 8,000 | 5.3x |
| Memory Usage | 60 GB | 60 GB | Same |
| Latency (TTFT) | 45ms | 42ms | -7% |
With a common system prompt (500 tokens):
| Requests | Without Caching | With Caching | Speedup |
|---|---|---|---|
| 1st | 10ms (prefill) | 10ms | 1x |
| 2nd | 10ms | 1ms | 10x |
| 100th | 10ms | 1ms | 10x |
Cache hit rate: Typically 60-80% for chatbots with system prompts.
For models like Mistral with sliding windows:
```python
# Only keep last 4096 tokens in cache
```
When using speculative decoding:
```python
# Draft model proposes N tokens
```
Block hashes can collide:
```python
# Different token sequences with same hash (rare!)
```
Last block is often partial:
```python
# 50 tokens with block_size=16
```
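The block count always rounds up, so the last block is usually only partially filled (and cannot be shared via prefix caching until it fills). A quick sketch of the arithmetic:

```python
def num_blocks_needed(num_tokens, block_size=16):
    """Blocks required for a sequence, rounding up for the partial tail."""
    return -(-num_tokens // block_size)   # ceil division

full_blocks, tail_tokens = divmod(50, 16)  # 3 full blocks + 2-token tail
```

A 50-token request therefore occupies 4 blocks: 3 full ones plus a partial block holding 2 tokens.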
Block allocation must be thread-safe:
```python
class BlockPool:
```
- PagedAttention breaks the KV cache into fixed-size blocks, eliminating fragmentation and over-allocation
- Block tables map logical to physical blocks, similar to virtual memory page tables
- Prefix caching shares blocks across requests, dramatically reducing redundant computation
- Reference counting ensures safe sharing, preventing premature deallocation
- Near-zero memory waste enables 4-5x higher throughput in practice
- Attention kernels are optimized to work directly with block-indexed storage
In Part 3, we’ll explore the Scheduler - vLLM’s brain that decides which requests to process, when to preempt, and how to maximize GPU utilization using continuous batching.
PagedAttention is the foundation that makes vLLM’s exceptional performance possible. In the next post, we’ll see how the Scheduler builds on top of this to orchestrate request execution.
Before diving into specific components, we need to understand vLLM’s overall architecture. This post maps out the major components and how they interact, providing the foundation for deeper exploration in subsequent parts.
vLLM is designed around a few key principles:
Here’s what happens when you send a request to vLLM:
```text
User Request (HTTP/gRPC)
```
The V1 architecture (introduced in late 2024) uses a multi-process design for better CPU utilization and isolation. Let’s understand each process type:
Purpose: Handle HTTP/gRPC requests and I/O operations
Responsibilities:
Key Implementation: vllm/entrypoints/openai/api_server.py
The API server is stateless with respect to model execution. It doesn't know about GPU memory, KV caches, or model weights. It simply forwards EngineCoreRequest objects to the engine core and streams EngineCoreOutput objects back.

Process Count: by default, 1 API server. With data parallelism (--data-parallel-size N), this automatically scales to N API servers.
CPU Threads: Uses VLLM_MEDIA_LOADING_THREAD_COUNT threads (default 8) for parallel media loading.
Purpose: Schedule requests and coordinate model execution
Responsibilities:
Key Implementation: vllm/v1/engine/core.py
The engine core runs a tight loop:
```python
while True:
```
Process Count: One per data parallel rank. For --data-parallel-size 4, you get 4 engine cores.
CPU Usage: Runs a busy loop for low-latency scheduling decisions.
Purpose: Execute model forward passes on GPUs
Responsibilities:
Key Implementation: vllm/v1/worker/gpu_worker.py
Each GPU gets its own worker process. The worker:
Process Count: One per GPU. For 8 GPUs with --tensor-parallel-size 4 --data-parallel-size 2:
Purpose: Coordinate data parallel engines
Responsibilities:
Key Implementation: vllm/v1/engine/coordinator.py
Process Count: 1 if --data-parallel-size > 1, otherwise 0.
Let’s see some concrete examples:
```shell
vllm serve meta-llama/Llama-3-8B
```

Processes:

```shell
vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 4
```

Processes:

```shell
vllm serve meta-llama/Llama-3-8B --data-parallel-size 4
```

Processes:

```shell
vllm serve meta-llama/Llama-3-70B --tensor-parallel-size 2 --data-parallel-size 4
```

Processes:
The LLMEngine class is the main entry point for offline inference (using the Python API directly).
Location: vllm/v1/engine/llm_engine.py
Key responsibilities:
Key responsibilities include input processing (InputProcessor) and output processing (OutputProcessor).

Usage example:

```python
from vllm import LLM, SamplingParams
```
Under the hood, LLM creates an LLMEngine, which creates an EngineCore, which coordinates the workers.
The scheduler is the brain of vLLM. It decides:
Location: vllm/v1/core/sched/scheduler.py
The scheduler maintains three queues:
Scheduling algorithm (simplified):
```python
def schedule(self) -> SchedulerOutput:
```
We’ll explore the scheduler in depth in Part 3.
Manages memory for attention key-value caches using PagedAttention.
Location: vllm/v1/core/kv_cache_manager.py
Key concepts:
Example: If block size is 16 and we have 1000 blocks, we can serve:
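Working through that example's arithmetic (an illustrative sketch):

```python
def cache_capacity(num_blocks=1000, block_size=16):
    """Total token positions the KV cache can hold with the example
    numbers above: 1000 blocks × 16 tokens = 16,000 tokens."""
    return num_blocks * block_size

total = cache_capacity()
# That capacity could serve, e.g., 8 requests of 2,000 tokens each,
# or 32 requests of 500 tokens each.
```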
We’ll dive deep into PagedAttention in Part 2.
The worker process loads the model and executes forward passes.
Worker (vllm/v1/worker/gpu_worker.py):
ModelRunner (varies by backend):
Let’s trace a complete request through the system:
```text
POST /v1/completions
```

The prompt is tokenized to [791, 3139, 315, 9822, 374] and wrapped in an EngineCoreRequest object with request ID "req_abc123"; the result is streamed back as data: {"text": "Paris", "finish_reason": null}.

vLLM's configuration is centralized in VllmConfig:

```python
from vllm.config import VllmConfig
```
All components receive the same VllmConfig object, ensuring consistency.
Key configs:
The class hierarchy follows a consistent pattern:
```text
LLMEngine
```
Every class accepts VllmConfig in its constructor, providing access to all configuration options.
API servers and engine cores communicate via ZMQ:
```python
# API Server side
```
Why ZMQ?
GPU workers use NCCL for collective operations:
```python
# Tensor parallelism: all-reduce across GPUs
```
Communication patterns:
Understanding memory is crucial for vLLM:
```text
Total GPU Memory: 80GB (H100)
```
For a 16-token block size with 8B model:
Typical performance on H100 GPU:
| Metric | Value |
|---|---|
| Throughput | ~8,000 tokens/sec (Llama-3-8B) |
| Latency (TTFT) | ~20-50ms |
| Latency (TPOT) | ~10-15ms |
| Max Batch Size | ~256 concurrent requests |
| Memory Efficiency | ~95% (vs ~60% without PagedAttention) |
Now that we understand the overall architecture, we can dive deeper into specific components:
- Multi-process design: API servers + engine cores + GPU workers + (optional) DP coordinator
- All components share a single VllmConfig for consistency

In the next post, we'll explore PagedAttention in detail, understanding how vLLM achieves near-zero memory waste through clever memory management.
This series provides a comprehensive technical deep-dive into vLLM, one of the most important open-source projects for LLM inference and serving. Originally developed at UC Berkeley’s Sky Computing Lab, vLLM has become the de facto standard for high-performance LLM serving in production environments.
The prow-images repository is the core of a sophisticated CI/CD infrastructure built on Kubernetes Prow. It serves as a centralized collection of purpose-built container images that power every aspect of the continuous integration and delivery pipelines. This post examines the architecture, components, and workflows of the prow-images ecosystem, with particular attention to its relationship with the prow-configs repository and the manual-trigger service.

prow-images is a monorepo containing more than 35 specialized container images, each designed to handle a specific task in Prow jobs. The images range from basic utilities (such as Git operations) to sophisticated tooling (E2E test frameworks, Kubernetes cluster provisioning, and automated security PR generation).

Each component in the repository follows a consistent structure:

- a Dockerfile
- a VERSION file (e.g., v0.0.1)
- an entrypoint directory

A Makefile at the repository root orchestrates building all images and pushing them to the central registry hub.tess.io/prowimages/.

Let's take a closer look at some of the key components that make up this ecosystem:

CI Generator is one of the most critical components in the ecosystem. It automatically generates Prow job specifications from simplified manifest files.

Key features:

- .manifest files define jobs such as Build and UnitTest

How it works:

- Jobs are declared in a ci.manifest file
- Triggers fire on pull requests (onPr) or on tags (onTag), with regular-expression patterns supported

Example manifest snippet:

```yaml
tess/maintenance:
```
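The expansion a tool like this performs can be sketched in a few lines. This is a hypothetical illustration — the manifest schema (`image`, `onPr`) and the generated job fields are invented for the sketch, not the real CI Generator format:

```python
def generate_presubmits(manifest):
    """Expand a terse per-repo manifest into Prow-style presubmit job
    dicts. Field names here are illustrative, not the real schema."""
    jobs = []
    for repo, spec in manifest.items():
        for job_name in spec.get("onPr", []):
            jobs.append({
                "name": f"pull-{repo.replace('/', '-')}-{job_name.lower()}",
                "always_run": True,
                "spec": {"containers": [{"image": spec["image"]}]},
            })
    return jobs

manifest = {
    "tess/maintenance": {
        "image": "hub.tess.io/prowimages/ci:v0.0.1",  # hypothetical image
        "onPr": ["Build", "UnitTest"],
    }
}
```

The point of the design: developers write the small `manifest` dict; the generator owns the verbose, repetitive job YAML.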
The Kaniko image wrapper provides a way to build container images safely inside Kubernetes Pods, without access to a Docker daemon.

Capabilities:

Usage pattern in a Prow job:

```yaml
spec:
```

The Kind image makes it possible to spin up ephemeral Kubernetes clusters inside CI pipelines for E2E testing.

Features:

This specialized tool automates the creation of pull requests for security-related RBAC resources across multiple clusters.

Workflow:

- Scopes can be fcp, cluster, tessAppsAZ, tessNetAZ, or tessMasterAZ
- all: true targets every cluster in scope

Usage example:

```yaml
clusterRoles:
```

These components handle automatic approval and validation of pull requests:

Several E2E test images provide comprehensive testing capabilities:

Various specialized build tools:

Git-related utilities:
Purpose: container image definitions and build logic

Location: /Users/tashen/prow-images

Contents:

Build process:

```shell
# Build all images
```

Image registry: all images are pushed to hub.tess.io/prowimages/

Purpose: Prow job configuration and CI/CD pipeline definitions

Location: /Users/tashen/prow-configs

Structure:

```text
prow-configs/
```

Key files:

- ci.manifest: simplified job definitions processed by the CI Generator
- *.yaml: auto-generated Prow job specifications

Workflow: edit the ci.manifest file, then run make jobgen to generate the job specifications.

Purpose: manual job-triggering service

Location: /Users/tashen/test-infra/prow/cmd/manual-trigger

Functionality: an HTTP service that allows Prow jobs to be triggered without a GitHub event
Developer action (prow-configs):

```shell
# In prow-configs/jobs/myorg/myrepo/ci.manifest
```

CI generation:

Job execution:

- runs in a pr-${PULL_NUMBER} context

When a developer needs to test a specific commit:

```shell
curl -X POST "http://manual-trigger.tessprow/manual-trigger" \
```

The manual-trigger service:

- injects the AUTHOR=developer-name environment variable

Job execution:

Security team action:

Automated workflow:

A developer updates the Kaniko image:

- edits /Users/tashen/prow-images/kaniko/entrypoint/main.go
- bumps the version to v0.0.2 in /Users/tashen/prow-images/kaniko/VERSION

Local build:

```shell
cd /Users/tashen/prow-images
```

CI integration:

Prow-configs update:
- the latest tag updates all components automatically

The components communicate through the central image registry at hub.tess.io:

- images live under hub.tess.io/prowimages/

The CI Generator bridges simple manifests and complex Prow configurations:

- developers edit only .manifest files

All three repositories use Git for version control and triggering:

Everything runs on Kubernetes:

Each image in prow-images maintains a VERSION file:

```text
v0.0.1
```

This enables:

Jobs can be configured to run on different triggers:

- edit .manifest files rather than hand-writing job YAML
- regenerate specs with make jobgen
- use presubmit jobs for PR testing and postsubmit jobs for branch testing
- pass the user parameter for an audit trail

To add a new tool image:

```text
1. Create the directory: prow-images/mytool/
```

To add a CI job:

```text
1. Create/edit the ci.manifest file
```

To trigger a job manually:

```text
1. Look up the job name in prow-configs
```

```text
┌─────────────────────────────────────────────────────────┐
```
Problem: make image-kaniko fails

Problem: a Prow job does not trigger on a PR

Problem: "job X not found in config"

Problem: a job uses an old image version

- force a pull with the :latest tag

The prow-images ecosystem represents a comprehensive, production-grade CI/CD infrastructure built on Kubernetes and Prow. The three-repository architecture provides a clean separation of concerns:

Together, these components enable:

The manifest-driven approach of the CI Generator dramatically reduces the complexity of Prow configuration, keeping it approachable for developers while preserving the full power of Kubernetes-native CI/CD.

Whether you are building a new tool, adding a test job, or manually triggering a deployment, understanding how these three repositories interact is the key to using this powerful CI/CD platform effectively.

Last updated: April 8, 2026
Welcome to the Linux Storage and File Systems Deep Dive series! This series explores the storage subsystem and file-system implementations in the Linux kernel at the source-code level, based on the Linux 6.4-rc1 kernel source.

The Linux storage stack is a layered architecture stretching from user space down to the physical hardware, with the following main layers:

```text
┌─────────────────────────────────────────────┐
```

In the Linux kernel, the VFS uses the following core data structures:

```c
// include/linux/fs.h
```

```c
// include/linux/blk_types.h
```

Based on the Linux 6.4-rc1 kernel, the main source files are:

- fs/ — file-system core code
  - fs/namei.c — path lookup and resolution
  - fs/open.c — file open operations
  - fs/read_write.c — read and write operations
  - fs/inode.c — inode management
  - fs/dcache.c — dentry cache
  - fs/super.c — superblock management
- block/ — block layer code
  - block/blk-core.c — core functionality
  - block/blk-mq.c — multi-queue implementation
  - block/bio.c — bio handling
  - block/blk-merge.c — request merging
  - block/blk-settings.c — block device settings
  - block/mq-deadline.c — Deadline scheduler
  - block/bfq-iosched.c — BFQ (Budget Fair Queueing) scheduler
  - block/kyber-iosched.c — Kyber scheduler
- fs/ext4/ — Ext4 file system
- fs/btrfs/ — Btrfs file system
- fs/xfs/ — XFS file system
- mm/filemap.c — file mapping and page cache
- mm/readahead.c — readahead
- fs/buffer.c — buffer cache

Let's follow a simple read() system call to see how data flows from disk to user space:

```text
User program: read(fd, buffer, size)
```
The page cache is one of the core components of Linux memory management: it caches file contents in memory to avoid repeated disk accesses.

Key characteristics:

- struct address_space manages the mapping from a file to its physical pages

Code location: mm/filemap.c

The block layer is the middle layer connecting file systems to device drivers.

Core functions:

Code location: block/

The VFS is the abstraction layer over all file systems, providing a unified interface.

Design philosophy:

Code location: fs/
To better understand the VFS, let's look at a minimal file-system example (based on ramfs):

```c
// superblock operations
```

In the Linux storage stack, the following paths are performance-critical:

The following tools are very useful when studying and debugging storage systems:

```shell
# View block-device I/O statistics
```

```shell
# Dynamic tracing (requires CONFIG_DYNAMIC_FTRACE)
```

In the next post we will dive into the VFS virtual file system layer, covering:

See also Documentation/filesystems/ in the kernel source tree.

Author's note: all source analysis in this series is based on the Linux 6.4-rc1 kernel. Implementation details may change as the kernel evolves, but the core design ideas remain relatively stable.
In modern cloud-native development, continuous integration (CI) pipelines are the backbone of software delivery. At scale, managing shared test infrastructure becomes a critical challenge. This is where CIMaster comes in—a sophisticated cluster management service designed to coordinate access to shared CI test clusters, ensuring efficient resource utilization and preventing test conflicts.
In large organizations running hundreds or thousands of CI jobs daily, test clusters are expensive resources that need to be shared efficiently. Key challenges include:
CIMaster addresses all these challenges through a centralized coordination service.
CIMaster is a Kubernetes-native service written in Go that provides a REST API for cluster lifecycle management. It consists of several key components:
```text
┌─────────────────────────────────────────────────────────────┐
```
Cluster Manager (cluster-manager.go) — the heart of the system, responsible for:

Cluster Operations (cluster-ops.go) — implements the ClusterInterface with operations like:

- OccupyVacantCluster: atomically allocate an available cluster
- FinishOccupiedCluster: return a cluster to the available pool
- HoldCluster/ReleaseCluster: manual hold management
- AddCluster/DeleteCluster: cluster inventory management

All cluster state is stored in a Kubernetes ConfigMap (clusters in the ci namespace):

```json
[
```
Optimistic Locking prevents race conditions during concurrent updates using Kubernetes ResourceVersion.
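The read-modify-write loop behind this pattern can be sketched without Kubernetes at all. Here `FakeStore` is a stand-in for the ConfigMap (its API and the state strings are invented for the sketch): updates succeed only when the caller holds the latest resource version, and a conflict means "someone raced you — re-read and retry".

```python
class Conflict(Exception):
    """Raised when the stored resource version no longer matches."""

class FakeStore:
    """Stand-in for a ConfigMap with optimistic locking."""
    def __init__(self, data):
        self.data, self.resource_version = data, 1
    def get(self):
        return dict(self.data), self.resource_version
    def update(self, data, resource_version):
        if resource_version != self.resource_version:
            raise Conflict
        self.data = data
        self.resource_version += 1

def occupy_vacant(store, build_id, retries=3):
    """Sketch of an OccupyVacantCluster-style allocation: read the
    state, claim a vacant cluster, and retry on version conflict
    instead of taking a global lock."""
    for _ in range(retries):
        clusters, rv = store.get()
        for name, state in clusters.items():
            if state == "vacant":
                clusters[name] = f"occupied:{build_id}"
                try:
                    store.update(clusters, rv)
                    return name
                except Conflict:
                    break              # lost the race; re-read and retry
    return None
```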
One of CIMaster’s powerful features is its integration with Prow through the manual-trigger component. This enables dynamic cluster provisioning when existing capacity is insufficient.
Prow is Kubernetes’ CI/CD system. The manual-trigger component (/Users/tashen/test-infra/prow/cmd/manual-trigger) is an HTTP server that allows programmatic creation of ProwJobs outside the normal GitHub webhook flow.
Key Capabilities:
- injecting user metadata (e.g., AUTHOR) into jobs

When a user calls the /createcluster endpoint:

```shell
curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"
```
CIMaster performs the following flow:
```go
// 1. Construct Prow request
```
On the Prow side, the manual-trigger service:
```go
// 1. Receives the request
```

```text
┌──────────┐ ┌──────────┐ ┌──────────────┐ ┌────────────┐
```
The triggered ProwJob typically runs infrastructure-as-code (like Terraform or Ansible) to provision a new Kubernetes cluster, which is then added to CIMaster’s pool once ready.
Clusters transition through several states:
```text
┌───────────┐
```
CIMaster implements exponential backoff with jitter to handle concurrent allocation:
```go
type RandomBackoff struct {
```
Each operation retries up to 3 times with random 50-200ms backoff to avoid thundering herd problems.
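The effect of jitter is easy to see in a small sketch (illustrative, in Python rather than the service's Go): each retry sleeps a uniformly random 50-200 ms, so concurrent clients that conflicted at the same instant do not all retry at the same instant again.

```python
import random

def backoff_delays(retries=3, low_ms=50, high_ms=200, seed=None):
    """Generate one randomized backoff delay (in ms) per retry.
    A seed is accepted only to make the sketch reproducible."""
    rng = random.Random(seed)
    return [rng.uniform(low_ms, high_ms) for _ in range(retries)]
```

Two clients calling `backoff_delays()` get different delay sequences, spreading their retries apart instead of producing a thundering herd.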
A background goroutine continuously checks for expired holds:
```go
func (cm *ClusterManager) runCronReleaseHeldEnvs() {
```
This ensures clusters don’t remain locked indefinitely if developers forget to release them.
CIMaster supports different cluster types:
- tess-ci: standard CI test clusters
- tnet-ci: network-specific test clusters with OS image selection

Allocation respects purpose and OS image requirements:

```go
if cluster.Purpose != purpose {
```
Protected endpoints use a simple file-based authorization:
```go
func checkUser(h http.HandlerFunc, users []string) http.HandlerFunc {
```
Admin users are loaded from /botadmin/users file (semicolon-separated).
Prometheus metrics are exposed on /metrics.

```shell
# Get a vacant cluster for build #123
```

```shell
# Hold cluster for investigation
```

```shell
# Trigger cluster creation via Prow
```
For programmatic access:
```shell
curl -H "Accept: application/json" "http://cimaster:8080/getvacant?build=123&job=test"
```
Response:
```json
{
```
CIMaster runs as a Kubernetes Deployment with 3 replicas for high availability:
```yaml
apiVersion: apps/v1
```
At eBay’s TESS platform, CIMaster manages:
Potential improvements being considered:
CIMaster demonstrates how a relatively simple coordination service can solve complex resource management challenges in CI/CD infrastructure. By combining:
…it provides a robust foundation for shared test infrastructure at scale.
The integration with Prow’s manual-trigger component is particularly elegant—CIMaster doesn’t need to know how to create clusters, only when to request them. This separation of concerns allows infrastructure teams to evolve cluster provisioning strategies independently.
Whether you’re building CI infrastructure for a large organization or looking to optimize resource utilization in your Kubernetes platform, the patterns demonstrated by CIMaster offer valuable insights into distributed system coordination.
- tess.io/contrib/cimaster
- /Users/tashen/test-infra/prow/cmd/manual-trigger

This article explores the internal architecture of CIMaster, a production cluster coordination service. All code examples are from the actual implementation.