Understanding CIMaster: Intelligent CI Cluster Coordination at Scale

In modern cloud-native development, continuous integration (CI) pipelines are the backbone of software delivery. At scale, managing shared test infrastructure becomes a critical challenge. This is where CIMaster comes in—a sophisticated cluster management service designed to coordinate access to shared CI test clusters, ensuring efficient resource utilization and preventing test conflicts.

The Problem: Shared CI Infrastructure at Scale

In large organizations running hundreds or thousands of CI jobs daily, test clusters are expensive resources that need to be shared efficiently. Key challenges include:

  1. Resource Contention: Multiple CI jobs competing for limited test clusters
  2. Cluster State Management: Tracking which clusters are available, occupied, or held for debugging
  3. Manual Intervention: Developers needing to hold clusters for investigation without blocking others
  4. Dynamic Provisioning: Creating new clusters on-demand when capacity is insufficient
  5. Lifecycle Management: Automatically releasing clusters after use or expiration

CIMaster addresses all these challenges through a centralized coordination service.

Architecture Overview

CIMaster is a Kubernetes-native service written in Go that provides a REST API for cluster lifecycle management. It consists of several key components:

Core Components

┌─────────────────────────────────────────────────────────────┐
│                       CIMaster Service                      │
│                                                             │
│  ┌──────────────────┐        ┌─────────────────────────┐    │
│  │ HTTP API Server  │        │ Cluster Manager         │    │
│  │ (Port 8080)      │◄──────►│ - State Management      │    │
│  │                  │        │ - Allocation Logic      │    │
│  │ /getvacant       │        │ - Hold Expiration       │    │
│  │ /holdcluster     │        │                         │    │
│  │ /releasecluster  │        │                         │    │
│  │ /createcluster   │        └───────────┬─────────────┘    │
│  │ ...              │                    │                  │
│  └──────────────────┘                    │                  │
│                                          │                  │
│  ┌──────────────────┐        ┌───────────▼─────────────┐    │
│  │ Metrics Server   │        │ Kubernetes ConfigMap    │    │
│  │ (Port 8090)      │        │ - cluster.json          │    │
│  │                  │        │ - Optimistic Locking    │    │
│  └──────────────────┘        └─────────────────────────┘    │
│                                                             │
└────────────────┬────────────────────────────────────────────┘
                 │
                 │ HTTP POST
                 ▼
      ┌───────────────────────┐
      │  Prow Manual Trigger  │
      │  /manual-trigger      │
      └───────────────────────┘

1. Cluster Manager (cluster-manager.go)

The heart of the system, responsible for:

  • Cluster Allocation: Finding and assigning vacant clusters to CI jobs
  • Hold Management: Allowing developers to reserve clusters for debugging (with 6-hour expiration)
  • Automatic Cleanup: Periodically releasing expired holds
  • Integration with Prow: Triggering cluster creation through Prow’s manual-trigger endpoint

2. Cluster Operations (cluster-ops.go)

Implements the ClusterInterface with operations like:

  • OccupyVacantCluster: Atomically allocate an available cluster
  • FinishOccupiedCluster: Return a cluster to the available pool
  • HoldCluster/ReleaseCluster: Manual hold management
  • AddCluster/DeleteCluster: Cluster inventory management

3. State Persistence

All cluster state is stored in a Kubernetes ConfigMap (clusters in the ci namespace):

[
  {
    "name": "cluster-01",
    "region": "us-west",
    "status": "testing",
    "lastJob": "e2e-conformance",
    "lastBuild": "12345",
    "lastTriggerName": "john",
    "hold": false,
    "disabled": false,
    "purpose": "tess-ci",
    "osimage": "centos-atomic-7.6.1810-qcow2"
  }
]

Optimistic Locking prevents race conditions during concurrent updates using Kubernetes ResourceVersion.

Integration with Prow’s Manual Trigger

One of CIMaster’s powerful features is its integration with Prow through the manual-trigger component. This enables dynamic cluster provisioning when existing capacity is insufficient.

What is Prow Manual Trigger?

Prow is Kubernetes’ CI/CD system. The manual-trigger component (under prow/cmd/manual-trigger in the test-infra repository) is an HTTP server that allows programmatic creation of ProwJobs outside the normal GitHub webhook flow.

Key Capabilities:

  • Accepts HTTP POST requests with job specifications
  • Creates ProwJob custom resources in Kubernetes
  • Supports presubmit, postsubmit, and periodic job types
  • Injects environment variables (like AUTHOR) into jobs

How CIMaster Uses Manual Trigger

When a user calls CIMaster’s /createcluster endpoint:

curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"

CIMaster performs the following flow:

// 1. Construct Prow request
prowRequest := types.ProwManualTriggerRequest{
	Org:      "tess",
	Repo:     "tessops",
	BaseRef:  "master",       // from branch parameter
	ProwType: "postsubmit",
	ProwJob:  "e2e-k8s-1.32", // cluster creation job
	User:     "john",         // sets AUTHOR env var
}

// 2. Send to Prow manual-trigger endpoint
resp, err := http.Post(
	"https://prow.tess.io/manual-trigger",
	"application/json",
	requestBody,
)

// 3. Return status to caller

On the Prow side, the manual-trigger service:

// 1. Receives the request
func (s *server) handleManualTrigger(w http.ResponseWriter, r *http.Request) {
	var req triggerRequest
	json.NewDecoder(r.Body).Decode(&req)

	// 2. Looks up the job definition from config
	postsubmits := cfg.PostsubmitsStatic[req.Org+"/"+req.Repo]
	for _, p := range postsubmits {
		if p.Name == req.ProwJob {
			prowJob = createProwJobFromPostsubmit(p, req)
			break
		}
	}

	// 3. Injects AUTHOR environment variable
	if req.User != "" {
		addAuthorEnvToProwJob(prowJob, req.User)
	}

	// 4. Creates ProwJob in Kubernetes
	prowJobClient.Create(ctx, prowJob, metav1.CreateOptions{})

	// 5. Waits for BuildID and returns status link
	statusLink := fmt.Sprintf("https://prow.tess.io/prowjob?prowjob=%s", prowJob.Name)
	logLink := fmt.Sprintf("https://prow.tess.io/log?job=%s&id=%s", req.ProwJob, buildID)
}

Request-Response Flow

┌──────────┐         ┌──────────┐         ┌──────────────┐         ┌────────────┐
│   User   │         │ CIMaster │         │ Prow Manual  │         │ Kubernetes │
│          │         │          │         │   Trigger    │         │            │
└────┬─────┘         └────┬─────┘         └──────┬───────┘         └─────┬──────┘
     │                    │                      │                       │
     │ POST /createcluster│                      │                       │
     │ user=john          │                      │                       │
     ├───────────────────►│                      │                       │
     │                    │                      │                       │
     │                    │ POST /manual-trigger │                       │
     │                    │ {org, repo, prowjob} │                       │
     │                    ├─────────────────────►│                       │
     │                    │                      │                       │
     │                    │                      │ Create ProwJob        │
     │                    │                      │ with AUTHOR=john      │
     │                    │                      ├──────────────────────►│
     │                    │                      │                       │
     │                    │                      │◄──────────────────────┤
     │                    │                      │ ProwJob Created       │
     │                    │                      │                       │
     │                    │◄─────────────────────┤                       │
     │                    │ {success, job_name,  │                       │
     │◄───────────────────┤  status_link}        │                       │
     │ cluster creation   │                      │                       │
     │ triggered          │                      │                       │
     │                    │                      │                       │

The triggered ProwJob typically runs infrastructure-as-code (like Terraform or Ansible) to provision a new Kubernetes cluster, which is then added to CIMaster’s pool once ready.

Cluster Lifecycle State Machine

Clusters transition through several states:

┌───────────┐
│ finished  │ ◄───────────────────┐
│ (vacant)  │                     │
└─────┬─────┘                     │
      │                           │
      │ /getvacant                │
      │                           │
      ▼                           │
┌──────────┐                      │
│ testing  │                      │
│(occupied)│                      │
└────┬─────┘                      │
     │                            │
     │ /finishtest                │
     │                            │
     └────────────────────────────┘

Hold State (overlay):
┌─────────┐
│ hold=   │
│ false   │◄──── /releasecluster ────┐
└────┬────┘                          │
     │                               │
     │ /holdcluster                  │
     │                               │
     ▼                               │
┌─────────┐                          │
│ hold=   │                          │
│ true    │──────────────────────────┘
└─────────┘   (auto-expires in 6h)

Key Features and Implementation Details

1. Intelligent Allocation with Retry Logic

CIMaster retries contested operations with randomized backoff to handle concurrent allocation:

type RandomBackoff struct {
	MinBackoff time.Duration
	MaxBackoff time.Duration
	rng        *rand.Rand
}

func (rb *RandomBackoff) GetRetryInterval() time.Duration {
	delta := rb.MaxBackoff - rb.MinBackoff
	return rb.MinBackoff + time.Duration(rb.rng.Int63n(int64(delta)+1))
}
```

Each operation retries up to 3 times with random 50-200ms backoff to avoid thundering herd problems.

2. Automatic Hold Expiration

A background goroutine continuously checks for expired holds:

func (cm *ClusterManager) runCronReleaseHeldEnvs() {
	timer := time.NewTimer(0)
	<-timer.C
	for {
		durationUntilNextExpire, err := cm.clearExpiredHolds()
		if err != nil {
			log.Printf("clearing expired holds: %v", err)
		}
		timer.Reset(durationUntilNextExpire)
		<-timer.C
	}
}

This ensures clusters don’t remain locked indefinitely if developers forget to release them.

3. Multi-Purpose Cluster Support

CIMaster supports different cluster types:

  • tess-ci: Standard CI test clusters
  • tnet-ci: Network-specific test clusters with OS image selection

Allocation respects purpose and OS image requirements:

if cluster.Purpose != purpose {
	continue // skip incompatible clusters
}
if cluster.Purpose == TnetCI && cluster.OSImage != osimage {
	continue // skip wrong OS image
}

4. Admin Authorization

Protected endpoints use a simple file-based authorization:

func checkUser(h http.HandlerFunc, users []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		userName := r.URL.Query().Get("name")
		if !contains(users, userName) {
			fmt.Fprintf(w, "user %s is not authorized", userName)
			return
		}
		h(w, r)
	}
}

Admin users are loaded from /botadmin/users file (semicolon-separated).
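Parsing that file is a one-liner worth getting right: trailing separators and stray whitespace should not produce phantom users. A plausible sketch (the function name and exact trimming rules are assumptions, not the actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// parseAdminUsers splits the semicolon-separated contents of the
// /botadmin/users file into a user list, tolerating surrounding
// whitespace and trailing separators.
func parseAdminUsers(raw string) []string {
	var users []string
	for _, u := range strings.Split(raw, ";") {
		if u = strings.TrimSpace(u); u != "" {
			users = append(users, u)
		}
	}
	return users
}

func main() {
	fmt.Println(parseAdminUsers("alice; bob;carol;"))
}
```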

5. Observability

  • Prometheus Metrics: Exposed on port 8090 (/metrics)
  • Structured Logging: All operations logged with correlation IDs
  • Graceful Shutdown: 120-second grace period to handle in-flight requests

API Examples

Allocating a Cluster for CI

# Get a vacant cluster for build #123
CLUSTER=$(curl -s "http://cimaster:8080/getvacant?build=123&job=e2e-test&email=ci-bot@ebay.com")
echo "Using cluster: $CLUSTER"

# Run tests...

# Return cluster to pool
curl "http://cimaster:8080/finishtest?cluster=$CLUSTER"

Debugging Workflow

# Hold cluster for investigation
curl "http://cimaster:8080/holdcluster?cluster=cluster-05&name=alice&desc=debugging+network+issue"

# Investigate...
kubectl get pods -n test-namespace

# Release when done
curl "http://cimaster:8080/releasecluster?cluster=cluster-05&name=alice"

Creating a New Cluster

# Trigger cluster creation via Prow
curl "http://cimaster:8080/createcluster?user=alice&branch=master&job=e2e-k8s-1.32"
# Response: cluster creation triggered successfully: {...}

# Monitor Prow job
# https://prow.tess.io/prowjob?prowjob=<job-name>

# Once ready, admin adds it to the pool
curl "http://cimaster:8080/addcluster?cluster=cluster-20&region=eu-central&name=admin"

JSON API Support

For programmatic access:

curl -H "Accept: application/json" "http://cimaster:8080/getvacant?build=123&job=test"

Response:

{
  "name": "cluster-07",
  "region": "us-west",
  "status": "testing",
  "lastBuild": "123",
  "lastJob": "test",
  "lastTriggerName": "john",
  "purpose": "tess-ci"
}

Deployment

CIMaster runs as a Kubernetes Deployment with 3 replicas for high availability:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cimaster
  namespace: ci
spec:
  replicas: 3
  minReadySeconds: 90
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      containers:
      - name: cimaster
        image: hub.tess.io/tess/cimaster:v0.0.37
        command:
        - cimaster
        - --manageCluster=true
        - --cluster-config-map=clusters
        - --botAdminDir=/botadmin
        - --prow-url=https://prow.tess.io/manual-trigger
        - --default-prow-job=e2e-k8s-1.32
        ports:
        - containerPort: 8080 # API
        - containerPort: 8090 # Metrics

Performance Characteristics

  • Allocation Latency: ~100-300ms (includes ConfigMap read-write cycle)
  • Retry Overhead: 50-200ms per retry (max 3 attempts)
  • Hold Expiration Check: Every 10 minutes (default)
  • Concurrency: Safe for multiple replicas via optimistic locking

Real-World Impact

At eBay’s TESS platform, CIMaster manages:

  • 20+ shared test clusters across multiple regions
  • Hundreds of CI jobs daily from various teams
  • 6-hour automatic hold expiration preventing resource lock-ups
  • Sub-second allocation for most requests
  • Dynamic scaling through Prow integration

Future Enhancements

Potential improvements being considered:

  1. Priority Queues: Allow critical jobs to jump the allocation queue
  2. Cluster Health Checks: Automatic disabling of unhealthy clusters
  3. Usage Analytics: Track allocation patterns and optimize capacity
  4. Webhook Notifications: Slack/email alerts for hold expirations
  5. Multi-Cluster Federation: Coordinate across multiple Kubernetes clusters

Conclusion

CIMaster demonstrates how a relatively simple coordination service can solve complex resource management challenges in CI/CD infrastructure. By combining:

  • Stateful cluster tracking in Kubernetes ConfigMaps
  • Optimistic locking for safe concurrent access
  • Automatic expiration for abandoned holds
  • Prow integration for dynamic provisioning
  • REST API for easy integration

…it provides a robust foundation for shared test infrastructure at scale.

The integration with Prow’s manual-trigger component is particularly elegant—CIMaster doesn’t need to know how to create clusters, only when to request them. This separation of concerns allows infrastructure teams to evolve cluster provisioning strategies independently.

Whether you’re building CI infrastructure for a large organization or looking to optimize resource utilization in your Kubernetes platform, the patterns demonstrated by CIMaster offer valuable insights into distributed system coordination.


This article explores the internal architecture of CIMaster, a production cluster coordination service. All code examples are from the actual implementation.
