Understanding CIMaster: Intelligent CI Cluster Coordination at Scale
In modern cloud-native development, continuous integration (CI) pipelines are the backbone of software delivery. At scale, managing shared test infrastructure becomes a critical challenge. This is where CIMaster comes in—a sophisticated cluster management service designed to coordinate access to shared CI test clusters, ensuring efficient resource utilization and preventing test conflicts.
The Problem: Shared CI Infrastructure at Scale
In large organizations running hundreds or thousands of CI jobs daily, test clusters are expensive resources that need to be shared efficiently. Key challenges include:
- Resource Contention: Multiple CI jobs competing for limited test clusters
- Cluster State Management: Tracking which clusters are available, occupied, or held for debugging
- Manual Intervention: Developers needing to hold clusters for investigation without blocking others
- Dynamic Provisioning: Creating new clusters on-demand when capacity is insufficient
- Lifecycle Management: Automatically releasing clusters after use or expiration
CIMaster addresses all these challenges through a centralized coordination service.
Architecture Overview
CIMaster is a Kubernetes-native service written in Go that provides a REST API for cluster lifecycle management. It consists of several key components:
Core Components
1. Cluster Manager (cluster-manager.go)
The heart of the system, responsible for:
- Cluster Allocation: Finding and assigning vacant clusters to CI jobs
- Hold Management: Allowing developers to reserve clusters for debugging (with 6-hour expiration)
- Automatic Cleanup: Periodically releasing expired holds
- Integration with Prow: Triggering cluster creation through Prow’s manual-trigger endpoint
2. Cluster Operations (cluster-ops.go)
Implements the ClusterInterface with operations like:
- OccupyVacantCluster: Atomically allocate an available cluster
- FinishOccupiedCluster: Return a cluster to the available pool
- HoldCluster/ReleaseCluster: Manual hold management
- AddCluster/DeleteCluster: Cluster inventory management
3. State Persistence
All cluster state is stored in a Kubernetes ConfigMap (clusters in the ci namespace) as a JSON array of cluster records.
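As a sketch, one plausible shape for those records (the field names are assumptions based on the operations described in this article, not the actual schema):

```json
[
  {
    "name": "tess-ci-1",
    "purpose": "tess-ci",
    "state": "vacant",
    "osImage": "ubuntu-22.04"
  },
  {
    "name": "tess-ci-2",
    "purpose": "tess-ci",
    "state": "held",
    "heldBy": "jane",
    "holdExpiresAt": "2025-01-15T18:00:00Z"
  }
]
```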
Optimistic Locking prevents race conditions during concurrent updates using Kubernetes ResourceVersion.
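To make the optimistic-locking pattern concrete, here is a minimal, self-contained Go sketch that simulates ResourceVersion-based compare-and-swap in memory; the store type and function names are illustrative stand-ins, not CIMaster's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// store is an in-memory stand-in for the ConfigMap: every successful
// update bumps resourceVersion, and an update carrying a stale version
// is rejected, mirroring Kubernetes' optimistic concurrency control.
type store struct {
	data            map[string]string
	resourceVersion int
}

var errConflict = errors.New("conflict: stale resourceVersion")

// get returns a copy of the data plus the version it was read at.
func (s *store) get() (map[string]string, int) {
	copied := make(map[string]string, len(s.data))
	for k, v := range s.data {
		copied[k] = v
	}
	return copied, s.resourceVersion
}

// update succeeds only if no other writer has changed the store since
// the caller read it.
func (s *store) update(data map[string]string, version int) error {
	if version != s.resourceVersion {
		return errConflict // another writer got there first; caller must re-read and retry
	}
	s.data = data
	s.resourceVersion++
	return nil
}

// occupy retries the read-modify-write cycle until it wins the race.
func occupy(s *store, cluster, build string) error {
	for attempt := 0; attempt < 3; attempt++ {
		data, version := s.get()
		if data[cluster] != "vacant" {
			return fmt.Errorf("cluster %s is not vacant", cluster)
		}
		data[cluster] = "occupied by " + build
		if err := s.update(data, version); err == nil {
			return nil
		}
	}
	return errors.New("gave up after 3 attempts")
}

func main() {
	s := &store{data: map[string]string{"tess-ci-1": "vacant"}}
	fmt.Println(occupy(s, "tess-ci-1", "build-123"))
	fmt.Println(s.data["tess-ci-1"])
}
```

The real service performs the same cycle against the ConfigMap, with the API server rejecting writes whose ResourceVersion is stale.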
Integration with Prow’s Manual Trigger
One of CIMaster’s powerful features is its integration with Prow through the manual-trigger component. This enables dynamic cluster provisioning when existing capacity is insufficient.
What is Prow Manual Trigger?
Prow is Kubernetes' CI/CD system. The manual-trigger component (prow/cmd/manual-trigger in the test-infra repository) is an HTTP server that allows programmatic creation of ProwJobs outside the normal GitHub webhook flow.
Key Capabilities:
- Accepts HTTP POST requests with job specifications
- Creates ProwJob custom resources in Kubernetes
- Supports presubmit, postsubmit, and periodic job types
- Injects environment variables (like AUTHOR) into jobs
How CIMaster Uses Manual Trigger
When a user calls the /createcluster endpoint:

```shell
curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"
```
CIMaster constructs a Prow job specification from the request parameters and POSTs it to the manual-trigger endpoint.
On the Prow side, the manual-trigger service receives the request, validates the job specification, and creates the corresponding ProwJob custom resource in Kubernetes.
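A minimal sketch of that server-side flow, assuming a jobSpec shape and using a slice as a stand-in for ProwJob custom resources written to the cluster:

```go
package main

import (
	"fmt"
)

// jobSpec is an assumed shape for a manual-trigger request.
type jobSpec struct {
	Name string
	Type string // presubmit, postsubmit, or periodic
	Envs map[string]string
}

// createdProwJobs stands in for ProwJob custom resources in Kubernetes.
var createdProwJobs []jobSpec

// handleTrigger mirrors the manual-trigger flow: validate the job type,
// then create the ProwJob resource.
func handleTrigger(spec jobSpec) error {
	switch spec.Type {
	case "presubmit", "postsubmit", "periodic":
		// A real implementation would create a ProwJob CR via the Kubernetes API.
		createdProwJobs = append(createdProwJobs, spec)
		return nil
	default:
		return fmt.Errorf("unsupported job type %q", spec.Type)
	}
}

func main() {
	err := handleTrigger(jobSpec{
		Name: "e2e-k8s-1.32",
		Type: "periodic",
		Envs: map[string]string{"AUTHOR": "john"},
	})
	fmt.Println(err, len(createdProwJobs))
}
```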
Request-Response Flow
CI user → CIMaster (/createcluster) → manual-trigger (HTTP POST) → Prow (ProwJob created) → cluster provisioning job runs
The triggered ProwJob typically runs infrastructure-as-code (like Terraform or Ansible) to provision a new Kubernetes cluster, which is then added to CIMaster’s pool once ready.
Cluster Lifecycle State Machine
Clusters transition through several states:
vacant → occupied → vacant (normal CI allocation and release), with a side path vacant → held → vacant for manual debugging holds, which end on explicit release or after the 6-hour expiration.
Key Features and Implementation Details
1. Intelligent Allocation with Retry Logic
CIMaster retries contended operations with randomized backoff (jitter) to avoid repeated collisions during concurrent allocation.
Each operation retries up to 3 times with random 50-200ms backoff to avoid thundering herd problems.
2. Automatic Hold Expiration
A background goroutine periodically checks for and releases expired holds.
This ensures clusters don’t remain locked indefinitely if developers forget to release them.
3. Multi-Purpose Cluster Support
CIMaster supports different cluster types:
- tess-ci: Standard CI test clusters
- tnet-ci: Network-specific test clusters with OS image selection
Allocation respects both the cluster's purpose and, for tnet-ci clusters, the requested OS image.
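A sketch of that allocation filter, assuming a cluster struct with Purpose and OSImage fields (names are illustrative):

```go
package main

import "fmt"

// cluster mirrors the fields allocation filters on; names are assumptions.
type cluster struct {
	Name    string
	Purpose string // e.g. "tess-ci" or "tnet-ci"
	OSImage string
	Vacant  bool
}

// pickVacant returns the first vacant cluster matching the purpose and,
// when one is requested, the OS image.
func pickVacant(clusters []cluster, purpose, osImage string) (string, bool) {
	for _, c := range clusters {
		if !c.Vacant || c.Purpose != purpose {
			continue // wrong purpose or already in use
		}
		if osImage != "" && c.OSImage != osImage {
			continue // tnet-ci callers can demand a specific OS image
		}
		return c.Name, true
	}
	return "", false
}

func main() {
	pool := []cluster{
		{Name: "a", Purpose: "tess-ci", Vacant: false},
		{Name: "b", Purpose: "tnet-ci", OSImage: "ubuntu-22.04", Vacant: true},
		{Name: "c", Purpose: "tess-ci", Vacant: true},
	}
	fmt.Println(pickVacant(pool, "tess-ci", ""))
	fmt.Println(pickVacant(pool, "tnet-ci", "ubuntu-22.04"))
}
```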
4. Admin Authorization
Protected endpoints use a simple file-based authorization scheme.
Admin users are loaded from /botadmin/users file (semicolon-separated).
5. Observability
- Prometheus Metrics: Exposed on port 8090 (/metrics)
- Structured Logging: All operations logged with correlation IDs
- Graceful Shutdown: 120-second grace period to handle in-flight requests
API Examples
Allocating a Cluster for CI
```shell
# Get a vacant cluster for build #123
curl "http://cimaster:8080/getvacant?build=123&job=test"
```
Debugging Workflow
```shell
# Hold cluster for investigation (endpoint path illustrative)
curl "http://cimaster:8080/holdcluster?cluster=tess-ci-1&user=jane"
```
Creating a New Cluster
```shell
# Trigger cluster creation via Prow
curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"
```
JSON API Support
For programmatic access:
```shell
curl -H "Accept: application/json" "http://cimaster:8080/getvacant?build=123&job=test"
```
The response is a JSON object describing the allocated cluster.
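The exact schema isn't shown here, but it might look something like this (field names are assumptions):

```json
{
  "cluster": "tess-ci-1",
  "build": "123",
  "job": "test"
}
```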
Deployment
CIMaster runs as a Kubernetes Deployment with 3 replicas for high availability.
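A sketch of such a manifest, with the image name and labels assumed; the replica count, ports, and termination grace period follow the figures given in this article:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cimaster
  namespace: ci
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cimaster
  template:
    metadata:
      labels:
        app: cimaster
    spec:
      terminationGracePeriodSeconds: 120  # matches the documented graceful-shutdown window
      containers:
        - name: cimaster
          image: cimaster:latest  # image name assumed
          ports:
            - containerPort: 8080  # REST API
            - containerPort: 8090  # Prometheus metrics
```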
Performance Characteristics
- Allocation Latency: ~100-300ms (includes ConfigMap read-write cycle)
- Retry Overhead: 50-200ms per retry (max 3 attempts)
- Hold Expiration Check: Every 10 minutes (default)
- Concurrency: Safe for multiple replicas via optimistic locking
Real-World Impact
At eBay’s TESS platform, CIMaster manages:
- 20+ shared test clusters across multiple regions
- Hundreds of CI jobs daily from various teams
- 6-hour automatic hold expiration preventing resource lock-ups
- Sub-second allocation for most requests
- Dynamic scaling through Prow integration
Future Enhancements
Potential improvements being considered:
- Priority Queues: Allow critical jobs to jump the allocation queue
- Cluster Health Checks: Automatic disabling of unhealthy clusters
- Usage Analytics: Track allocation patterns and optimize capacity
- Webhook Notifications: Slack/email alerts for hold expirations
- Multi-Cluster Federation: Coordinate across multiple Kubernetes clusters
Conclusion
CIMaster demonstrates how a relatively simple coordination service can solve complex resource management challenges in CI/CD infrastructure. By combining:
- Stateful cluster tracking in Kubernetes ConfigMaps
- Optimistic locking for safe concurrent access
- Automatic expiration for abandoned holds
- Prow integration for dynamic provisioning
- REST API for easy integration
…it provides a robust foundation for shared test infrastructure at scale.
The integration with Prow’s manual-trigger component is particularly elegant—CIMaster doesn’t need to know how to create clusters, only when to request them. This separation of concerns allows infrastructure teams to evolve cluster provisioning strategies independently.
Whether you’re building CI infrastructure for a large organization or looking to optimize resource utilization in your Kubernetes platform, the patterns demonstrated by CIMaster offer valuable insights into distributed system coordination.
Links and Resources
- CIMaster Repository: tess.io/contrib/cimaster
- Prow Documentation: k8s.io/test-infra/prow
- Manual Trigger Component: prow/cmd/manual-trigger (in the test-infra repository)
- API Guide: doc/guide.md (in the CIMaster repository)
This article explores the internal architecture of CIMaster, a production cluster coordination service. All code examples are from the actual implementation.