Understanding CIMaster: Intelligent CI Cluster Coordination at Scale
In modern cloud-native development, continuous integration (CI) pipelines are the backbone of software delivery. At scale, managing shared test infrastructure becomes a critical challenge. This is where CIMaster comes in—a sophisticated cluster management service designed to coordinate access to shared CI test clusters, ensuring efficient resource utilization and preventing test conflicts.
The Problem: Shared CI Infrastructure at Scale
In large organizations running hundreds or thousands of CI jobs daily, test clusters are expensive resources that need to be shared efficiently. Key challenges include:
- Resource Contention: Multiple CI jobs competing for limited test clusters
- Cluster State Management: Tracking which clusters are available, occupied, or held for debugging
- Manual Intervention: Developers needing to hold clusters for investigation without blocking others
- Dynamic Provisioning: Creating new clusters on-demand when capacity is insufficient
- Lifecycle Management: Automatically releasing clusters after use or expiration
CIMaster addresses all these challenges through a centralized coordination service.
Architecture Overview
CIMaster is a Kubernetes-native service written in Go that provides a REST API for cluster lifecycle management. It consists of several key components:
Core Components
1. Cluster Manager (cluster-manager.go)
The heart of the system, responsible for:
- Cluster Allocation: Finding and assigning vacant clusters to CI jobs
- Hold Management: Allowing developers to reserve clusters for debugging (with 6-hour expiration)
- Automatic Cleanup: Periodically releasing expired holds
- Integration with Prow: Triggering cluster creation through Prow’s manual-trigger endpoint
2. Cluster Operations (cluster-ops.go)
Implements the ClusterInterface with operations like:
- OccupyVacantCluster: Atomically allocate an available cluster
- FinishOccupiedCluster: Return a cluster to the available pool
- HoldCluster/ReleaseCluster: Manual hold management
- AddCluster/DeleteCluster: Cluster inventory management
3. State Persistence
All cluster state is stored in a Kubernetes ConfigMap (clusters in the ci namespace) as a JSON array of cluster records.
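As a sketch, one plausible shape for those records (the field names are assumptions based on the operations described in this article, not the actual schema):

```json
[
  {
    "name": "tess-ci-1",
    "purpose": "tess-ci",
    "state": "vacant",
    "osImage": "ubuntu-22.04"
  },
  {
    "name": "tess-ci-2",
    "purpose": "tess-ci",
    "state": "held",
    "heldBy": "jane",
    "holdExpiresAt": "2025-01-15T18:00:00Z"
  }
]
```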
Optimistic Locking prevents race conditions during concurrent updates using Kubernetes ResourceVersion.
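To make the optimistic-locking pattern concrete, here is a minimal, self-contained Go sketch that simulates ResourceVersion-based compare-and-swap in memory; the store type and function names are illustrative stand-ins, not CIMaster's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// store is an in-memory stand-in for the ConfigMap: every successful
// update bumps resourceVersion, and an update carrying a stale version
// is rejected, mirroring Kubernetes' optimistic concurrency control.
type store struct {
	data            map[string]string
	resourceVersion int
}

var errConflict = errors.New("conflict: stale resourceVersion")

// get returns a copy of the data plus the version it was read at.
func (s *store) get() (map[string]string, int) {
	copied := make(map[string]string, len(s.data))
	for k, v := range s.data {
		copied[k] = v
	}
	return copied, s.resourceVersion
}

// update succeeds only if no other writer has changed the store since
// the caller read it.
func (s *store) update(data map[string]string, version int) error {
	if version != s.resourceVersion {
		return errConflict // another writer got there first; caller must re-read and retry
	}
	s.data = data
	s.resourceVersion++
	return nil
}

// occupy retries the read-modify-write cycle until it wins the race.
func occupy(s *store, cluster, build string) error {
	for attempt := 0; attempt < 3; attempt++ {
		data, version := s.get()
		if data[cluster] != "vacant" {
			return fmt.Errorf("cluster %s is not vacant", cluster)
		}
		data[cluster] = "occupied by " + build
		if err := s.update(data, version); err == nil {
			return nil
		}
	}
	return errors.New("gave up after 3 attempts")
}

func main() {
	s := &store{data: map[string]string{"tess-ci-1": "vacant"}}
	fmt.Println(occupy(s, "tess-ci-1", "build-123"))
	fmt.Println(s.data["tess-ci-1"])
}
```

The real service performs the same cycle against the ConfigMap, with the API server rejecting writes whose ResourceVersion is stale.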
Integration with Prow’s Manual Trigger
One of CIMaster’s powerful features is its integration with Prow through the manual-trigger component. This enables dynamic cluster provisioning when existing capacity is insufficient.
What is Prow Manual Trigger?
Prow is Kubernetes' CI/CD system. The manual-trigger component (prow/cmd/manual-trigger in the test-infra repository) is an HTTP server that allows programmatic creation of ProwJobs outside the normal GitHub webhook flow.
Key Capabilities:
- Accepts HTTP POST requests with job specifications
- Creates ProwJob custom resources in Kubernetes
- Supports presubmit, postsubmit, and periodic job types
- Injects environment variables (like AUTHOR) into jobs
How CIMaster Uses Manual Trigger
When a user calls the /createcluster endpoint:

```shell
curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"
```
CIMaster constructs a Prow job specification from the request parameters and POSTs it to the manual-trigger endpoint.
On the Prow side, the manual-trigger service receives the request, validates the job specification, and creates the corresponding ProwJob custom resource in Kubernetes.
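A minimal sketch of that server-side flow, assuming a jobSpec shape and using a slice as a stand-in for ProwJob custom resources written to the cluster:

```go
package main

import (
	"fmt"
)

// jobSpec is an assumed shape for a manual-trigger request.
type jobSpec struct {
	Name string
	Type string // presubmit, postsubmit, or periodic
	Envs map[string]string
}

// createdProwJobs stands in for ProwJob custom resources in Kubernetes.
var createdProwJobs []jobSpec

// handleTrigger mirrors the manual-trigger flow: validate the job type,
// then create the ProwJob resource.
func handleTrigger(spec jobSpec) error {
	switch spec.Type {
	case "presubmit", "postsubmit", "periodic":
		// A real implementation would create a ProwJob CR via the Kubernetes API.
		createdProwJobs = append(createdProwJobs, spec)
		return nil
	default:
		return fmt.Errorf("unsupported job type %q", spec.Type)
	}
}

func main() {
	err := handleTrigger(jobSpec{
		Name: "e2e-k8s-1.32",
		Type: "periodic",
		Envs: map[string]string{"AUTHOR": "john"},
	})
	fmt.Println(err, len(createdProwJobs))
}
```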
Request-Response Flow
CI user → CIMaster (/createcluster) → manual-trigger (HTTP POST) → Prow (ProwJob created) → cluster provisioning job runs
The triggered ProwJob typically runs infrastructure-as-code (like Terraform or Ansible) to provision a new Kubernetes cluster, which is then added to CIMaster’s pool once ready.
Cluster Lifecycle State Machine
Clusters transition through several states:
vacant → occupied → vacant (normal CI allocation and release), with a side path vacant → held → vacant for manual debugging holds, which end on explicit release or after the 6-hour expiration.
Key Features and Implementation Details
1. Intelligent Allocation with Retry Logic
CIMaster retries contended operations with randomized backoff (jitter) to avoid repeated collisions during concurrent allocation.
Each operation retries up to 3 times with random 50-200ms backoff to avoid thundering herd problems.
2. Automatic Hold Expiration
A background goroutine periodically checks for and releases expired holds.
This ensures clusters don’t remain locked indefinitely if developers forget to release them.
3. Multi-Purpose Cluster Support
CIMaster supports different cluster types:
- tess-ci: Standard CI test clusters
- tnet-ci: Network-specific test clusters with OS image selection
Allocation respects both the cluster's purpose and, for tnet-ci clusters, the requested OS image.
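A sketch of that allocation filter, assuming a cluster struct with Purpose and OSImage fields (names are illustrative):

```go
package main

import "fmt"

// cluster mirrors the fields allocation filters on; names are assumptions.
type cluster struct {
	Name    string
	Purpose string // e.g. "tess-ci" or "tnet-ci"
	OSImage string
	Vacant  bool
}

// pickVacant returns the first vacant cluster matching the purpose and,
// when one is requested, the OS image.
func pickVacant(clusters []cluster, purpose, osImage string) (string, bool) {
	for _, c := range clusters {
		if !c.Vacant || c.Purpose != purpose {
			continue // wrong purpose or already in use
		}
		if osImage != "" && c.OSImage != osImage {
			continue // tnet-ci callers can demand a specific OS image
		}
		return c.Name, true
	}
	return "", false
}

func main() {
	pool := []cluster{
		{Name: "a", Purpose: "tess-ci", Vacant: false},
		{Name: "b", Purpose: "tnet-ci", OSImage: "ubuntu-22.04", Vacant: true},
		{Name: "c", Purpose: "tess-ci", Vacant: true},
	}
	fmt.Println(pickVacant(pool, "tess-ci", ""))
	fmt.Println(pickVacant(pool, "tnet-ci", "ubuntu-22.04"))
}
```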
4. Admin Authorization
Protected endpoints use a simple file-based authorization scheme.
Admin users are loaded from /botadmin/users file (semicolon-separated).
5. Observability
- Prometheus Metrics: Exposed on port 8090 (/metrics)
- Structured Logging: All operations logged with correlation IDs
- Graceful Shutdown: 120-second grace period to handle in-flight requests
API Examples
Allocating a Cluster for CI
```shell
# Get a vacant cluster for build #123
curl "http://cimaster:8080/getvacant?build=123&job=test"
```
Debugging Workflow
```shell
# Hold cluster for investigation (endpoint path illustrative)
curl "http://cimaster:8080/holdcluster?cluster=tess-ci-1&user=jane"
```
Creating a New Cluster
```shell
# Trigger cluster creation via Prow
curl "http://cimaster:8080/createcluster?user=john&branch=master&job=e2e-k8s-1.32"
```
JSON API Support
For programmatic access:
```shell
curl -H "Accept: application/json" "http://cimaster:8080/getvacant?build=123&job=test"
```
The response is a JSON object describing the allocated cluster.
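The exact schema isn't shown here, but it might look something like this (field names are assumptions):

```json
{
  "cluster": "tess-ci-1",
  "build": "123",
  "job": "test"
}
```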
Deployment
CIMaster runs as a Kubernetes Deployment with 3 replicas for high availability.
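A sketch of such a manifest, with the image name and labels assumed; the replica count, ports, and termination grace period follow the figures given in this article:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cimaster
  namespace: ci
spec:
  replicas: 3
  selector:
    matchLabels:
      app: cimaster
  template:
    metadata:
      labels:
        app: cimaster
    spec:
      terminationGracePeriodSeconds: 120  # matches the documented graceful-shutdown window
      containers:
        - name: cimaster
          image: cimaster:latest  # image name assumed
          ports:
            - containerPort: 8080  # REST API
            - containerPort: 8090  # Prometheus metrics
```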
Performance Characteristics
- Allocation Latency: ~100-300ms (includes ConfigMap read-write cycle)
- Retry Overhead: 50-200ms per retry (max 3 attempts)
- Hold Expiration Check: Every 10 minutes (default)
- Concurrency: Safe for multiple replicas via optimistic locking
Real-World Impact
At eBay’s TESS platform, CIMaster manages:
- 20+ shared test clusters across multiple regions
- Hundreds of CI jobs daily from various teams
- 6-hour automatic hold expiration preventing resource lock-ups
- Sub-second allocation for most requests
- Dynamic scaling through Prow integration
Future Enhancements
Potential improvements being considered:
- Priority Queues: Allow critical jobs to jump the allocation queue
- Cluster Health Checks: Automatic disabling of unhealthy clusters
- Usage Analytics: Track allocation patterns and optimize capacity
- Webhook Notifications: Slack/email alerts for hold expirations
- Multi-Cluster Federation: Coordinate across multiple Kubernetes clusters
Conclusion
CIMaster demonstrates how a relatively simple coordination service can solve complex resource management challenges in CI/CD infrastructure. By combining:
- Stateful cluster tracking in Kubernetes ConfigMaps
- Optimistic locking for safe concurrent access
- Automatic expiration for abandoned holds
- Prow integration for dynamic provisioning
- REST API for easy integration
…it provides a robust foundation for shared test infrastructure at scale.
The integration with Prow’s manual-trigger component is particularly elegant—CIMaster doesn’t need to know how to create clusters, only when to request them. This separation of concerns allows infrastructure teams to evolve cluster provisioning strategies independently.
Whether you’re building CI infrastructure for a large organization or looking to optimize resource utilization in your Kubernetes platform, the patterns demonstrated by CIMaster offer valuable insights into distributed system coordination.
Links and Resources
- CIMaster Repository: tess.io/contrib/cimaster
- Prow Documentation: k8s.io/test-infra/prow
- Manual Trigger Component: prow/cmd/manual-trigger (in the test-infra repository)
- API Guide: doc/guide.md (in the CIMaster repository)
This article explores the internal architecture of CIMaster, a production cluster coordination service. All code examples are from the actual implementation.