Migrating a Kubernetes Cluster to Cgroup V2: Requirements, Enablement, and Compatibility
Cgroup V2 offers a unified hierarchy, better IO QoS — including buffer IO throttling — and cleaner semantics compared to V1. This post documents the end-to-end process of migrating a production Kubernetes cluster to cgroup v2: component version requirements, kernel boot parameters, and compatibility verification results for CPU, memory, PID, hugetlb, and IO controllers.
Why Migrate to Cgroup V2
Problems with Cgroup V1
Cgroup V1 allows multiple independent hierarchies to coexist, one per controller. This leads to:
- The same process appearing at different positions in different hierarchies
- Inconsistent controller behavior that is difficult to reason about
- Double-counting in resource statistics
What Cgroup V2 Improves
V2 merges all controllers into a single unified hierarchy:
    /sys/fs/cgroup/
Key improvements:
- Unified hierarchy: all controllers under one directory tree
- Buffer IO throttling: V1 only throttles Direct IO; V2 `io.max` applies to Buffer IO as well
- IO weight scheduling: `io.weight` proportional bandwidth allocation with BFQ/CFQ schedulers
- iocost model: latency-based IO cost model for smarter QoS control
The primary driver for migrating Kubernetes clusters to cgroup v2 is disk IO QoS — particularly the ability to throttle buffer IO, which is critical for improving resource utilization on nodes shared by mixed workloads.
Component Version Requirements
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| Linux Kernel | 4.15 | 5.2+ | 5.6+ required for hugetlb controller |
| Kubernetes | 1.19 | 1.25+ | Official cgroup v2 support from 1.19 |
| containerd | 1.4.4 | 1.6+ | Must support unified cgroup hierarchy |
| runc | v1.0.0-rc91 | v1.0.0+ | Full unified mode since rc93 |
| systemd | 244 | 249+ | cpuset controller delegation from 244 |
| cAdvisor | v0.36.1 | v0.45+ | cgroup v2 metrics from v0.36.1 |
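To check a node against the minimums in the table, a version-aware comparison is enough. The sketch below assumes GNU `sort -V` is available; the helper names `version_at_least` and `kernel_version` are illustrative and not part of any tool mentioned here. Pre-release tags such as `-rc91` are not guaranteed to sort the way semver expects, so compare those by hand.

```shell
# version_at_least HAVE WANT: succeeds when HAVE >= WANT (version-aware compare).
# Assumes GNU coreutils `sort -V`; the function name is illustrative.
version_at_least() {
    local have="$1" want="$2"
    # The smaller version sorts first, so HAVE passes when WANT comes out on top.
    [ "$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n1)" = "$want" ]
}

# Strip the distro suffix from `uname -r` (e.g. 5.4.0-150-generic -> 5.4.0).
kernel_version() {
    uname -r | cut -d- -f1
}

if version_at_least "$(kernel_version)" "5.6"; then
    echo "kernel OK for cgroup v2, including the hugetlb controller"
elif version_at_least "$(kernel_version)" "4.15"; then
    echo "kernel meets the minimum, but lacks the v2 hugetlb controller"
else
    echo "kernel is below the 4.15 minimum"
fi
```

The same helper works for the other rows, for example `version_at_least "$(containerd --version | awk '{print $3}' | tr -d v)" 1.4.4`, assuming the usual `containerd --version` output layout.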
Enablement Steps
1. Upgrade the kernel
Kernel 5.4 or 5.15 is recommended. Kernels below 4.15 lack the cgroup v2 CPU controller and cannot be used.
2. Upgrade containerd
containerd 1.4.4+ automatically detects the host cgroup version. No additional configuration is required.
3. Set kernel boot parameters
Add to the GRUB command line or kernel parameters:
    cgroup_no_v1=all
Optional parameters for IO debugging and optimization:
    blk_cgroup.blkcg_debug_stats=1  # Enable blkcg debug statistics
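One way to apply a boot parameter on GRUB-based hosts is to append it to `GRUB_CMDLINE_LINUX`. The following is a minimal sketch, assuming GNU `sed` and a Debian or RHEL style `/etc/default/grub`; the helper name `add_kernel_param` is hypothetical. On systemd hosts that still default to the hybrid layout, `systemd.unified_cgroup_hierarchy=1` is commonly added the same way, which is an addition to the parameters listed above.

```shell
# add_kernel_param FILE PARAM: append PARAM to GRUB_CMDLINE_LINUX in FILE if absent.
# FILE is normally /etc/default/grub (assumption); the helper name is illustrative.
add_kernel_param() {
    local file="$1" param="$2"
    # Idempotent: do nothing when the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX=.*${param}" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote.
    sed -i "s|^GRUB_CMDLINE_LINUX=\"\(.*\)\"|GRUB_CMDLINE_LINUX=\"\1 ${param}\"|" "$file"
}

# Example (run as root, then regenerate the GRUB config and reboot):
#   add_kernel_param /etc/default/grub cgroup_no_v1=all
#   update-grub                                  # Debian/Ubuntu
#   grub2-mkconfig -o /boot/grub2/grub.cfg       # RHEL/CentOS
```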
After reboot, verify:
    # Confirm cgroup v2 is active
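A common check is the filesystem type mounted at `/sys/fs/cgroup`: GNU `stat` reports `cgroup2fs` for the unified hierarchy and `tmpfs` for V1 or hybrid layouts. The sketch below wraps that check; the function name `cgroup_mode` is illustrative.

```shell
# cgroup_mode FSTYPE: map the filesystem type of /sys/fs/cgroup to a cgroup mode.
cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "v2 (unified)" ;;
        tmpfs)     echo "v1 or hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# GNU stat prints "cgroup2fs" when the unified hierarchy is mounted here.
fstype="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo none)"
echo "cgroup mode: $(cgroup_mode "$fstype")"

# cgroup.controllers exists only at the root of a v2 hierarchy.
if [ -r /sys/fs/cgroup/cgroup.controllers ]; then
    echo "controllers: $(cat /sys/fs/cgroup/cgroup.controllers)"
fi
```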
4. Load the BFQ scheduler module
IO weight scheduling requires the BFQ scheduler:
    modprobe bfq
Persist this across reboots by adding it to a systemd service or /etc/modules.
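As a sketch of the persistence step, the helpers below write a module-load file and read back the active scheduler. They assume a systemd host where `/etc/modules-load.d/*.conf` is honored at boot (an alternative to `/etc/modules`); the function names and the `sda` device in the example are hypothetical.

```shell
# persist_bfq DIR: write a modules-load.d style file so bfq is loaded at boot.
# DIR is normally /etc/modules-load.d on systemd hosts (assumption).
persist_bfq() {
    printf 'bfq\n' > "$1/bfq.conf"
}

# active_scheduler LINE: extract the bracketed (active) entry from the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "mq-deadline kyber [bfq] none" -> "bfq".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example (run as root; sda is a placeholder device):
#   persist_bfq /etc/modules-load.d
#   echo bfq > /sys/block/sda/queue/scheduler
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```

Loading the module only makes BFQ available; each block device still has to select it through its `queue/scheduler` file, which does not persist across reboots on its own.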
5. Upgrade or patch Kubernetes
Kubernetes 1.19+ includes native cgroup v2 support. Key upstream commits:
- `bb5ed1b7`: initial cgroupv2 support in kubelet
- `a9772b22`: CPU hard cap on cgroup v2
- `79be8be1`: make hugetlb cgroup optional (compatibility with older kernels)
- `26d94ad6`: skip device cgroup configuration under v2
For clusters running older Kubernetes versions, these commits can be cherry-picked individually.
Compatibility Verification
The following results are from a test environment running kernel 5.4, Kubernetes 1.23, and containerd 1.4.4.
CPU controller
Cgroup V2 uses cpu.max in place of V1’s cpu.cfs_quota_us + cpu.cfs_period_us:
    # Pod-level (total CPU limit across all containers)
Behavior is identical to V1. No application changes required.
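The mapping from a CPU limit to `cpu.max` is simple arithmetic: the file holds `<quota_us> <period_us>`, and a V1 quota of `-1` becomes the literal `max`. A minimal sketch, assuming the kubelet's default 100 ms CFS period; both function names are illustrative.

```shell
# cpu_max_for MILLICORES [PERIOD_US]: print the cpu.max value for a CPU limit.
cpu_max_for() {
    local millicores="$1" period="${2:-100000}"
    # quota_us = millicores / 1000 * period_us
    echo "$(( millicores * period / 1000 )) ${period}"
}

# cpu_max_from_v1 QUOTA_US PERIOD_US: translate V1 cfs values; -1 means unlimited.
cpu_max_from_v1() {
    if [ "$1" -lt 0 ]; then echo "max $2"; else echo "$1 $2"; fi
}

cpu_max_for 500                  # 50000 100000
cpu_max_for 2000                 # 200000 100000
cpu_max_from_v1 -1 100000        # max 100000
```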
Memory controller
memory.max replaces V1’s memory.limit_in_bytes:
    # Pod total memory limit (128Mi + 64Mi = 192Mi)
Note: V2 introduces `memory.min` (hard guarantee), `memory.low` (soft guarantee), and `memory.high` (high watermark throttle). Kubernetes does not currently use these fields by default; they are left at `0` or `max`. Upstream KEP #2570 is tracking Memory QoS support.
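The pod-level value can be checked by hand: `memory.max` holds the limit in bytes, so two containers limited to 128Mi and 64Mi should produce 192Mi at the pod cgroup. A small sketch of that arithmetic and the comparison; the helper names are illustrative, and the cgroup directory in the example is a placeholder because the actual path depends on the cgroup driver and pod UID.

```shell
# mi_to_bytes N: convert mebibytes to bytes.
mi_to_bytes() { echo "$(( $1 * 1024 * 1024 ))"; }

# check_memory_max CGROUP_DIR EXPECTED_BYTES: succeed when memory.max matches.
check_memory_max() {
    [ "$(cat "$1/memory.max")" = "$2" ]
}

# Pod-level memory.max is the sum of the container limits: 128Mi + 64Mi = 192Mi.
pod_limit_bytes="$(( $(mi_to_bytes 128) + $(mi_to_bytes 64) ))"
echo "$pod_limit_bytes"   # 201326592

# Example (placeholder path):
#   check_memory_max /sys/fs/cgroup/<pod-cgroup> "$pod_limit_bytes" && echo "limit matches"
```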
PID controller
The PID controller keeps the same interface file as V1, pids.max; only its location moves into the unified hierarchy:
    # kubelet config
Behavior is identical to V1.
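To confirm the limit on a running pod, read `pids.current` and `pids.max` from its cgroup directory. The sketch below assumes the limit was set through the kubelet (for example via `podPidsLimit`); the function name is illustrative and the directory in the example is a placeholder.

```shell
# pids_usage CGROUP_DIR: print "current/max" for a cgroup's PID controller.
# pids.max contains either a number or the literal "max" (no limit).
pids_usage() {
    local dir="$1"
    echo "$(cat "$dir/pids.current")/$(cat "$dir/pids.max")"
}

# Example (placeholder path):
#   pids_usage /sys/fs/cgroup/<pod-cgroup>
```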
Hugetlb controller
Kernel 5.4 does not support the hugetlb controller in cgroup v2. Kernel 5.6+ is required.
This is the most significant compatibility gap when migrating from V1 to V2. Workloads that request hugepages resources will not have cgroup-level isolation on kernels below 5.6. Assess the impact before migrating nodes that run hugetlb-consuming workloads.
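Whether a node actually exposes the controller can be read from `cgroup.controllers` at the root of the unified hierarchy. A minimal sketch of that check, with an illustrative helper name:

```shell
# has_controller NAME CONTROLLERS: succeed if NAME appears as a whole word in the
# space-separated list read from cgroup.controllers.
has_controller() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

controllers="$(cat /sys/fs/cgroup/cgroup.controllers 2>/dev/null || true)"
if has_controller hugetlb "$controllers"; then
    echo "hugetlb controller available"
else
    echo "hugetlb controller missing: no cgroup-level hugepage isolation on this node"
fi
```

Running this on each candidate node before the rollout gives a direct answer that does not depend on parsing kernel version strings.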
IO controller (key new capability)
Cgroup V2 IO interfaces:
| Interface | Purpose |
|---|---|
| `io.max` | Hard limit on bandwidth (BPS) or IOPS; applies to both Direct and Buffer IO |
| `io.weight` | Proportional IO bandwidth allocation (requires BFQ/CFQ scheduler) |
| `io.stat` | Per-device IO statistics |
| `io.cost.qos` | Latency-based cost model QoS configuration |
| `io.cost.model` | Physical disk model parameters for iocost |
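Each line of `io.stat` starts with a `MAJ:MIN` device number followed by `key=value` pairs such as `rbytes`, `wbytes`, `rios`, and `wios`. The sketch below pulls one counter for one device; the function name is illustrative and the cgroup path in the example is a placeholder.

```shell
# io_stat_field DEV KEY: read io.stat on stdin and print KEY's value for device DEV.
io_stat_field() {
    awk -v dev="$1" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }'
}

# Example (placeholder path):
#   io_stat_field 8:0 wbytes < /sys/fs/cgroup/<cgroup>/io.stat
```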
Usage examples:
    # Enable iocost on device 8:0
The critical advancement over V1 is Buffer IO throttling. V1’s blkio.throttle.write_bps_device only controls Direct IO — Buffer IO writes through the page cache and bypasses the throttle entirely. V2 tracks page cache writeback attribution per cgroup, enabling effective Buffer IO throttling. For most real-world workloads (databases, log writers), which use Buffer IO predominantly, this is transformative.
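To try Buffer IO throttling, write a limit line to a cgroup's `io.max`. Each line is a `MAJ:MIN` device number followed by any of `rbps`, `wbps`, `riops`, and `wiops`. This is a minimal sketch: the helper names are illustrative, and the cgroup path and the 10 MiB/s figure in the example are assumptions chosen for the test.

```shell
# mbps_to_bps N: convert MiB/s to bytes per second.
mbps_to_bps() { echo "$(( $1 * 1024 * 1024 ))"; }

# io_max_line MAJ:MIN KEY=VALUE...: build one line for io.max.
# Valid keys are rbps, wbps, riops and wiops; the value "max" removes a limit.
io_max_line() {
    local dev="$1"; shift
    echo "${dev} $*"
}

io_max_line 8:0 "wbps=$(mbps_to_bps 10)" "wiops=1000"   # 8:0 wbps=10485760 wiops=1000

# Example (run as root; placeholder cgroup path):
#   io_max_line 8:0 "wbps=$(mbps_to_bps 10)" > /sys/fs/cgroup/<cgroup>/io.max
#   # From a process inside that cgroup, a buffered write (no oflag=direct):
#   dd if=/dev/zero of=/data/testfile bs=1M count=512
```

With the limit in place, sustained buffered write throughput from that cgroup should settle near the configured value once dirty pages start being written back.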
Migration Checklist
- Rolling node upgrade: kernel boot parameter changes require a reboot; use cordon + drain + reboot per node to roll smoothly
- Hugetlb assessment: verify that nodes running hugetlb workloads will run kernel 5.6+, or exclude them from the initial rollout
- BFQ persistence: `modprobe bfq` must be persistent across reboots via a systemd unit or `/etc/modules`
- iocost stability: `io.cost.qos ctrl=auto` can cause CPU soft lockups under high-concurrency IO; prefer `ctrl=user` in production until thoroughly validated
- Monitoring: update cAdvisor to v0.36.1+ to ensure cgroup v2 IO metrics are collected
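The rolling upgrade item can be sketched as a per-node command sequence. The helper below only prints the commands (a dry run) so they can be reviewed first. It assumes SSH access to the node and kubectl 1.20+ (older releases use `--delete-local-data` instead of `--delete-emptydir-data`); the function name and the node name are hypothetical.

```shell
# roll_node_cmds NODE: print the per-node command sequence; nothing is executed.
roll_node_cmds() {
    local node="$1"
    cat <<EOF
kubectl cordon ${node}
kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data
ssh ${node} sudo systemctl reboot
kubectl wait --for=condition=Ready node/${node} --timeout=15m
kubectl uncordon ${node}
EOF
}

roll_node_cmds node-1
```

In a real run, confirm that the node has actually gone NotReady and come back before uncordoning, since `kubectl wait` can return immediately if it runs before the reboot takes effect.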