Migrating a Kubernetes Cluster to Cgroup V2: Requirements, Enablement, and Compatibility

Cgroup V2 offers a unified hierarchy, better IO QoS — including buffer IO throttling — and cleaner semantics compared to V1. This post documents the end-to-end process of migrating a production Kubernetes cluster to cgroup v2: component version requirements, kernel boot parameters, and compatibility verification results for CPU, memory, PID, hugetlb, and IO controllers.

Why Migrate to Cgroup V2

Problems with Cgroup V1

Cgroup V1 allows multiple independent hierarchies to coexist, one per controller. This leads to:

  • The same process appearing at different positions in different hierarchies
  • Inconsistent controller behavior that is difficult to reason about
  • Double-counting in resource statistics

What Cgroup V2 Improves

V2 merges all controllers into a single unified hierarchy:

/sys/fs/cgroup/
├── system.slice/
├── user.slice/
└── kubepods/                ← Kubernetes hierarchy, same path as V1
    ├── burstable/
    │   └── pod-xxx/
    │       ├── cpu.max
    │       ├── memory.max
    │       └── io.max
    └── guaranteed/

Key improvements:

  • Unified hierarchy: all controllers under one directory tree
  • Buffer IO throttling: V1 only throttles Direct IO; V2 io.max applies to Buffer IO as well
  • IO weight scheduling: io.weight proportional bandwidth allocation with the BFQ scheduler (CFQ was removed in kernel 5.0)
  • iocost model: latency-based IO cost model for smarter QoS control

The primary driver for migrating Kubernetes clusters to cgroup v2 is disk IO QoS — particularly the ability to throttle buffer IO, which is critical for improving resource utilization on nodes shared by mixed workloads.


Component Version Requirements

Component    | Minimum     | Recommended | Notes
-------------|-------------|-------------|--------------------------------------
Linux Kernel | 4.15        | 5.2+        | 5.6+ required for hugetlb controller
Kubernetes   | 1.19        | 1.25+       | Official cgroup v2 support from 1.19
containerd   | 1.4.4       | 1.6+        | Must support unified cgroup hierarchy
runc         | v1.0.0-rc91 | v1.0.0+     | Full unified mode since rc93
systemd      | 244         | 249+        | cpuset controller delegation from 244
cAdvisor     | v0.36.1     | v0.45+      | cgroup v2 metrics from v0.36.1
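
A node can be checked against the kernel minimums in this table before it is scheduled for migration. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option is available; version_ge is a hypothetical helper name, and only the kernel rows are checked.

```shell
#!/bin/sh
# version_ge A B: succeed when version A is greater than or equal to version B.
# Relies on GNU "sort -V"; that is an assumption about the host.
# A leading "v" and anything after the first "-" are dropped, so pre-release
# tags such as "-rc91" are NOT compared. This is adequate for kernel releases.
version_ge() {
    a=$(printf '%s' "$1" | sed 's/^v//; s/-.*$//')
    b=$(printf '%s' "$2" | sed 's/^v//; s/-.*$//')
    lowest=$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)
    [ "$lowest" = "$b" ]
}

kernel=$(uname -r)
if version_ge "$kernel" 4.15; then
    echo "kernel $kernel meets the 4.15 minimum"
else
    echo "kernel $kernel is below the 4.15 minimum"
fi
if ! version_ge "$kernel" 5.6; then
    echo "kernel $kernel lacks the cgroup v2 hugetlb controller (needs 5.6+)"
fi
```

The same helper could be pointed at other component versions, but release-candidate tags would need extra handling first.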

Enablement Steps

1. Upgrade the kernel

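Kernel 5.4 or 5.15 is recommended. Cgroup v2 itself was introduced in kernel 4.5, but the v2 cpu controller only arrived in 4.15, so kernels below 4.15 are not usable for Kubernetes on cgroup v2.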

2. Upgrade containerd

containerd 1.4.4+ automatically detects the host cgroup version. No additional configuration is required.

3. Set kernel boot parameters

Add to the GRUB command line or kernel parameters:

cgroup_no_v1=all
systemd.unified_cgroup_hierarchy=1
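
On GRUB 2 systems the two parameters can be appended with a small script. This is a sketch under stated assumptions: the distribution keeps its settings in a file such as /etc/default/grub with a double-quoted GRUB_CMDLINE_LINUX line, GNU sed is installed, and add_cgroup_v2_params is a hypothetical helper name. The bootloader configuration still has to be regenerated afterwards with the distribution's own tool (for example update-grub on Debian/Ubuntu or grub2-mkconfig on RHEL-family systems).

```shell
#!/bin/sh
# add_cgroup_v2_params FILE
# Appends the cgroup v2 boot parameters to the GRUB_CMDLINE_LINUX="..." line
# in FILE. Parameters that are already present are left alone, so the
# function can be re-run safely.
add_cgroup_v2_params() {
    grub_file="$1"
    for param in cgroup_no_v1=all systemd.unified_cgroup_hierarchy=1; do
        if grep -q "^GRUB_CMDLINE_LINUX=.*${param}" "$grub_file"; then
            continue
        fi
        # Insert the parameter just before the closing double quote.
        sed -i "s/^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"/\1 ${param}\"/" "$grub_file"
    done
}

# Example (run as root on a real node, path is an assumption):
#   add_cgroup_v2_params /etc/default/grub
```

After the reboot, the active parameters can be confirmed by reading /proc/cmdline.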

Optional parameters for IO debugging and optimization:

blk_cgroup.blkcg_debug_stats=1   # Enable blkcg debug statistics
scsi_mod.use_blk_mq=1            # Enable SCSI multi-queue (only meaningful before kernel 5.0; later kernels always use blk-mq)

After reboot, verify:

# Confirm cgroup v2 is active
mount | grep cgroup
# Expected: cgroup2 on /sys/fs/cgroup type cgroup2

ls /sys/fs/cgroup/
# Should see: cgroup.controllers, cpu.stat, io.stat, etc.
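
The same check can be scripted for use in node health checks. A minimal sketch: stat -fc %T prints the filesystem type of the mount point, which is cgroup2fs on a unified (v2) host and tmpfs on a V1 or hybrid layout. The function name cgroup_mode_from_fstype is hypothetical.

```shell
#!/bin/sh
# cgroup_mode_from_fstype FSTYPE
# Maps the filesystem type of /sys/fs/cgroup to a cgroup mode.
cgroup_mode_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;
        tmpfs)     echo "v1-or-hybrid" ;;
        *)         echo "unknown" ;;
    esac
}

# stat -f reports on the filesystem; %T prints its type in readable form.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "none")
echo "cgroup mode: $(cgroup_mode_from_fstype "$fstype")"
```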

4. Load the BFQ scheduler module

IO weight scheduling requires the BFQ scheduler:

modprobe bfq
# Switch a disk to BFQ
echo bfq > /sys/block/sda/queue/scheduler

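Persist both steps across reboots: load the module via /etc/modules (or /etc/modules-load.d/ on systemd-based distributions), and reapply the scheduler selection from a systemd service or a udev rule.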
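
One way to make both the module load and the scheduler choice survive a reboot is sketched here. It uses systemd's modules-load.d mechanism and a udev rule rather than a custom systemd service. The file names, the sd[a-z] device match, and the helper name install_bfq_persistence are illustrative assumptions and should be adapted to the node's actual disks.

```shell
#!/bin/sh
# install_bfq_persistence ROOT
# Writes two files under ROOT (use ROOT=/ on a real node, as root):
#   etc/modules-load.d/bfq.conf            loads the bfq module at boot
#   etc/udev/rules.d/60-ioscheduler.rules  selects bfq for matching disks
install_bfq_persistence() {
    root="${1%/}"
    mkdir -p "$root/etc/modules-load.d" "$root/etc/udev/rules.d"
    echo "bfq" > "$root/etc/modules-load.d/bfq.conf"
    cat > "$root/etc/udev/rules.d/60-ioscheduler.rules" <<'EOF'
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"
EOF
}

# Example: stage the files in a scratch directory for review first.
#   install_bfq_persistence /tmp/bfq-staging
```

Writing to a staging directory first makes it easy to review the generated files before copying them onto nodes.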

5. Upgrade or patch Kubernetes

Kubernetes 1.19+ includes native cgroup v2 support. Key upstream commits:

  • bb5ed1b7: initial cgroupv2 support in kubelet
  • a9772b22: CPU hard cap on cgroup v2
  • 79be8be1: make hugetlb cgroup optional (compatibility with older kernels)
  • 26d94ad6: skip device cgroup configuration under v2

For clusters running older Kubernetes versions, these commits can be cherry-picked individually.


Compatibility Verification

The following results are from a test environment running kernel 5.4, Kubernetes 1.23, and containerd 1.4.4.

CPU controller

Cgroup V2 uses cpu.max in place of V1’s cpu.cfs_quota_us + cpu.cfs_period_us:

# Pod-level (total CPU limit across all containers)
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/cpu.max
# Example: 75600 100000
# Meaning: up to 75.6ms of CPU per 100ms period ≈ 0.756 cores

# Per-container limits
cat .../pod<uid>/<container-id>/cpu.max
# 25600 100000 → 0.256 cores (250m limit)
# 50000 100000 → 0.5 cores (500m limit)

Behavior is identical to V1. No application changes required.
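
The cpu.max values above follow quota = limit in millicores × period / 1000. The sketch below converts in both directions; the function names are hypothetical, and the 100000 µs default period matches the outputs shown above.

```shell
#!/bin/sh
# cpu_max_for_millicores MILLICORES [PERIOD_US]
# Prints the "quota period" pair that cpu.max would hold for a CPU limit.
cpu_max_for_millicores() {
    millicores="$1"
    period="${2:-100000}"
    echo "$(( millicores * period / 1000 )) $period"
}

# millicores_from_cpu_max "QUOTA PERIOD"
# The inverse conversion; a quota of "max" means no limit is set.
millicores_from_cpu_max() {
    set -- $1    # split "QUOTA PERIOD" into $1 and $2
    if [ "$1" = "max" ]; then
        echo "unlimited"
    else
        echo "$(( $1 * 1000 / $2 ))"
    fi
}

cpu_max_for_millicores 500               # prints: 50000 100000
millicores_from_cpu_max "75600 100000"   # prints: 756
```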

Memory controller

memory.max replaces V1’s memory.limit_in_bytes:

# Pod total memory limit (128Mi + 64Mi = 192Mi)
cat .../pod<uid>/memory.max
# 201326592

# Container 1: 64Mi
cat .../pod<uid>/<container1>/memory.max
# 67108864

# Container 2: 128Mi
cat .../pod<uid>/<container2>/memory.max
# 134217728
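
The pod-level value above equals the sum of the two container limits (67108864 + 134217728 = 201326592). A small sketch for checking this on a node follows. The helper name sum_container_memory_max is hypothetical, and it assumes the container cgroups are the immediate children of the pod cgroup; children with an unlimited value of max are skipped.

```shell
#!/bin/sh
# sum_container_memory_max POD_DIR
# Adds up memory.max across the immediate child cgroups of a pod cgroup.
sum_container_memory_max() {
    pod_dir="$1"
    total=0
    for f in "$pod_dir"/*/memory.max; do
        [ -r "$f" ] || continue
        value=$(cat "$f")
        # "max" means unlimited and cannot be added to a byte count.
        [ "$value" = "max" ] && continue
        total=$(( total + value ))
    done
    echo "$total"
}

# Example, with POD_CGROUP set to the pod's cgroup directory:
#   sum_container_memory_max "$POD_CGROUP"
#   cat "$POD_CGROUP/memory.max"
```

The two numbers should match only when every container in the pod declares a memory limit.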

Note: V2 introduces memory.min (hard guarantee), memory.low (soft guarantee), and memory.high (high watermark throttle). Kubernetes does not currently use these fields — they are set to 0 or max. Upstream KEP #2571 is tracking Memory QoS support.

PID controller

pids.max keeps the same file name and semantics as in V1:

# kubelet config
grep podPidsLimit /var/lib/kubelet/config.yaml
# podPidsLimit: 24

cat .../pod<uid>/pids.max
# 24

# When the limit is exceeded:
# sh: can't fork: Resource temporarily unavailable

Behavior is identical to V1.
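
To see how close a pod is to its limit before forks start failing, pids.current can be compared with pids.max. A minimal sketch, with pids_headroom as a hypothetical helper name:

```shell
#!/bin/sh
# pids_headroom CGROUP_DIR
# Prints how many more tasks the cgroup can create before reaching pids.max,
# or "unlimited" when pids.max holds the literal value "max".
pids_headroom() {
    dir="$1"
    limit=$(cat "$dir/pids.max")
    current=$(cat "$dir/pids.current")
    if [ "$limit" = "max" ]; then
        echo "unlimited"
    else
        echo "$(( limit - current ))"
    fi
}

# Example, with POD_CGROUP set to the pod's cgroup directory:
#   pids_headroom "$POD_CGROUP"
```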

Hugetlb controller

Kernel 5.4 does not support the hugetlb controller in cgroup v2. Kernel 5.6+ is required.

This is the most significant compatibility gap when migrating from V1 to V2. Workloads that request hugepages resources will not have cgroup-level isolation on kernels below 5.6. Assess the impact before migrating nodes that run hugetlb-consuming workloads.
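
Whether a given node exposes the hugetlb controller can be read from the root cgroup.controllers file. The following is a sketch; has_controller is a hypothetical helper, and the default path assumes cgroup v2 is mounted at /sys/fs/cgroup.

```shell
#!/bin/sh
# has_controller NAME [FILE]
# Succeeds when NAME is listed in a cgroup.controllers file.
has_controller() {
    name="$1"
    file="${2:-/sys/fs/cgroup/cgroup.controllers}"
    [ -r "$file" ] || return 1
    # The file holds one space-separated line. Put each name on its own line
    # so that grep -x matches whole names only ("cpu" must not match "cpuset").
    tr ' ' '\n' < "$file" | grep -qx "$name"
}

if has_controller hugetlb; then
    echo "hugetlb controller available"
else
    echo "hugetlb controller missing: hugepages will not be isolated per cgroup"
fi
```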

IO controller (key new capability)

Cgroup V2 IO interfaces:

Interface     | Purpose
--------------|--------------------------------------------------------------
io.max        | Hard limit on bandwidth (BPS) or IOPS; applies to both Direct and Buffer IO
io.weight     | Proportional IO bandwidth allocation (requires the BFQ scheduler; CFQ was removed in kernel 5.0)
io.stat       | Per-device IO statistics
io.cost.qos   | Latency-based cost model QoS configuration
io.cost.model | Physical disk model parameters for iocost

Usage examples:

# Enable iocost on device 8:0
echo "8:0 enable=1" > /sys/fs/cgroup/io.cost.qos

# Limit pod write bandwidth to 100 MB/s
echo "8:0 wbps=104857600" > .../pod<uid>/io.max

# Set IO weight ratio 1:9 between two containers
echo "100" > .../pod<uid>/<container1>/io.weight
echo "900" > .../pod<uid>/<container2>/io.weight

The critical advancement over V1 is Buffer IO throttling. V1’s blkio.throttle.write_bps_device only controls Direct IO: Buffer IO writes go through the page cache and bypass the throttle entirely. V2 tracks page cache writeback attribution per cgroup, enabling effective Buffer IO throttling. Many real-world workloads (log writers, and databases that write through the page cache) rely predominantly on Buffer IO, so this is what makes write throttling effective for them.
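
The byte value in the io.max example above is 100 × 1024 × 1024. A small sketch for building such lines from a rate in MiB/s follows; io_max_line is a hypothetical helper name, and it covers only the bandwidth keys rbps and wbps.

```shell
#!/bin/sh
# io_max_line DEVICE KEY MIB_PER_SEC
# Builds a line suitable for writing to io.max. DEVICE is MAJOR:MINOR
# (for example 8:0), KEY is rbps or wbps, and the rate is converted
# from MiB/s to bytes/s.
io_max_line() {
    device="$1"
    key="$2"
    mib_per_sec="$3"
    case "$key" in
        rbps|wbps) ;;
        *) echo "unsupported key: $key" >&2; return 1 ;;
    esac
    echo "$device $key=$(( mib_per_sec * 1024 * 1024 ))"
}

io_max_line 8:0 wbps 100    # prints: 8:0 wbps=104857600
```

The MAJOR:MINOR pair for a disk can usually be looked up with lsblk -dno MAJ:MIN /dev/sda.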


Migration Checklist

  1. Rolling node upgrade: kernel boot parameter changes require a reboot; use cordon + drain + reboot per node to roll smoothly
  2. Hugetlb assessment: verify that nodes running hugetlb workloads will run kernel 5.6+, or exclude them from the initial rollout
  3. BFQ persistence: modprobe bfq must be persistent across reboots via a systemd unit or /etc/modules
  4. iocost stability: io.cost.qos ctrl=auto can cause CPU soft lockups under high-concurrency IO; prefer ctrl=user in production until thoroughly validated
  5. Monitoring: update cAdvisor to v0.36.1+ to ensure cgroup v2 IO metrics are collected
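
The cordon + drain + reboot cycle from item 1 can be sketched as two helpers around kubectl. This is an outline under assumptions, not a complete rollout tool: the timeouts are arbitrary, the function names are hypothetical, and the reboot itself is left as a comment because how nodes are reached (SSH, a node agent, a cloud API) is site-specific.

```shell
#!/bin/sh
# KUBECTL can be overridden (for example KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# drain_node NODE: stop new pods from landing on the node, then evict
# the existing ones.
drain_node() {
    node="$1"
    $KUBECTL cordon "$node" &&
    $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
}

# restore_node NODE: wait until the rebooted node reports Ready, then
# allow scheduling again.
restore_node() {
    node="$1"
    $KUBECTL wait --for=condition=Ready "node/$node" --timeout=15m &&
    $KUBECTL uncordon "$node"
}

# Per-node sequence:
#   drain_node "$node"
#   ... reboot the node with the new kernel parameters (site-specific) ...
#   restore_node "$node"
```

Running the sequence one node at a time, and checking the cgroup mode on each node before uncordoning it, keeps a bad kernel or boot configuration from spreading across the cluster.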