Cgroup V2 IO Controller Benchmarks: io.max, io.weight, and iocost in Practice

Systematic benchmarks of the Cgroup V2 IO controller on production-grade Kubernetes nodes, covering io.max (hard bandwidth/IOPS limits), io.weight (proportional scheduling), and io.cost.qos (latency-based QoS), across Direct IO vs Buffer IO, raw disk vs LVM, and ext4 vs xfs combinations.

Test Environment

  • OS: Ubuntu 22.04 LTS (kernel 5.15)
  • Kubernetes: v1.23
  • Storage: 3 × 1.7T SSDs (sda/sdb/sdc), LVM VG spanning all three disks
  • Tools: fio + custom IO statistics scripts

Disk layout:

sda  1.7T
└─sda4 1.3T → LVM PV
sdb 1.7T
└─sdb1 1.7T → LVM PV
sdc 1.7T
└─sdc1 1.7T → LVM PV

VG: vg10000 (composed of sda4 + sdb1 + sdc1)

Base fio command:

fio --filename=/mnt/test/xfs/testfile \
    --direct=0 --iodepth=64 --thread --rw=write \
    --ioengine=libaio --bs=4k --size=20G \
    --numjobs=1 --group_reporting --runtime=100 \
    --name=mytest

Baseline Performance (No Throttling)

| 4K Sequential R/W | Read BW | Read IOPS | Write BW | Write IOPS |
|---|---|---|---|---|
| Ext4 Direct | 359 MB/s | 89,887 | 391 MB/s | 97,825 |
| Ext4 Buffer | 336 MB/s | 84,110 | 728 MB/s | 182,086 |
| XFS Direct | 360 MB/s | 89,997 | 400 MB/s | 100,005 |
| XFS Buffer | 300 MB/s | 74,910 | 780 MB/s | 194,982 |

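Buffer IO write bandwidth is nearly double that of Direct IO (728 vs 391 MB/s on ext4, 780 vs 400 MB/s on XFS) because the page cache aggregates small writes before they reach the disk. Read bandwidth shows no such gain.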


io.max Throttling Tests (Case 1)

Single volume, single thread, io.max applied to a single device.

| Case | IO Type | FS | Layer | Config | Actual Write BW | Actual Write IOPS | Disk Util |
|---|---|---|---|---|---|---|---|
| 1-1 | Direct | ext4 | raw | wbps=100MB/s | 100 MB/s | 21,113 | 31% |
| 1-2 | Direct | xfs | raw | wbps=100MB/s | 100 MB/s | 13,382 | 28% |
| 1-3 | Buffer | ext4 | raw | wbps=100MB/s | 100 MB/s | 90 | 25% |
| 1-4 | Buffer | xfs | raw | wbps=100MB/s | 100 MB/s | 90 | 25% |
| 1-5 | Direct | xfs | raw | wiops=1000 | 4 MB/s | 1,000 | 100% |
| 1-6 | Buffer | xfs | raw | wiops=10 | 12.6 MB/s | 14.5 | 3% |
| 1-7 | Direct | xfs | LVM | wiops=1000 | 0.4 MB/s | 100 | 4% |
| 1-8 | Buffer | xfs | LVM | wiops=100 | 475 MB/s | 2 | 1.6% |
| 1-9 | Direct | xfs | LVM | wbps=100MB/s | 100 MB/s | 25,570 | 28% |
| 1-10 | Buffer | xfs | LVM | wbps=100MB/s | 100 MB/s | 2 | 20% |
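
For reference, an io.max entry is one line per device in the form `MAJ:MIN key=value`, with bandwidth limits given in bytes per second. The sketch below only formats such a line. The helper name `io_max_line`, the device number `8:0` (commonly sda), the cgroup path in the comment, and the reading of 100 MB/s as 100 MiB/s are assumptions, not details taken from the test setup.

```shell
# Build one io.max entry. Usage: io_max_line MAJ:MIN KEY VALUE [UNIT]
# KEY is one of rbps, wbps, riops, wiops; UNIT "MiB" scales VALUE to bytes.
io_max_line() {
    dev=$1 key=$2 value=$3 unit=${4:-}
    if [ "$unit" = "MiB" ]; then
        value=$((value * 1024 * 1024))
    fi
    printf '%s %s=%s\n' "$dev" "$key" "$value"
}

# Print entries matching the wbps=100MB/s and wiops=1000 configurations.
io_max_line 8:0 wbps 100 MiB
io_max_line 8:0 wiops 1000

# To apply a limit (as root; the cgroup path is hypothetical):
#   io_max_line 8:0 wbps 100 MiB > /sys/fs/cgroup/fio-test/io.max
```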

Key Finding: Buffer IO IOPS throttling is inaccurate

Case 1-8 (Buffer IO, LVM, wiops=100): actual IOPS was only 2, but bandwidth hit 475 MB/s, far above the roughly 0.4 MB/s that 100 IOPS of 4K writes would allow. Case 1-6 is similar (wiops=10, actual 14.5 IOPS at 12.6 MB/s).

Reason: Buffer IO dirty page writeback uses a block size determined by the kernel’s writeback aggregation policy, not the 4K block size set in fio. With sufficient memory, many small writes are coalesced into large writes, so actual IOPS drops far below the configured limit while bandwidth may exceed it. With less memory, writeback frequency increases and block sizes are smaller.
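
The coalescing can be read off the Case 1 results: dividing measured bandwidth by measured IOPS gives the average size of each write that reached the device. A minimal sketch of that arithmetic, assuming MB means 10^6 bytes and using a hypothetical helper name `avg_io_kb`:

```shell
# Average size per IO in KB, given bandwidth in MB/s and IOPS.
avg_io_kb() {
    awk -v bw="$1" -v iops="$2" 'BEGIN { printf "%.1f\n", bw * 1000 / iops }'
}

avg_io_kb 100 21113   # Case 1-1, Direct IO on ext4
avg_io_kb 100 90      # Case 1-3, Buffer IO on ext4
```

With these inputs the Direct IO case works out to about 4.7 KB per write, close to fio's 4K block size, while the Buffer IO case comes to roughly 1.1 MB per write.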

Conclusion: io.max BPS throttling (wbps) is accurate for Buffer IO. IOPS throttling (wiops) is unreliable for Buffer IO.


Cross-Device and Multi-Volume Tests (Case 2)

| Case | Config | Device 1 Actual | Device 2 Actual | LVM Total |
|---|---|---|---|---|
| 2-1 | dev1=100MB/s, dev2=unlimited | 104 MB/s, 83 IOPS | 489 MB/s, 446 IOPS | IOPS fluctuates [0,100] |
| 2-2 | both devices 100MB/s | 100 MB/s, 100 IOPS | 100 MB/s, 100 IOPS | 100 MB/s, [5,70] IOPS |

Throttling on LVM volumes spanning multiple devices is applied at the underlying device granularity. LVM-level IOPS fluctuates due to striping across devices.
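
Since the limits take effect per underlying device, it helps to list the physical disks beneath a logical volume before writing io.max. The sketch below filters `lsblk` list output down to whole-disk device numbers. The function name `whole_disk_devnums` and the logical volume path in the comment are hypothetical, and targeting whole disks rather than partitions is an assumption about how the limits were applied in these tests.

```shell
# Read "MAJ:MIN TYPE" pairs (one per line) and print the unique whole disks.
whole_disk_devnums() {
    awk '$2 == "disk" { print $1 }' | sort -u
}

# On a real node (the LV path is hypothetical):
#   lsblk -s -l -n -o MAJ:MIN,TYPE /dev/vg10000/lv_test | whole_disk_devnums
# Sample input standing in for lsblk output on a two-disk volume:
printf '%s\n' '253:0 lvm' '8:4 part' '8:0 disk' '8:17 part' '8:16 disk' \
    | whole_disk_devnums
```

Each device number printed would then need its own io.max entry in the cgroup being throttled.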


io.weight Proportional Scheduling Tests (Case 3)

Two LVM volumes, two fio threads, testing io.weight proportional scheduling.

| Case | Scenario | Volume 1 | Volume 2 | Total |
|---|---|---|---|---|
| 3-1.1 | Buffer, no limit | 250 MB/s | 250 MB/s | 500 MB/s, 100% util |
| 3-1.2 | Direct, no limit | 220 MB/s | 220 MB/s | 440 MB/s, 100% util |
| 3-2 | Direct, weight 100:300, user qos | 206 MB/s | 209 MB/s | 420 MB/s |
| 3-4 | Buffer, weight 100:300, user qos | 129 MB/s | 368 MB/s | 500 MB/s |
| 3-6 | Direct, weight 100:900, user qos | 153 MB/s | 155 MB/s | 304 MB/s |
| 3-8 | Buffer, weight 100:900, user qos | 57 MB/s | 442 MB/s | 500 MB/s |

Key Finding: Direct IO io.weight is ineffective under default io.cost.qos

Cases 3-2 and 3-6: weights set to 100:300 and 100:900, but actual bandwidth is nearly equal (~206 vs ~209, ~153 vs ~155).

Reason: The default io.cost.qos automatically adjusts weights based on latency (ctrl=auto). When Direct IO saturates the disk at 100% utilization, the kernel overrides user-configured weights to meet latency targets.

Buffer IO weight scheduling works correctly (3-4: 129 vs 368 ≈ 1:2.9; 3-8: 57 vs 442 ≈ 1:7.8), close to the expected 1:3 and 1:9 ratios.

Conclusion: io.weight works well for Buffer IO. For Direct IO at full disk saturation, iocost automatic adjustment must be disabled (set min=max=100) for weights to take effect.
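
A minimal sketch of the two settings involved, assuming the cgroup v2 file formats as documented for the kernel: io.cost.qos lives in the root cgroup and takes one line per device, while io.weight is set per cgroup. The helper names, the device number `8:0`, and the cgroup paths are hypothetical.

```shell
# io.cost.qos entry that fixes the scaling percentage at 100 (min=max=100),
# so the controller has no room to adjust it.
iocost_qos_fixed() {
    printf '%s enable=1 ctrl=user min=100.00 max=100.00\n' "$1"
}

# Default io.weight entry for a cgroup (valid range is 1 to 10000).
io_weight_default() {
    printf 'default %s\n' "$1"
}

iocost_qos_fixed 8:0
io_weight_default 300

# To apply (as root; cgroup names are hypothetical):
#   iocost_qos_fixed 8:0   > /sys/fs/cgroup/io.cost.qos
#   io_weight_default 100  > /sys/fs/cgroup/vol1/io.weight
#   io_weight_default 300  > /sys/fs/cgroup/vol2/io.weight
```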


iocost QoS Precision Control Tests (Case 6)

Disable iocost auto-adjustment (min=max=100), two volumes at weight 100:900.

| Case | Additional Config | Volume 1 | Volume 2 | Total |
|---|---|---|---|---|
| 6-1 | No io.max | 22 MB/s, 83% util | 309 MB/s, 85% util | 322 MB/s |
| 6-2 | No io.max, weight 100:300 | 127 MB/s | 291 MB/s | 405 MB/s |
| 6-3 | io.max 150MB/s each | 18 MB/s | 145 MB/s | 168 MB/s |
| 6-4 | io.max 100MB/s each | 67 MB/s | 102 MB/s | 164 MB/s |
| 6-5 | v2 limit 10MB/s | 48 MB/s | 10 MB/s | 58 MB/s |

Case 6-1 with weight 100:900 shows an actual ratio of about 1:14 (22 vs 309 MB/s), more skewed than the configured 1:9. At roughly 85% disk utilization, iocost still shifts the allocation somewhat. Even so, with auto-adjustment disabled (min=max=100) the split follows the weights far more closely than the near 1:1 split seen in Case 3-6.
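
The ratios quoted for these cases are plain divisions of the two measured bandwidths. A small sketch for reproducing them from the tables, with `bw_ratio` as a hypothetical helper name:

```shell
# Print volume 2's bandwidth as a multiple of volume 1's, e.g. "1:14.0".
bw_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "1:%.1f\n", b / a }'
}

bw_ratio 22 309    # Case 6-1, configured 1:9, min=max=100
bw_ratio 153 155   # Case 3-6, configured 1:9
```

These two calls print 1:14.0 and 1:1.0 respectively.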


Three-Volume Weight Tests (Case 7)

Three LVM volumes with different weights, io.cost.qos set to min=max=100 (auto-adjustment disabled).

| Case | Weight Config | Vol 1 | Vol 2 | Vol 3 | Total |
|---|---|---|---|---|---|
| 7-1 | 100:900:900 | 38 MB/s | 6 MB/s | 191 MB/s | 235 MB/s |
| 7-3 | 100:900:900, v2 limit 10MB/s | 10 MB/s | 9.6 MB/s | 153 MB/s | 172 MB/s |
| 7-4 | 100:900:900, ctrl=auto | 20 MB/s | 112 MB/s | 122 MB/s | 254 MB/s |
| 7-5 | 100:300:900 | 9.7 MB/s | 30 MB/s | 155 MB/s | 194 MB/s |
| 7-6 | 100:300:600 | 14 MB/s | 43 MB/s | 118 MB/s | 175 MB/s |
| 7-9 | No weights (baseline) | 121 MB/s | 119 MB/s | 119 MB/s | 359 MB/s |

Findings:

  1. Case 7-4 with ctrl=auto: vol2 and vol3 (weight 900 each) reached only 112 and 122 MB/s against 20 MB/s for vol1 (weight 100), roughly 1:6 rather than the configured 1:9. Auto-adjustment overrides configured weights.
  2. Enabling io.weight reduces total bandwidth (235 vs 359 MB/s baseline, about 35% lower). io.cost.qos can partially mitigate but not fully eliminate this overhead.
  3. Case 7-2 (ctrl=auto, no min=max) caused a CPU soft lockup on the node, terminating the test.
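
To judge how far a measured split is from its configured weights, one can compute the ideal proportional share for each volume as weight divided by the sum of weights, times the total. The sketch below does this under the simplifying assumption that the measured total would stay the same under an ideal split. The helper name `expected_shares` is hypothetical.

```shell
# Split a total bandwidth across volumes in proportion to their weights.
# Usage: expected_shares TOTAL_MBPS WEIGHT...
expected_shares() {
    total=$1
    shift
    sum=0
    for w in "$@"; do sum=$((sum + w)); done
    for w in "$@"; do
        awk -v t="$total" -v w="$w" -v s="$sum" \
            'BEGIN { printf "%.1f\n", t * w / s }'
    done
}

expected_shares 194 100 300 900   # Case 7-5 total and weights
```

For Case 7-5 this gives about 14.9, 44.8, and 134.3 MB/s, against the measured 9.7, 30, and 155 MB/s.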

Large-Scale Pod Tests (Case 5 & 9)

100 pods, 10 fio threads each, testing the impact of IO scheduler and iocost on node-wide IO behavior.

Case 5: IO Scheduler Comparison

| Scenario | Load Avg | CPU | D-state Procs | Node IO PSI | Total IO BW |
|---|---|---|---|---|---|
| Unlimited (mq-deadline) | 900 | 100% | 30 | 910ms | 300 MB/s |
| io.max throttled (mq-deadline) | 548 | 68.4% | 0 | 450ms | 200 MB/s |
| io.max throttled (bfq) | 700 | 93.4% | 0 | 791ms | 150 MB/s |
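
When repeating a scheduler comparison like this one, it is worth confirming which scheduler is active on each disk before a run. The kernel exposes the list in `/sys/block/<dev>/queue/scheduler` with the current choice in brackets. The sketch below parses that format; the helper name `active_scheduler` and the device `sda` in the comments are illustrative.

```shell
# Extract the active scheduler from the sysfs format, where the current
# choice is wrapped in brackets, e.g. "[mq-deadline] kyber bfq none".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

echo '[mq-deadline] kyber bfq none' | active_scheduler

# On a real node (the write requires root):
#   active_scheduler < /sys/block/sda/queue/scheduler
#   echo bfq > /sys/block/sda/queue/scheduler
```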

Case 9: Container Count vs IO Throughput

Write throughput as container count increases, across different schedulers and control strategies:

| Scenario | 1 container | 2 containers | 6 containers | 18 containers | 50 containers |
|---|---|---|---|---|---|
| SSD xfs, bfq | 500 MB/s | 500 MB/s | 220 MB/s | 220 MB/s | 300 MB/s |
| SSD xfs, mq-deadline | 500 MB/s | 500 MB/s | 500 MB/s | 500 MB/s | 500 MB/s |
| SSD xfs, bfq+iocost | 250 MB/s | 300 MB/s | 340 MB/s | 16–490 MB/s | 35–491 MB/s |
| SSD xfs, bfq+ioweight | 40 MB/s | 42 MB/s | 65 MB/s | 500 MB/s | 500 MB/s |
| NVMe xfs, bfq | 830 MB/s | 900 MB/s | 810 MB/s | 850 MB/s | 750 MB/s |

Key Findings:

  1. BFQ throughput degrades significantly under multiple containers: On SSD, throughput drops from 500 MB/s to 220 MB/s at 6 containers, while mq-deadline holds a stable 500 MB/s throughout.
  2. iocost causes IO instability: With iocost enabled, throughput fluctuates wildly between 16–490 MB/s, with prolonged periods of very low IO before recovering.
  3. BFQ is relatively stable on NVMe: NVMe’s high IOPS characteristics reduce BFQ’s scheduling overhead, making throughput degradation under multiple containers more acceptable.

Summary and Recommendations

| Feature | Scenario | Recommendation |
|---|---|---|
| io.max BPS throttling | Direct IO / Buffer IO | Reliable; use directly |
| io.max IOPS throttling | Direct IO | Reliable |
| io.max IOPS throttling | Buffer IO | Unreliable; avoid |
| io.weight | Buffer IO | Works well |
| io.weight | Direct IO at full disk load | Requires min=max=100 to disable iocost auto-adjustment |
| io.cost.qos ctrl=auto | Any scenario | Can cause CPU soft lockups under high-concurrency IO; use with caution in production |
| IO scheduler | High-concurrency multi-container | mq-deadline more stable than bfq; bfq suitable when weight scheduling is required |

Core conclusion: Cgroup V2’s IO controller is powerful, but the mechanisms interact with each other (iocost can override io.weight; BFQ degrades under high concurrency). For production environments, start with io.max BPS throttling, validate thoroughly, then incrementally introduce weight scheduling and iocost.