Cgroup V2 IO Controller Benchmarks: io.max, io.weight, and iocost in Practice
Systematic benchmarks of the Cgroup V2 IO controller on production-grade Kubernetes nodes, covering io.max (hard bandwidth/IOPS limits), io.weight (proportional scheduling), and io.cost.qos (latency-based QoS), across Direct IO vs Buffer IO, raw disk vs LVM, and ext4 vs xfs combinations.
Test Environment
- OS: Ubuntu 22.04 LTS (kernel 5.15)
- Kubernetes: v1.23
- Storage: 3 × 1.7T SSDs (sda/sdb/sdc), LVM VG spanning all three disks
- Tools: fio + custom IO statistics scripts
Disk layout:
sda 1.7T
Base fio command:
fio -filename=/mnt/test/xfs/testfile \
Baseline Performance (No Throttling)
| 4K Sequential R/W | Read BW | Read IOPS | Write BW | Write IOPS |
|---|---|---|---|---|
| Ext4 Direct | 359 MB/s | 89,887 | 391 MB/s | 97,825 |
| Ext4 Buffer | 336 MB/s | 84,110 | 728 MB/s | 182,086 |
| XFS Direct | 360 MB/s | 89,997 | 400 MB/s | 100,005 |
| XFS Buffer | 300 MB/s | 74,910 | 780 MB/s | 194,982 |
Buffer IO write bandwidth is roughly 1.9× that of Direct IO (728 vs 391 MB/s on ext4, 780 vs 400 MB/s on XFS) due to page cache write aggregation.
io.max Throttling Tests (Case 1)
Single volume, single thread, io.max applied to a single device.
| Case | IO Type | FS | Layer | Config | Actual Write BW | Actual Write IOPS | Disk Util |
|---|---|---|---|---|---|---|---|
| 1-1 | Direct | ext4 | raw | wbps=100MB/s | 100 MB/s | 21,113 | 31% |
| 1-2 | Direct | xfs | raw | wbps=100MB/s | 100 MB/s | 13,382 | 28% |
| 1-3 | Buffer | ext4 | raw | wbps=100MB/s | 100 MB/s | 90 | 25% |
| 1-4 | Buffer | xfs | raw | wbps=100MB/s | 100 MB/s | 90 | 25% |
| 1-5 | Direct | xfs | raw | wiops=1000 | 4 MB/s | 1,000 | 100% |
| 1-6 | Buffer | xfs | raw | wiops=10 | 12.6 MB/s | 14.5 | 3% |
| 1-7 | Direct | xfs | LVM | wiops=1000 | 0.4 MB/s | 100 | 4% |
| 1-8 | Buffer | xfs | LVM | wiops=100 | 475 MB/s | 2 | 1.6% |
| 1-9 | Direct | xfs | LVM | wbps=100MB/s | 100 MB/s | 25,570 | 28% |
| 1-10 | Buffer | xfs | LVM | wbps=100MB/s | 100 MB/s | 2 | 20% |
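The Config column above corresponds to lines written to a cgroup's `io.max` file, which takes a `MAJ:MIN` device number followed by `rbps`, `wbps`, `riops`, or `wiops` keys. The helper below is a minimal sketch of how such a line could be built. It assumes that "100MB/s" means 100 × 1024 × 1024 bytes per second, and the function name, the `8:0` device number, and the `fio-test` cgroup path are placeholders rather than values from the tests.

```shell
# Build one io.max line for a device (sketch, not the original test script).
# Usage: io_max_line MAJ:MIN KEY VALUE
#   KEY is rbps or wbps (VALUE in MiB/s, an assumed unit) or riops or wiops (VALUE in IO/s).
io_max_line() {
    dev=$1 key=$2 value=$3
    case "$key" in
        rbps|wbps)   echo "$dev $key=$((value * 1024 * 1024))" ;;
        riops|wiops) echo "$dev $key=$value" ;;
        *)           echo "unknown io.max key: $key" >&2; return 1 ;;
    esac
}

io_max_line 8:0 wbps 100     # prints: 8:0 wbps=104857600
io_max_line 8:0 wiops 1000   # prints: 8:0 wiops=1000

# Applying a limit needs root and an existing cgroup (hypothetical path):
#   io_max_line 8:0 wbps 100 > /sys/fs/cgroup/fio-test/io.max
```

The real device number for a disk can be looked up with `lsblk -o NAME,MAJ:MIN`.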
Key Finding: Buffer IO IOPS throttling is inaccurate
Case 1-8 (Buffer IO, LVM, wiops=100): actual IOPS was only 2, but bandwidth hit 475 MB/s, far above the roughly 0.4 MB/s that 100 IOPS of 4K writes would imply. Case 1-6 is similar (wiops=10, actual 14.5 IOPS at 12.6 MB/s).
Reason: Buffer IO dirty page writeback uses a block size determined by the kernel’s writeback aggregation policy, not the 4K block size set in fio. With sufficient memory, many small writes are coalesced into large writes, so actual IOPS drops far below the configured limit while bandwidth may exceed it. With less memory, writeback frequency increases and block sizes are smaller.
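The coalescing can be read off the Case 1 table by dividing bandwidth by IOPS to get the average size of the requests that reached the device. The sketch below does that arithmetic with `awk`. The function name is illustrative, and it assumes 1 MB = 1000 KB, so the results are approximate.

```shell
# Average request size in KB = bandwidth (MB/s) * 1000 / IOPS.
# Usage: avg_io_kb BANDWIDTH_MB_S IOPS
avg_io_kb() {
    awk -v bw="$1" -v iops="$2" 'BEGIN { printf "%.0f\n", bw * 1000 / iops }'
}

avg_io_kb 4 1000      # Case 1-5, Direct IO: prints 4
avg_io_kb 12.6 14.5   # Case 1-6, Buffer IO: prints 869
avg_io_kb 100 90      # Case 1-4, Buffer IO: prints 1111
```

Direct IO requests stay at the 4K size set in fio, while Buffer IO requests arrive at the device hundreds of times larger, which is why an IOPS cap barely constrains them.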
Conclusion: io.max BPS throttling (wbps) is accurate for Buffer IO. IOPS throttling (wiops) is unreliable for Buffer IO.
Cross-Device and Multi-Volume Tests (Case 2)
| Case | Config | Device 1 Actual | Device 2 Actual | LVM Total |
|---|---|---|---|---|
| 2-1 | dev1=100MB/s, dev2=unlimited | 104 MB/s, 83 IOPS | 489 MB/s, 446 IOPS | IOPS fluctuates [0,100] |
| 2-2 | both devices 100MB/s | 100 MB/s, 100 IOPS | 100 MB/s, 100 IOPS | 100 MB/s, [5,70] IOPS |
Throttling on LVM volumes spanning multiple devices is applied at the underlying device granularity. LVM-level IOPS fluctuates due to striping across devices.
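Because limits attach to the underlying devices, a multi-disk LVM volume needs one `io.max` entry per physical disk. The sketch below generates those entries. The function name, the `8:0` and `8:16` device numbers, and the cgroup path are assumptions for illustration, and the bandwidth unit is taken to be MiB/s.

```shell
# Emit one io.max line per underlying device (sketch).
# Usage: io_max_lines_bps KEY MIB_PER_S MAJ:MIN [MAJ:MIN ...]
io_max_lines_bps() {
    key=$1 mib=$2
    shift 2
    for dev in "$@"; do
        echo "$dev $key=$((mib * 1024 * 1024))"
    done
}

# Case 2-2 style: the same write limit on two physical disks.
io_max_lines_bps wbps 100 8:0 8:16

# Each line is written separately (root required, hypothetical cgroup path):
#   io_max_lines_bps wbps 100 8:0 8:16 | while read -r line; do
#       echo "$line" > /sys/fs/cgroup/fio-test/io.max
#   done
```

`lsblk -s` on the logical volume lists the devices it depends on, which is one way to find the numbers to pass in.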
io.weight Proportional Scheduling Tests (Case 3)
Two LVM volumes, two fio threads, testing io.weight proportional scheduling.
| Case | Scenario | Volume 1 | Volume 2 | Total |
|---|---|---|---|---|
| 3-1.1 | Buffer, no limit | 250 MB/s | 250 MB/s | 500 MB/s, 100% util |
| 3-1.2 | Direct, no limit | 220 MB/s | 220 MB/s | 440 MB/s, 100% util |
| 3-2 | Direct, weight 100:300, user qos | 206 MB/s | 209 MB/s | 420 MB/s |
| 3-4 | Buffer, weight 100:300, user qos | 129 MB/s | 368 MB/s | 500 MB/s |
| 3-6 | Direct, weight 100:900, user qos | 153 MB/s | 155 MB/s | 304 MB/s |
| 3-8 | Buffer, weight 100:900, user qos | 57 MB/s | 442 MB/s | 500 MB/s |
Key Finding: Direct IO io.weight is ineffective under default io.cost.qos
Cases 3-2 and 3-6: weights set to 100:300 and 100:900, but actual bandwidth is nearly equal (~206 vs ~209, ~153 vs ~155).
Reason: The default io.cost.qos automatically adjusts weights based on latency (ctrl=auto). When Direct IO saturates the disk at 100% utilization, the kernel overrides user-configured weights to meet latency targets.
Buffer IO weight scheduling works correctly (3-4: 129 vs 368 ≈ 1:2.8; 3-8: 57 vs 442 ≈ 1:7.8), close to the expected 1:3 and 1:9 ratios.
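One way to compare configured and measured splits is to express each as volume 1's percentage of the total. The helper below is a small illustrative sketch (the function name is not from the original tests) that does this for both weights and measured bandwidths.

```shell
# Percentage of the combined total that the first value represents.
# Usage: share_pct FIRST SECOND
share_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * a / (a + b) }'
}

share_pct 100 300   # configured share at weight 100:300: prints 25.0
share_pct 129 368   # measured, Case 3-4 (Buffer IO):   prints 26.0
share_pct 206 209   # measured, Case 3-2 (Direct IO):   prints 49.6
```

Buffer IO lands within about one percentage point of the configured share, while Direct IO sits near an even split.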
Conclusion: io.weight works well for Buffer IO. For Direct IO at full disk saturation, iocost automatic adjustment must be disabled (set min=max=100) for weights to take effect.
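The `min=max=100` setting goes into `io.cost.qos`, which exists only in the root cgroup and is keyed by device number. The sketch below formats such a line. The `enable=1` and `ctrl=user` keys are assumptions about how the setting would be written, since the source only gives `min=max=100`, and the function name, device number, and cgroup names are placeholders.

```shell
# Format an io.cost.qos line that pins the vrate range by setting min = max.
# Usage: iocost_qos_fixed MAJ:MIN PERCENT
iocost_qos_fixed() {
    printf '%s enable=1 ctrl=user min=%.2f max=%.2f\n' "$1" "$2" "$2"
}

iocost_qos_fixed 8:0 100   # prints: 8:0 enable=1 ctrl=user min=100.00 max=100.00

# Applying it, together with per-cgroup weights (root required, hypothetical paths):
#   iocost_qos_fixed 8:0 100 > /sys/fs/cgroup/io.cost.qos
#   echo "default 100" > /sys/fs/cgroup/vol1/io.weight
#   echo "default 900" > /sys/fs/cgroup/vol2/io.weight
```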
iocost QoS Precision Control Tests (Case 6)
Disable iocost auto-adjustment (min=max=100), two volumes at weight 100:900.
| Case | Additional Config | Volume 1 | Volume 2 | Total |
|---|---|---|---|---|
| 6-1 | No io.max | 22 MB/s, 83% util | 309 MB/s, 85% util | 322 MB/s |
| 6-2 | No io.max, weight 100:300 | 127 MB/s | 291 MB/s | 405 MB/s |
| 6-3 | io.max 150MB/s each | 18 MB/s | 145 MB/s | 168 MB/s |
| 6-4 | io.max 100MB/s each | 67 MB/s | 102 MB/s | 164 MB/s |
| 6-5 | v2 limit 10MB/s | 48 MB/s | 10 MB/s | 58 MB/s |
Case 6-1 with weight 100:900 shows an actual ratio of ~1:14 (22 vs 309 MB/s), steeper than the configured 1:9. Disk utilization is ~85%, and iocost still shifts the allocation somewhat at that load. Even so, with auto-adjustment disabled (min=max=100) the split follows the configured weights far more closely than the near 1:1 split seen in Cases 3-2 and 3-6 under the default QoS.
Three-Volume Weight Tests (Case 7)
Three LVM volumes with different weights, io.cost.qos set to min=max=100 (auto-adjustment disabled).
| Case | Weight Config | Vol 1 | Vol 2 | Vol 3 | Total |
|---|---|---|---|---|---|
| 7-1 | 100:900:900 | 38 MB/s | 6 MB/s | 191 MB/s | 235 MB/s |
| 7-3 | 100:900:900, v2 limit 10MB/s | 10 MB/s | 9.6 MB/s | 153 MB/s | 172 MB/s |
| 7-4 | 100:900:900, ctrl=auto | 20 MB/s | 112 MB/s | 122 MB/s | 254 MB/s |
| 7-5 | 100:300:900 | 9.7 MB/s | 30 MB/s | 155 MB/s | 194 MB/s |
| 7-6 | 100:300:600 | 14 MB/s | 43 MB/s | 118 MB/s | 175 MB/s |
| 7-9 | No weights (baseline) | 121 MB/s | 119 MB/s | 119 MB/s | 359 MB/s |
Findings:
- Case 7-4 with ctrl=auto: vol3 (weight 900) achieved only 122 MB/s, barely higher than vol2 (112 MB/s) and only about 6× vol1 (20 MB/s, weight 100), short of the configured 1:9 ratio. Auto-adjustment overrides configured weights.
- Enabling io.weight reduces total bandwidth (235 vs 359 MB/s baseline). io.cost.qos can partially mitigate but not fully eliminate this overhead.
- Case 7-2 (ctrl=auto, no min=max) caused a CPU soft lockup on the node, terminating the test.
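When repeating tests like Case 7-2, it can help to check the kernel log for lockup reports after each run. The helper below is a minimal sketch that counts matching lines in log text read from stdin. The function name is illustrative, and the match string assumes the kernel watchdog's usual "soft lockup" wording.

```shell
# Count kernel log lines that report a soft lockup. Reads log text on stdin.
count_soft_lockups() {
    # grep -c exits non-zero when nothing matches, so keep the status clean.
    grep -c 'soft lockup' || true
}

# Typical use on a node (may need root):
#   dmesg | count_soft_lockups
```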
Large-Scale Pod Tests (Case 5 & 9)
100 pods, 10 fio threads each, testing the impact of IO scheduler and iocost on node-wide IO behavior.
Case 5: IO Scheduler Comparison
| Scenario | Load Avg | CPU | D-state Procs | Node IO PSI | Total IO BW |
|---|---|---|---|---|---|
| Unlimited (mq-deadline) | 900 | 100% | 30 | 910ms | 300 MB/s |
| io.max throttled (mq-deadline) | 548 | 68.4% | 0 | 450ms | 200 MB/s |
| io.max throttled (bfq) | 700 | 93.4% | 0 | 791ms | 150 MB/s |
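The scheduler for each disk is exposed in `/sys/block/<dev>/queue/scheduler`, where the active one is shown in square brackets. The sketch below extracts it from that file's contents. The function name and the sample strings are illustrative, and the set of schedulers listed depends on the kernel build.

```shell
# Extract the active scheduler from the contents of
# /sys/block/<dev>/queue/scheduler, e.g. "[mq-deadline] kyber bfq none".
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

active_scheduler "[mq-deadline] kyber bfq none"   # prints: mq-deadline
active_scheduler "mq-deadline kyber [bfq] none"   # prints: bfq

# On a node (switching requires root; sda is a placeholder):
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
#   echo bfq > /sys/block/sda/queue/scheduler
```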
Case 9: Container Count vs IO Throughput
Write throughput as container count increases, across different schedulers and control strategies:
| Scenario | 1 container | 2 containers | 6 containers | 18 containers | 50 containers |
|---|---|---|---|---|---|
| SSD xfs, bfq | 500 MB/s | 500 MB/s | 220 MB/s | 220 MB/s | 300 MB/s |
| SSD xfs, mq-deadline | 500 MB/s | 500 MB/s | 500 MB/s | 500 MB/s | 500 MB/s |
| SSD xfs, bfq+iocost | 250 MB/s | 300 MB/s | 340 MB/s | 16~490 MB/s | 35~491 MB/s |
| SSD xfs, bfq+ioweight | 40 MB/s | 42 MB/s | 65 MB/s | 500 MB/s | 500 MB/s |
| NVMe xfs, bfq | 830 MB/s | 900 MB/s | 810 MB/s | 850 MB/s | 750 MB/s |
Key Findings:
- BFQ throughput degrades significantly under multiple containers: On SSD, throughput drops from 500 MB/s to 220 MB/s at 6 containers, while mq-deadline holds a stable 500 MB/s throughout.
- iocost causes IO instability: With iocost enabled, throughput fluctuates wildly between 16–490 MB/s, with prolonged periods of very low IO before recovering.
- BFQ is relatively stable on NVMe: NVMe’s high IOPS characteristics reduce BFQ’s scheduling overhead, making throughput degradation under multiple containers more acceptable.
Summary and Recommendations
| Feature | Scenario | Recommendation |
|---|---|---|
| io.max BPS throttling | Direct IO / Buffer IO | Reliable; use directly |
| io.max IOPS throttling | Direct IO | Reliable |
| io.max IOPS throttling | Buffer IO | Unreliable; avoid |
| io.weight | Buffer IO | Works well |
| io.weight | Direct IO at full disk load | Requires min=max=100 to disable iocost auto-adjustment |
| io.cost.qos ctrl=auto | Any scenario | Can cause CPU soft lockups under high-concurrency IO; use with caution in production |
| IO scheduler | High-concurrency multi-container | mq-deadline more stable than bfq; bfq suitable when weight scheduling is required |
Core conclusion: Cgroup V2’s IO controller is powerful, but the mechanisms interact with each other (iocost can override io.weight; BFQ degrades under high concurrency). For production environments, start with io.max BPS throttling, validate thoroughly, then incrementally introduce weight scheduling and iocost.