k8s-cgroup

Questions:

  1. In Kubernetes, what do CPU request/limit values of "0.5" and "100m" mean?
  2. With a 4-core/8-thread CPU, what does a CPU limit of "1" actually restrict you to?
  3. What is the practical difference between CPU request and limit?
  4. How does the cgroup enforce these limits?

Linux Cgroup

As is well known, Kubernetes and Docker limit CPU and memory usage in order to isolate those resources between workloads. Underneath, this is implemented with Linux cgroups. Memory is comparatively simple, so this article focuses on the cpu cgroup.

Within cgroups, the CPU-related subsystems are cpuset, cpuacct and cpu.

cpuset is mainly about CPU affinity: it can restrict the processes in a cgroup to run only on specified CPUs (or forbid them from running on specified CPUs), and it can also set memory node affinity. Affinity is only needed in fairly special situations, so it is not covered here.

cpuacct contains CPU usage statistics for the current cgroup. There is not much to it; read its documentation if you are interested. It is not covered here either.

This article covers only the cpu subsystem, including how to cap a cgroup's CPU usage and how to set its relative weight against other cgroups.

Creating a child cgroup

On Ubuntu, systemd has already mounted the cpu subsystem for us; we only need to create a subdirectory under the corresponding mount point.

# From the output below we can see that cpuset is mounted at /sys/fs/cgroup/cpuset,
# while cpu and cpuacct are mounted together at /sys/fs/cgroup/cpu,cpuacct
dev@ubuntu:~$ mount|grep cpu
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)

# Enter /sys/fs/cgroup/cpu,cpuacct and create a child cgroup
dev@ubuntu:~$ cd /sys/fs/cgroup/cpu,cpuacct
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct$ sudo mkdir test
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct$ cd test
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ ls
cgroup.clone_children cpuacct.stat cpuacct.usage_percpu cpu.cfs_quota_us cpu.stat tasks
cgroup.procs cpuacct.usage cpu.cfs_period_us cpu.shares notify_on_release

We only need to pay attention to the files whose names start with cpu.

cpu subsystem

The cpu subsystem allocates CPU time to cgroups. There are currently two scheduling policies:

  • Completely Fair Scheduler (CFS): divides CPU time into appropriate slices and hands them out to cgroups by proportion and weight. CFS supports both relative weights and absolute limits; this is the policy Kubernetes currently uses.
  • Real-Time scheduler (RT): similar to the absolute limits in CFS, but only for real-time tasks. Its parameters can be adjusted at runtime.

  1. Completely Fair Scheduler (CFS):
  • Absolute (hard) limit parameters:

    cpu.cfs_period_us & cpu.cfs_quota_us

cfs_period_us configures the length of the accounting period, and cfs_quota_us configures how much CPU time the cgroup may use within that period; together they set the CPU usage ceiling. Both files are in microseconds (us). cfs_period_us ranges from 1 millisecond (ms) to 1 second (s), and cfs_quota_us only needs to be at least 1 ms. If cfs_quota_us is -1 (the default), CPU time is not limited. A few examples:

1. Limit to 1 CPU (250ms of CPU time every 250ms)
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit to 2 CPUs (cores) (1000ms of CPU time every 500ms, i.e. two cores)
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms */

3. Limit to 20% of 1 CPU (10ms of CPU time every 50ms, i.e. 20% of a single core)
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
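
The pattern behind these examples is simply cpus = cfs_quota_us / cfs_period_us. A minimal Go sketch of that relationship (the helper names cfsQuotaForCPUs and cpusForQuota are mine, for illustration only):

package main

import "fmt"

// cfsQuotaForCPUs returns the cpu.cfs_quota_us value that grants `cpus`
// full CPUs for a given cpu.cfs_period_us value.
func cfsQuotaForCPUs(cpus float64, periodUs int64) int64 {
    return int64(cpus * float64(periodUs))
}

// cpusForQuota is the inverse: how many CPUs a quota/period pair allows.
func cpusForQuota(quotaUs, periodUs int64) float64 {
    return float64(quotaUs) / float64(periodUs)
}

func main() {
    fmt.Println(cfsQuotaForCPUs(2, 500000)) // 1000000, matches example 2
    fmt.Println(cpusForQuota(10000, 50000)) // 0.2, matches example 3
}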

cpu.stat:

contains the following three statistics:

  • nr_periods: how many accounting periods (of length cpu.cfs_period_us) have elapsed
  • nr_throttled: in how many of those periods the cgroup was throttled (i.e. its processes used up the quota before the period ended)
  • throttled_time: the total time for which the cgroup's processes have been throttled, in nanoseconds

  • Relative weight parameter: cpu.shares

cpu.shares sets the CPU weight as a relative value, and it applies across all CPUs (cores). The default is 1024. Suppose the system has two cgroups, A and B, with A's shares set to 1024 and B's to 512; then A gets 1024/(1024+512) = 66% of the CPU and B gets 33%. shares has two characteristics:

  • If A is idle and does not use its 66% of CPU time, the leftover CPU time is handed to B, so B's CPU usage can exceed 33%
  • If a new cgroup C with shares of 1024 is added, A's portion becomes 1024/(1024+512+1024) = 40% and B's becomes 20%

These two characteristics show that:

  • When the CPU is idle, shares has essentially no effect; it only kicks in when the CPU is busy, which is a good property.
  • A shares value on its own says nothing; it only yields a relative quota when compared against the values of the other cgroups. On a machine running many containers the set of cgroups keeps changing, so this relative quota changes too: you may set a high value, but someone else may set an even higher one. shares therefore cannot control CPU usage precisely. The short sketch right after this list illustrates the arithmetic.
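
A minimal sketch of that arithmetic (illustrative only; the shareOf helper is my own name, not a kernel or Kubernetes API):

package main

import "fmt"

// shareOf returns the fraction of CPU time a busy cgroup with `own` shares
// receives when competing with the other busy cgroups in `others`.
func shareOf(own int64, others ...int64) float64 {
    total := own
    for _, s := range others {
        total += s
    }
    return float64(own) / float64(total)
}

func main() {
    fmt.Printf("A vs B:    %.1f%%\n", 100*shareOf(1024, 512))       // 66.7%
    fmt.Printf("A vs B, C: %.1f%%\n", 100*shareOf(1024, 512, 1024)) // 40.0%
}
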
  2. Real-Time scheduler (RT):

cpu.rt_period_us: the accounting period, in us, analogous to the CFS period

cpu.rt_runtime_us: the longest CPU time the cgroup's real-time tasks may consume within one period, in us. For example, with rt_runtime_us = 200000 and rt_period_us = 1000000, on a node with 2 CPUs the cgroup may consume 2 * 0.2 = 0.4s of CPU time per second. This is also an absolute limit.
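
A tiny sketch of that calculation, under the same assumption as the example above that the runtime budget applies per CPU (rtBandwidth is a hypothetical helper of mine):

package main

import "fmt"

// rtBandwidth returns how many seconds of real-time CPU a cgroup may use per
// second of wall-clock time, assuming the runtime budget applies per CPU.
func rtBandwidth(runtimeUs, periodUs int64, nCPU int) float64 {
    return float64(nCPU) * float64(runtimeUs) / float64(periodUs)
}

func main() {
    fmt.Println(rtBandwidth(200000, 1000000, 2)) // 0.4
}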

Example run:

Taking CFS as the example.

# Keep using the child cgroup created above: test
# Limit it to 20% of one CPU's time
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 50000 > cpu.cfs_period_us"
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 10000 > cpu.cfs_quota_us"

# Add the current bash process to this cgroup
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ echo $$
5456
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 5456 > cgroup.procs"

# Start a busy loop in this bash to burn CPU; unconstrained it would use 100% of one core
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ while :; do echo test > /dev/null; done

#-------------------------- open a new shell window ----------------------
# top shows PID 5456 at about 20% CPU, so the limit is working
# The system-wide %us+%sy is only about 10% because my test machine has two cores,
# so overall CPU usage is about 10%
dev@ubuntu:~$ top
Tasks: 139 total, 2 running, 137 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.6 us, 6.2 sy, 0.0 ni, 88.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 499984 total, 15472 free, 81488 used, 403024 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 383332 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5456 dev 20 0 22640 5472 3524 R 20.3 1.1 0:04.62 bash

# And now the throttling statistics can be seen
dev@ubuntu:~$ cat /sys/fs/cgroup/cpu,cpuacct/test/cpu.stat
nr_periods 1436
nr_throttled 1304
throttled_time 51542291833
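
As a small illustration of how these counters could be consumed programmatically, here is a hedged Go sketch that parses a cgroup v1 cpu.stat file and reports how often the cgroup was throttled (the path, function name and output format are my own choices):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// readCPUStat parses a cgroup v1 cpu.stat file into a map of counter values.
func readCPUStat(path string) (map[string]int64, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    stats := make(map[string]int64)
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text()) // e.g. "nr_throttled 1304"
        if len(fields) != 2 {
            continue
        }
        v, err := strconv.ParseInt(fields[1], 10, 64)
        if err != nil {
            continue
        }
        stats[fields[0]] = v
    }
    return stats, sc.Err()
}

func main() {
    stats, err := readCPUStat("/sys/fs/cgroup/cpu,cpuacct/test/cpu.stat")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    if stats["nr_periods"] > 0 {
        ratio := float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
        fmt.Printf("throttled in %.0f%% of periods, %.2fs in total\n",
            100*ratio, float64(stats["throttled_time"])/1e9)
    }
}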

Kubernetes resource control mechanism

Kubernetes uses runc as its runtime. In runc, each cgroup subsystem has a small manager (CpuGroup for the cpu subsystem, MemoryGroup for memory) with two methods: Apply(), which creates the cgroup and places the process into it, and Set(), which writes the resource settings. The principle is simply writing the appropriate entries and numbers into the files described above; there is not much more to it.

The memory Set():

func (s *MemoryGroup) Set(path string, cgroup *configs.Cgroup) error {
    if err := setMemoryAndSwap(path, cgroup); err != nil {
        return err
    }

    if cgroup.Resources.KernelMemory != 0 {
        if err := setKernelMemory(path, cgroup.Resources.KernelMemory); err != nil {
            return err
        }
    }

    if cgroup.Resources.MemoryReservation != 0 {
        if err := writeFile(path, "memory.soft_limit_in_bytes", strconv.FormatInt(cgroup.Resources.MemoryReservation, 10)); err != nil {
            return err
        }
    }

    if cgroup.Resources.KernelMemoryTCP != 0 {
        if err := writeFile(path, "memory.kmem.tcp.limit_in_bytes", strconv.FormatInt(cgroup.Resources.KernelMemoryTCP, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.OomKillDisable {
        if err := writeFile(path, "memory.oom_control", "1"); err != nil {
            return err
        }
    }
    if cgroup.Resources.MemorySwappiness == nil || int64(*cgroup.Resources.MemorySwappiness) == -1 {
        return nil
    } else if *cgroup.Resources.MemorySwappiness <= 100 {
        if err := writeFile(path, "memory.swappiness", strconv.FormatUint(*cgroup.Resources.MemorySwappiness, 10)); err != nil {
            return err
        }
    } else {
        return fmt.Errorf("invalid value:%d. valid memory swappiness range is 0-100", *cgroup.Resources.MemorySwappiness)
    }

    return nil
}

The memory Apply():

func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
    path, err := d.path("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    } else if path == "" {
        return nil
    }
    if memoryAssigned(d.config) {
        if _, err := os.Stat(path); os.IsNotExist(err) {
            if err := os.MkdirAll(path, 0755); err != nil {
                return err
            }
            // Only kernel memory can be configured dynamically
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
        }
    }
    ...
}

func EnableKernelMemoryAccounting(path string) error {
    // Check if kernel memory is enabled
    // We have to limit the kernel memory here as it won't be accounted at all
    // until a limit is set on the cgroup and limit cannot be set once the
    // cgroup has children, or if there are already tasks in the cgroup.
    for _, i := range []int64{1, -1} {
        if err := setKernelMemory(path, i); err != nil {
            return err
        }
    }
    return nil
}

The cpu Set():

func (s *CpuGroup) Set(path string, cgroup *configs.Cgroup) error {
    // Set the CFS parameters
    if cgroup.Resources.CpuShares != 0 {
        if err := writeFile(path, "cpu.shares", strconv.FormatUint(cgroup.Resources.CpuShares, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuPeriod != 0 {
        if err := writeFile(path, "cpu.cfs_period_us", strconv.FormatUint(cgroup.Resources.CpuPeriod, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuQuota != 0 {
        if err := writeFile(path, "cpu.cfs_quota_us", strconv.FormatInt(cgroup.Resources.CpuQuota, 10)); err != nil {
            return err
        }
    }
    // Set the RT parameters
    if err := s.SetRtSched(path, cgroup); err != nil {
        return err
    }

    return nil
}

func (s *CpuGroup) SetRtSched(path string, cgroup *configs.Cgroup) error {
    if cgroup.Resources.CpuRtPeriod != 0 {
        if err := writeFile(path, "cpu.rt_period_us", strconv.FormatUint(cgroup.Resources.CpuRtPeriod, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuRtRuntime != 0 {
        if err := writeFile(path, "cpu.rt_runtime_us", strconv.FormatInt(cgroup.Resources.CpuRtRuntime, 10)); err != nil {
            return err
        }
    }
    return nil
}

The cpu Apply() only sets the RT parameters (before moving the process in):

func (s *CpuGroup) Apply(d *cgroupData) error {
    // We always want to join the cpu group, to allow fair cpu scheduling
    // on a container basis
    path, err := d.path("cpu")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    }
    return s.ApplyDir(path, d.config, d.pid)
}

func (s *CpuGroup) ApplyDir(path string, cgroup *configs.Cgroup, pid int) error {
    // This might happen if we have no cpu cgroup mounted.
    // Just do nothing and don't fail.
    if path == "" {
        return nil
    }
    if err := os.MkdirAll(path, 0755); err != nil {
        return err
    }
    // We should set the real-Time group scheduling settings before moving
    // in the process because if the process is already in SCHED_RR mode
    // and no RT bandwidth is set, adding it will fail.
    if err := s.SetRtSched(path, cgroup); err != nil {
        return err
    }
    // because we are not using d.join we need to place the pid into the procs file
    // unlike the other subsystems
    if err := cgroups.WriteCgroupProc(path, pid); err != nil {
        return err
    }

    return nil
}

K8s Limits & Requests code analysis

The struct that manages cgroups in Kubernetes is defined in k8s.io/kubernetes/pkg/kubelet/cm/cgroup_manager_linux.go:

// cgroupManagerImpl implements the CgroupManager interface.
// Its a stateless object which can be used to
// update,create or delete any number of cgroups
// It uses the Libcontainer raw fs cgroup manager for cgroup management.
type cgroupManagerImpl struct {
    // subsystems holds information about all the
    // mounted cgroup subsystems on the node
    subsystems *CgroupSubsystems
    // simplifies interaction with libcontainer and its cgroup managers
    adapter *libcontainerAdapter
}

Let's look at the Update method:

// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    ...
    // Extract the cgroup resource parameters
    resourceConfig := cgroupConfig.ResourceParameters
    resources := m.toResources(resourceConfig)

    cgroupPaths := m.buildCgroupPaths(cgroupConfig.Name)

    // Work out the cgroupfs location
    abstractCgroupFsName := string(cgroupConfig.Name)
    abstractParent := CgroupName(path.Dir(abstractCgroupFsName))
    abstractName := CgroupName(path.Base(abstractCgroupFsName))

    driverParent := m.adapter.adaptName(abstractParent, false)
    driverName := m.adapter.adaptName(abstractName, false)

    // Get the absolute name when the systemd cgroup driver is used
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        driverName = m.adapter.adaptName(cgroupConfig.Name, false)
    }

    // Initialize the libcontainer cgroup config
    libcontainerCgroupConfig := &libcontainerconfigs.Cgroup{
        Name:      driverName,
        Parent:    driverParent,
        Resources: resources,
        Paths:     cgroupPaths,
    }

    // Write the cgroup configuration
    if err := setSupportedSubsystems(libcontainerCgroupConfig); err != nil {
        return fmt.Errorf("failed to set supported cgroup subsystems for cgroup %v: %v", cgroupConfig.Name, err)
    }
    return nil
}

First, let's see how the parameters are extracted:

func (m *cgroupManagerImpl) toResources(resourceConfig *ResourceConfig) *libcontainerconfigs.Resources {
    resources := &libcontainerconfigs.Resources{}
    if resourceConfig == nil {
        return resources
    }
    if resourceConfig.Memory != nil {
        resources.Memory = *resourceConfig.Memory
    }
    if resourceConfig.CpuShares != nil {
        resources.CpuShares = *resourceConfig.CpuShares
    }
    if resourceConfig.CpuQuota != nil {
        resources.CpuQuota = *resourceConfig.CpuQuota
    }
    if resourceConfig.CpuPeriod != nil {
        resources.CpuPeriod = *resourceConfig.CpuPeriod
    }

    // huge page parameter extraction, skipped
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.HugePages) {
        ...
    }
    return resources
}

Next, let's see where CpuShares, CpuQuota and CpuPeriod come from:

// ResourceConfigForPod takes the input pod and outputs the cgroup resource config.
func ResourceConfigForPod(pod *v1.Pod) *ResourceConfig {
    // sum requests and limits.
    reqs, limits := resource.PodRequestsAndLimits(pod)

    cpuRequests := int64(0)
    cpuLimits := int64(0)
    memoryLimits := int64(0)
    if request, found := reqs[v1.ResourceCPU]; found {
        cpuRequests = request.MilliValue()
    }
    if limit, found := limits[v1.ResourceCPU]; found {
        cpuLimits = limit.MilliValue()
    }
    if limit, found := limits[v1.ResourceMemory]; found {
        memoryLimits = limit.Value()
    }

    // convert to CFS values
    cpuShares := MilliCPUToShares(cpuRequests)
    cpuQuota, cpuPeriod := MilliCPUToQuota(cpuLimits)
    ...
}

// MilliCPUToShares converts the milliCPU to CFS shares.
func MilliCPUToShares(milliCPU int64) uint64 {
    if milliCPU == 0 {
        // Docker converts zero milliCPU to unset, which maps to kernel default
        // for unset: 1024. Return 2 here to really match kernel default for
        // zero milliCPU.
        return MinShares
    }
    // Conceptually (milliCPU / milliCPUToCPU) * sharesPerCPU, but factored to improve rounding.
    shares := (milliCPU * SharesPerCPU) / MilliCPUToCPU
    if shares < MinShares {
        return MinShares
    }
    return uint64(shares)
}

// MilliCPUToQuota converts milliCPU to CFS quota and period values.
func MilliCPUToQuota(milliCPU int64) (quota int64, period uint64) {
    // CFS quota is measured in two values:
    //  - cfs_period_us=100ms (the amount of time to measure usage across)
    //  - cfs_quota=20ms (the amount of cpu time allowed to be used across a period)
    // so in the above example, you are limited to 20% of a single CPU
    // for multi-cpu environments, you just scale equivalent amounts

    if milliCPU == 0 {
        return
    }

    // we set the period to 100ms by default
    period = QuotaPeriod

    // we then convert your milliCPU to a value normalized over a period
    quota = (milliCPU * QuotaPeriod) / MilliCPUToCPU

    // quota needs to be a minimum of 1ms.
    if quota < MinQuotaPeriod {
        quota = MinQuotaPeriod
    }

    return
}

const (
    // Taken from lmctfy https://github.com/google/lmctfy/blob/master/lmctfy/controllers/cpu_controller.cc
    MinShares     = 2
    SharesPerCPU  = 1024
    MilliCPUToCPU = 1000

    // 100000 is equivalent to 100ms
    QuotaPeriod    = 100000
    MinQuotaPeriod = 1000
)

The code that writes the settings:

func setSupportedSubsystems(cgroupConfig *libcontainerconfigs.Cgroup) error {
    for _, sys := range getSupportedSubsystems() {
        if _, ok := cgroupConfig.Paths[sys.Name()]; !ok {
            return fmt.Errorf("Failed to find subsystem mount for subsystem: %v", sys.Name())
        }
        // Calls the cgroup Set() methods shown earlier
        if err := sys.Set(cgroupConfig.Paths[sys.Name()], cgroupConfig); err != nil {
            return fmt.Errorf("Failed to set config for supported subsystems : %v", err)
        }
    }
    return nil
}

The code makes the mapping clear:

Request -> cpu.shares

Limit -> cpu.cfs_quota_us (CFS) or cpu.rt_runtime_us (RT)

QuotaPeriod is 100ms

So a request of 0.5 (0.5 * 1024 = 512) is equivalent to a request of 500m (500 * 1024 / 1000 = 512): cpu.shares is 512.

[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.shares 
512

A limit of 1 (1 * 1000 * 100000 / 1000 = 100000) is equivalent to a limit of 1000m (1000 * 100000 / 1000 = 100000): quota = 100000 and period = 100000.

[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.cfs_quota_us 
100000
[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.cfs_period_us
100000
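
These numbers can be reproduced with a small standalone sketch that restates the kubelet conversion formulas shown earlier (constants copied from above; the lower-case function copies and the main wrapper are mine, not kubelet code):

package main

import "fmt"

const (
    MinShares      = 2
    SharesPerCPU   = 1024
    MilliCPUToCPU  = 1000
    QuotaPeriod    = 100000 // 100ms
    MinQuotaPeriod = 1000   // 1ms
)

// Same formula as kubelet's MilliCPUToShares: request -> cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
    if milliCPU == 0 {
        return MinShares
    }
    shares := (milliCPU * SharesPerCPU) / MilliCPUToCPU
    if shares < MinShares {
        return MinShares
    }
    return shares
}

// Same formula as kubelet's MilliCPUToQuota: limit -> cpu.cfs_quota_us / cpu.cfs_period_us.
func milliCPUToQuota(milliCPU int64) (quota, period int64) {
    if milliCPU == 0 {
        return
    }
    period = QuotaPeriod
    quota = (milliCPU * QuotaPeriod) / MilliCPUToCPU
    if quota < MinQuotaPeriod {
        quota = MinQuotaPeriod
    }
    return
}

func main() {
    fmt.Println(milliCPUToShares(500)) // 512           -> cpu.shares for request "0.5"/"500m"
    fmt.Println(milliCPUToQuota(1000)) // 100000 100000 -> quota/period for limit "1"
    fmt.Println(milliCPUToQuota(100))  // 10000 100000  -> limit "100m" = 10% of one CPU
}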

At this point the whole chain, from the Kubernetes spec down to the cgroup files, has been traced.

Points worth noting

The CPU resource is measured in cpu units. One cpu, in Kubernetes, is equivalent to:

1 AWS vCPU
1 GCP Core
1 Azure vCore
1 Hyperthread on a bare-metal Intel processor with Hyperthreading

So according to the docs, a 4-core/8-thread CPU is 8 CPUs as far as Kubernetes is concerned, and a limit of "1" means one hyperthread's worth of time (quota equal to one full period), i.e. 1/8 of that machine.

If only a request is set and no limit, the pod may use as much of the node's CPU as is available when no other pods are competing for it. If the cluster administrator has configured a LimitRange, a pod without an explicit limit gets the default limit from that LimitRange.

References

  1. https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/#what-if-you-specify-a-container-s-request-but-not-its-limit
  2. https://segmentfault.com/a/1190000008323952
  3. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu