k8s-cgroup

Questions:

  1. In Kubernetes, what do CPU request/limit values of "0.5" and "100m" mean?
  2. With a 4-core/8-thread CPU, what does a CPU limit of "1" actually restrict you to?
  3. What is the practical difference between CPU request and limit?
  4. How does the cgroup enforce these limits?

Linux Cgroup

As is well known, Kubernetes and Docker limit CPU and memory usage in order to isolate those resources between workloads. Underneath, this is implemented with Linux cgroups. Memory is comparatively simple, so this article focuses on the cpu cgroup.

Within cgroups, the CPU-related subsystems are cpuset, cpuacct and cpu.

cpuset is mainly about CPU affinity: it can restrict the processes in a cgroup to run only on specified CPUs (or forbid them from running on specified CPUs), and it can also set memory node affinity. Affinity is only needed in fairly special situations, so it is not covered here.

cpuacct contains CPU usage statistics for the current cgroup. There is not much to it; read its documentation if you are interested. It is not covered here either.

This article covers only the cpu subsystem, including how to cap a cgroup's CPU usage and how to set its relative weight against other cgroups.

Creating a child cgroup

On Ubuntu, systemd has already mounted the cpu subsystem for us; we only need to create a subdirectory under the corresponding mount point.

# From the output below we can see that cpuset is mounted at /sys/fs/cgroup/cpuset,
# while cpu and cpuacct are mounted together at /sys/fs/cgroup/cpu,cpuacct
dev@ubuntu:~$ mount|grep cpu
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)

# Enter /sys/fs/cgroup/cpu,cpuacct and create a child cgroup
dev@ubuntu:~$ cd /sys/fs/cgroup/cpu,cpuacct
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct$ sudo mkdir test
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct$ cd test
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ ls
cgroup.clone_children cpuacct.stat cpuacct.usage_percpu cpu.cfs_quota_us cpu.stat tasks
cgroup.procs cpuacct.usage cpu.cfs_period_us cpu.shares notify_on_release

We only need to pay attention to the files whose names start with cpu.

cpu subsystem

The cpu subsystem allocates CPU time to cgroups. There are currently two scheduling policies:

  • Completely Fair Scheduler (CFS): divides CPU time into appropriate slices and hands them out to cgroups by proportion and weight. CFS supports both relative weights and absolute limits; this is the policy Kubernetes currently uses.
  • Real-Time scheduler (RT): similar to the absolute limits in CFS, but only for real-time tasks. Its parameters can be adjusted at runtime.

  1. Completely Fair Scheduler (CFS):
  • Absolute (hard) limit parameters:

    cpu.cfs_period_us & cpu.cfs_quota_us

cfs_period_us configures the length of the accounting period, and cfs_quota_us configures how much CPU time the cgroup may use within that period; together they set the CPU usage ceiling. Both files are in microseconds (us). cfs_period_us ranges from 1 millisecond (ms) to 1 second (s), and cfs_quota_us only needs to be at least 1 ms. If cfs_quota_us is -1 (the default), CPU time is not limited. A few examples:

1. Limit to 1 CPU (250ms of CPU time every 250ms)
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit to 2 CPUs (cores) (1000ms of CPU time every 500ms, i.e. two cores)
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms */

3. Limit to 20% of 1 CPU (10ms of CPU time every 50ms, i.e. 20% of a single core)
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
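
The pattern behind these examples is simply cpus = cfs_quota_us / cfs_period_us. A minimal Go sketch of that relationship (the helper names cfsQuotaForCPUs and cpusForQuota are mine, for illustration only):

package main

import "fmt"

// cfsQuotaForCPUs returns the cpu.cfs_quota_us value that grants `cpus`
// full CPUs for a given cpu.cfs_period_us value.
func cfsQuotaForCPUs(cpus float64, periodUs int64) int64 {
    return int64(cpus * float64(periodUs))
}

// cpusForQuota is the inverse: how many CPUs a quota/period pair allows.
func cpusForQuota(quotaUs, periodUs int64) float64 {
    return float64(quotaUs) / float64(periodUs)
}

func main() {
    fmt.Println(cfsQuotaForCPUs(2, 500000)) // 1000000, matches example 2
    fmt.Println(cpusForQuota(10000, 50000)) // 0.2, matches example 3
}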

cpu.stat:

contains the following three statistics:

  • nr_periods: how many accounting periods (of length cpu.cfs_period_us) have elapsed
  • nr_throttled: in how many of those periods the cgroup was throttled (i.e. its processes used up the quota before the period ended)
  • throttled_time: the total time for which the cgroup's processes have been throttled, in nanoseconds

  • Relative weight parameter: cpu.shares

cpu.shares sets the CPU weight as a relative value, and it applies across all CPUs (cores). The default is 1024. Suppose the system has two cgroups, A and B, with A's shares set to 1024 and B's to 512; then A gets 1024/(1024+512) = 66% of the CPU and B gets 33%. shares has two characteristics:

  • If A is idle and does not use its 66% of CPU time, the leftover CPU time is handed to B, so B's CPU usage can exceed 33%
  • If a new cgroup C with shares of 1024 is added, A's portion becomes 1024/(1024+512+1024) = 40% and B's becomes 20%

These two characteristics show that:

  • When the CPU is idle, shares has essentially no effect; it only kicks in when the CPU is busy, which is a good property.
  • A shares value on its own says nothing; it only yields a relative quota when compared against the values of the other cgroups. On a machine running many containers the set of cgroups keeps changing, so this relative quota changes too: you may set a high value, but someone else may set an even higher one. shares therefore cannot control CPU usage precisely. The short sketch right after this list illustrates the arithmetic.
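
A minimal sketch of that arithmetic (illustrative only; the shareOf helper is my own name, not a kernel or Kubernetes API):

package main

import "fmt"

// shareOf returns the fraction of CPU time a busy cgroup with `own` shares
// receives when competing with the other busy cgroups in `others`.
func shareOf(own int64, others ...int64) float64 {
    total := own
    for _, s := range others {
        total += s
    }
    return float64(own) / float64(total)
}

func main() {
    fmt.Printf("A vs B:    %.1f%%\n", 100*shareOf(1024, 512))       // 66.7%
    fmt.Printf("A vs B, C: %.1f%%\n", 100*shareOf(1024, 512, 1024)) // 40.0%
}
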
  2. Real-Time scheduler (RT):

cpu.rt_period_us: the accounting period, in us, analogous to the CFS period

cpu.rt_runtime_us: the longest CPU time the cgroup's real-time tasks may consume within one period, in us. For example, with rt_runtime_us = 200000 and rt_period_us = 1000000, on a node with 2 CPUs the cgroup may consume 2 * 0.2 = 0.4s of CPU time per second. This is also an absolute limit.
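
A tiny sketch of that calculation, under the same assumption as the example above that the runtime budget applies per CPU (rtBandwidth is a hypothetical helper of mine):

package main

import "fmt"

// rtBandwidth returns how many seconds of real-time CPU a cgroup may use per
// second of wall-clock time, assuming the runtime budget applies per CPU.
func rtBandwidth(runtimeUs, periodUs int64, nCPU int) float64 {
    return float64(nCPU) * float64(runtimeUs) / float64(periodUs)
}

func main() {
    fmt.Println(rtBandwidth(200000, 1000000, 2)) // 0.4
}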

Example run:

Taking CFS as the example.

# Keep using the child cgroup created above: test
# Limit it to 20% of one CPU's time
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 50000 > cpu.cfs_period_us"
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 10000 > cpu.cfs_quota_us"

# Add the current bash process to this cgroup
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ echo $$
5456
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ sudo sh -c "echo 5456 > cgroup.procs"

# Start a busy loop in this bash to burn CPU; unconstrained it would use 100% of one core
dev@ubuntu:/sys/fs/cgroup/cpu,cpuacct/test$ while :; do echo test > /dev/null; done

#-------------------------- open a new shell window ----------------------
# top shows PID 5456 at about 20% CPU, so the limit is working
# The system-wide %us+%sy is only about 10% because my test machine has two cores,
# so overall CPU usage is about 10%
dev@ubuntu:~$ top
Tasks: 139 total, 2 running, 137 sleeping, 0 stopped, 0 zombie
%Cpu(s): 5.6 us, 6.2 sy, 0.0 ni, 88.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 499984 total, 15472 free, 81488 used, 403024 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 383332 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5456 dev 20 0 22640 5472 3524 R 20.3 1.1 0:04.62 bash

# And now the throttling statistics can be seen
dev@ubuntu:~$ cat /sys/fs/cgroup/cpu,cpuacct/test/cpu.stat
nr_periods 1436
nr_throttled 1304
throttled_time 51542291833
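
As a small illustration of how these counters could be consumed programmatically, here is a hedged Go sketch that parses a cgroup v1 cpu.stat file and reports how often the cgroup was throttled (the path, function name and output format are my own choices):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// readCPUStat parses a cgroup v1 cpu.stat file into a map of counter values.
func readCPUStat(path string) (map[string]int64, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    stats := make(map[string]int64)
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text()) // e.g. "nr_throttled 1304"
        if len(fields) != 2 {
            continue
        }
        v, err := strconv.ParseInt(fields[1], 10, 64)
        if err != nil {
            continue
        }
        stats[fields[0]] = v
    }
    return stats, sc.Err()
}

func main() {
    stats, err := readCPUStat("/sys/fs/cgroup/cpu,cpuacct/test/cpu.stat")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    if stats["nr_periods"] > 0 {
        ratio := float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
        fmt.Printf("throttled in %.0f%% of periods, %.2fs in total\n",
            100*ratio, float64(stats["throttled_time"])/1e9)
    }
}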

Kubernetes resource control mechanism

Kubernetes uses runc as its runtime. In runc, each cgroup subsystem has a small manager (CpuGroup for the cpu subsystem, MemoryGroup for memory) with two methods: Apply(), which creates the cgroup and places the process into it, and Set(), which writes the resource settings. The principle is simply writing the appropriate entries and numbers into the files described above; there is not much more to it.

The memory Set():

func (s *MemoryGroup) Set(path string, cgroup *configs.Cgroup) error {
    if err := setMemoryAndSwap(path, cgroup); err != nil {
        return err
    }

    if cgroup.Resources.KernelMemory != 0 {
        if err := setKernelMemory(path, cgroup.Resources.KernelMemory); err != nil {
            return err
        }
    }

    if cgroup.Resources.MemoryReservation != 0 {
        if err := writeFile(path, "memory.soft_limit_in_bytes", strconv.FormatInt(cgroup.Resources.MemoryReservation, 10)); err != nil {
            return err
        }
    }

    if cgroup.Resources.KernelMemoryTCP != 0 {
        if err := writeFile(path, "memory.kmem.tcp.limit_in_bytes", strconv.FormatInt(cgroup.Resources.KernelMemoryTCP, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.OomKillDisable {
        if err := writeFile(path, "memory.oom_control", "1"); err != nil {
            return err
        }
    }
    if cgroup.Resources.MemorySwappiness == nil || int64(*cgroup.Resources.MemorySwappiness) == -1 {
        return nil
    } else if *cgroup.Resources.MemorySwappiness <= 100 {
        if err := writeFile(path, "memory.swappiness", strconv.FormatUint(*cgroup.Resources.MemorySwappiness, 10)); err != nil {
            return err
        }
    } else {
        return fmt.Errorf("invalid value:%d. valid memory swappiness range is 0-100", *cgroup.Resources.MemorySwappiness)
    }

    return nil
}

The memory Apply():

func (s *MemoryGroup) Apply(d *cgroupData) (err error) {
    path, err := d.path("memory")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    } else if path == "" {
        return nil
    }
    if memoryAssigned(d.config) {
        if _, err := os.Stat(path); os.IsNotExist(err) {
            if err := os.MkdirAll(path, 0755); err != nil {
                return err
            }
            // Only kernel memory can be configured dynamically
            if err := EnableKernelMemoryAccounting(path); err != nil {
                return err
            }
        }
    }
    ...
}

func EnableKernelMemoryAccounting(path string) error {
    // Check if kernel memory is enabled
    // We have to limit the kernel memory here as it won't be accounted at all
    // until a limit is set on the cgroup and limit cannot be set once the
    // cgroup has children, or if there are already tasks in the cgroup.
    for _, i := range []int64{1, -1} {
        if err := setKernelMemory(path, i); err != nil {
            return err
        }
    }
    return nil
}

The cpu Set():

func (s *CpuGroup) Set(path string, cgroup *configs.Cgroup) error {
    // Set the CFS parameters
    if cgroup.Resources.CpuShares != 0 {
        if err := writeFile(path, "cpu.shares", strconv.FormatUint(cgroup.Resources.CpuShares, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuPeriod != 0 {
        if err := writeFile(path, "cpu.cfs_period_us", strconv.FormatUint(cgroup.Resources.CpuPeriod, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuQuota != 0 {
        if err := writeFile(path, "cpu.cfs_quota_us", strconv.FormatInt(cgroup.Resources.CpuQuota, 10)); err != nil {
            return err
        }
    }
    // Set the RT parameters
    if err := s.SetRtSched(path, cgroup); err != nil {
        return err
    }

    return nil
}

func (s *CpuGroup) SetRtSched(path string, cgroup *configs.Cgroup) error {
    if cgroup.Resources.CpuRtPeriod != 0 {
        if err := writeFile(path, "cpu.rt_period_us", strconv.FormatUint(cgroup.Resources.CpuRtPeriod, 10)); err != nil {
            return err
        }
    }
    if cgroup.Resources.CpuRtRuntime != 0 {
        if err := writeFile(path, "cpu.rt_runtime_us", strconv.FormatInt(cgroup.Resources.CpuRtRuntime, 10)); err != nil {
            return err
        }
    }
    return nil
}

The cpu Apply() only sets the RT parameters (before moving the process in):

func (s *CpuGroup) Apply(d *cgroupData) error {
    // We always want to join the cpu group, to allow fair cpu scheduling
    // on a container basis
    path, err := d.path("cpu")
    if err != nil && !cgroups.IsNotFound(err) {
        return err
    }
    return s.ApplyDir(path, d.config, d.pid)
}

func (s *CpuGroup) ApplyDir(path string, cgroup *configs.Cgroup, pid int) error {
    // This might happen if we have no cpu cgroup mounted.
    // Just do nothing and don't fail.
    if path == "" {
        return nil
    }
    if err := os.MkdirAll(path, 0755); err != nil {
        return err
    }
    // We should set the real-Time group scheduling settings before moving
    // in the process because if the process is already in SCHED_RR mode
    // and no RT bandwidth is set, adding it will fail.
    if err := s.SetRtSched(path, cgroup); err != nil {
        return err
    }
    // because we are not using d.join we need to place the pid into the procs file
    // unlike the other subsystems
    if err := cgroups.WriteCgroupProc(path, pid); err != nil {
        return err
    }

    return nil
}

K8s Limits & Requests code analysis

The struct that manages cgroups in Kubernetes is defined in k8s.io/kubernetes/pkg/kubelet/cm/cgroup_manager_linux.go:

// cgroupManagerImpl implements the CgroupManager interface.
// Its a stateless object which can be used to
// update,create or delete any number of cgroups
// It uses the Libcontainer raw fs cgroup manager for cgroup management.
type cgroupManagerImpl struct {
    // subsystems holds information about all the
    // mounted cgroup subsystems on the node
    subsystems *CgroupSubsystems
    // simplifies interaction with libcontainer and its cgroup managers
    adapter *libcontainerAdapter
}

Let's look at the Update method:

// Update updates the cgroup with the specified Cgroup Configuration
func (m *cgroupManagerImpl) Update(cgroupConfig *CgroupConfig) error {
    ...
    // Extract the cgroup resource parameters
    resourceConfig := cgroupConfig.ResourceParameters
    resources := m.toResources(resourceConfig)

    cgroupPaths := m.buildCgroupPaths(cgroupConfig.Name)

    // Work out the cgroupfs location
    abstractCgroupFsName := string(cgroupConfig.Name)
    abstractParent := CgroupName(path.Dir(abstractCgroupFsName))
    abstractName := CgroupName(path.Base(abstractCgroupFsName))

    driverParent := m.adapter.adaptName(abstractParent, false)
    driverName := m.adapter.adaptName(abstractName, false)

    // Get the absolute name when the systemd cgroup driver is used
    if m.adapter.cgroupManagerType == libcontainerSystemd {
        driverName = m.adapter.adaptName(cgroupConfig.Name, false)
    }

    // Initialize the libcontainer cgroup config
    libcontainerCgroupConfig := &libcontainerconfigs.Cgroup{
        Name:      driverName,
        Parent:    driverParent,
        Resources: resources,
        Paths:     cgroupPaths,
    }

    // Write the cgroup configuration
    if err := setSupportedSubsystems(libcontainerCgroupConfig); err != nil {
        return fmt.Errorf("failed to set supported cgroup subsystems for cgroup %v: %v", cgroupConfig.Name, err)
    }
    return nil
}

First, let's see how the parameters are extracted:

func (m *cgroupManagerImpl) toResources(resourceConfig *ResourceConfig) *libcontainerconfigs.Resources {
    resources := &libcontainerconfigs.Resources{}
    if resourceConfig == nil {
        return resources
    }
    if resourceConfig.Memory != nil {
        resources.Memory = *resourceConfig.Memory
    }
    if resourceConfig.CpuShares != nil {
        resources.CpuShares = *resourceConfig.CpuShares
    }
    if resourceConfig.CpuQuota != nil {
        resources.CpuQuota = *resourceConfig.CpuQuota
    }
    if resourceConfig.CpuPeriod != nil {
        resources.CpuPeriod = *resourceConfig.CpuPeriod
    }

    // huge page parameter extraction, skipped
    if utilfeature.DefaultFeatureGate.Enabled(kubefeatures.HugePages) {
        ...
    }
    return resources
}

Next, let's see where CpuShares, CpuQuota and CpuPeriod come from:

// ResourceConfigForPod takes the input pod and outputs the cgroup resource config.
func ResourceConfigForPod(pod *v1.Pod) *ResourceConfig {
    // sum requests and limits.
    reqs, limits := resource.PodRequestsAndLimits(pod)

    cpuRequests := int64(0)
    cpuLimits := int64(0)
    memoryLimits := int64(0)
    if request, found := reqs[v1.ResourceCPU]; found {
        cpuRequests = request.MilliValue()
    }
    if limit, found := limits[v1.ResourceCPU]; found {
        cpuLimits = limit.MilliValue()
    }
    if limit, found := limits[v1.ResourceMemory]; found {
        memoryLimits = limit.Value()
    }

    // convert to CFS values
    cpuShares := MilliCPUToShares(cpuRequests)
    cpuQuota, cpuPeriod := MilliCPUToQuota(cpuLimits)
    ...
}

// MilliCPUToShares converts the milliCPU to CFS shares.
func MilliCPUToShares(milliCPU int64) uint64 {
    if milliCPU == 0 {
        // Docker converts zero milliCPU to unset, which maps to kernel default
        // for unset: 1024. Return 2 here to really match kernel default for
        // zero milliCPU.
        return MinShares
    }
    // Conceptually (milliCPU / milliCPUToCPU) * sharesPerCPU, but factored to improve rounding.
    shares := (milliCPU * SharesPerCPU) / MilliCPUToCPU
    if shares < MinShares {
        return MinShares
    }
    return uint64(shares)
}

// MilliCPUToQuota converts milliCPU to CFS quota and period values.
func MilliCPUToQuota(milliCPU int64) (quota int64, period uint64) {
    // CFS quota is measured in two values:
    //  - cfs_period_us=100ms (the amount of time to measure usage across)
    //  - cfs_quota=20ms (the amount of cpu time allowed to be used across a period)
    // so in the above example, you are limited to 20% of a single CPU
    // for multi-cpu environments, you just scale equivalent amounts

    if milliCPU == 0 {
        return
    }

    // we set the period to 100ms by default
    period = QuotaPeriod

    // we then convert your milliCPU to a value normalized over a period
    quota = (milliCPU * QuotaPeriod) / MilliCPUToCPU

    // quota needs to be a minimum of 1ms.
    if quota < MinQuotaPeriod {
        quota = MinQuotaPeriod
    }

    return
}

const (
    // Taken from lmctfy https://github.com/google/lmctfy/blob/master/lmctfy/controllers/cpu_controller.cc
    MinShares     = 2
    SharesPerCPU  = 1024
    MilliCPUToCPU = 1000

    // 100000 is equivalent to 100ms
    QuotaPeriod    = 100000
    MinQuotaPeriod = 1000
)

The code that writes the settings:

func setSupportedSubsystems(cgroupConfig *libcontainerconfigs.Cgroup) error {
    for _, sys := range getSupportedSubsystems() {
        if _, ok := cgroupConfig.Paths[sys.Name()]; !ok {
            return fmt.Errorf("Failed to find subsystem mount for subsystem: %v", sys.Name())
        }
        // Calls the cgroup Set() methods shown earlier
        if err := sys.Set(cgroupConfig.Paths[sys.Name()], cgroupConfig); err != nil {
            return fmt.Errorf("Failed to set config for supported subsystems : %v", err)
        }
    }
    return nil
}

The code makes the mapping clear:

Request -> cpu.shares

Limit -> cpu.cfs_quota_us (CFS) or cpu.rt_runtime_us (RT)

QuotaPeriod is 100ms

So a request of 0.5 (0.5 * 1024 = 512) is equivalent to a request of 500m (500 * 1024 / 1000 = 512): cpu.shares is 512.

[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.shares 
512

A limit of 1 (1 * 1000 * 100000 / 1000 = 100000) is equivalent to a limit of 1000m (1000 * 100000 / 1000 = 100000): quota = 100000 and period = 100000.

[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.cfs_quota_us 
100000
[root@tdc-tester04 ~]# cat /sys/fs/cgroup/cpu/kubepods/burstable/pod4de91174-9002-11e8-a663-ac1f6b83dd66/1cbc69fd7756efdcba38d11a57acab5cdbe0b9b2e5eef2a9e86e3c6c8850b1a1/cpu.cfs_period_us
100000
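
These numbers can be reproduced with a small standalone sketch that restates the kubelet conversion formulas shown earlier (constants copied from above; the lower-case function copies and the main wrapper are mine, not kubelet code):

package main

import "fmt"

const (
    MinShares      = 2
    SharesPerCPU   = 1024
    MilliCPUToCPU  = 1000
    QuotaPeriod    = 100000 // 100ms
    MinQuotaPeriod = 1000   // 1ms
)

// Same formula as kubelet's MilliCPUToShares: request -> cpu.shares.
func milliCPUToShares(milliCPU int64) int64 {
    if milliCPU == 0 {
        return MinShares
    }
    shares := (milliCPU * SharesPerCPU) / MilliCPUToCPU
    if shares < MinShares {
        return MinShares
    }
    return shares
}

// Same formula as kubelet's MilliCPUToQuota: limit -> cpu.cfs_quota_us / cpu.cfs_period_us.
func milliCPUToQuota(milliCPU int64) (quota, period int64) {
    if milliCPU == 0 {
        return
    }
    period = QuotaPeriod
    quota = (milliCPU * QuotaPeriod) / MilliCPUToCPU
    if quota < MinQuotaPeriod {
        quota = MinQuotaPeriod
    }
    return
}

func main() {
    fmt.Println(milliCPUToShares(500)) // 512           -> cpu.shares for request "0.5"/"500m"
    fmt.Println(milliCPUToQuota(1000)) // 100000 100000 -> quota/period for limit "1"
    fmt.Println(milliCPUToQuota(100))  // 10000 100000  -> limit "100m" = 10% of one CPU
}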

At this point the whole chain, from the Kubernetes spec down to the cgroup files, has been traced.

Points worth noting

The CPU resource is measured in cpu units. One cpu, in Kubernetes, is equivalent to:

1 AWS vCPU
1 GCP Core
1 Azure vCore
1 Hyperthread on a bare-metal Intel processor with Hyperthreading

So according to the docs, a 4-core/8-thread CPU is 8 CPUs as far as Kubernetes is concerned, and a limit of "1" means one hyperthread's worth of time (quota equal to one full period), i.e. 1/8 of that machine.

If only a request is set and no limit, the pod may use as much of the node's CPU as is available when no other pods are competing for it. If the cluster administrator has configured a LimitRange, a pod without an explicit limit gets the default limit from that LimitRange.

References

  1. https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/#what-if-you-specify-a-container-s-request-but-not-its-limit
  2. https://segmentfault.com/a/1190000008323952
  3. https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu