BPF CPU Usage High Issue: Root Cause Analysis and Fix


Objective

Problem Statement

https://docs.google.com/document/d/1HlvGBoT8gL3LToCIB8KH88EG4toh7YiiQnNApNyosOo/edit

https://docs.google.com/document/d/1ibjJWueVClKet0by0fWbFHz2vEfj3viL5ss8byJuuA4/edit#heading=h.ubx3xrz6e41r

We have hit the BPF CPU high issue many times after the Cilium conntrack table became full. It impacts L4/L7 traffic reliability and also undermines confidence in Cilium's reliability and the Cilium rollout on tlb/gateway nodes.

We have taken some measures to keep the conntrack table from filling up. However, if the load increases further, or if garbage collection is delayed for some reason, the conntrack table can still become full and trigger this issue. Therefore, we aim to fix this problem at its root.
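As a back-of-envelope check of why GC alone cannot save us (all numbers below are illustrative assumptions, not measurements; the CT map size and GC interval are both configurable in Cilium):

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions, not measured values:
	const ctMapEntries = 512 * 1024 // assumed CT map capacity
	const newConnsPerSec = 50_000   // assumed short-connection rate on a busy gateway
	const gcIntervalSec = 30        // assumed conntrack GC interval

	secondsToFill := ctMapEntries / newConnsPerSec
	fmt.Printf("table fills in ~%ds; GC runs every %ds\n", secondsToFill, gcIntervalSec)
	// With these assumptions the table fills (~10s) faster than GC reclaims
	// entries (every 30s), so a load spike or a delayed GC cycle leaves it full.
}
```

Under these assumed numbers the table fills roughly three times faster than GC runs, which is why the fix has to address the full-table behavior itself rather than only tuning GC.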

Reproduce Steps

Reproduced on tess834 (concurrent fortio clients with short connections + ingressgateway/multi-vips + trafficobserver).

Ingressgateway

Key Point: Uniform high-concurrency softirq handling

istio-ingressgateway-865c7b78b6-htfv5 2/2 Running 10 (2d20h ago) 20d 10.163.84.64 tess-node-8h7nt-tess834.stratus.lvs.ebay.com <none> <none>

Configurations for gateway pod:

  • Best-effort pod
  • Dedicated node, or move all client and server loads off this node
  • 50 workers

Configurations for this gateway node (a tlb node needs a similar configuration):

  • Enlarge the netfilter conntrack settings so they do not become the bottleneck.

```bash
sysctl -w net.netfilter.nf_conntrack_max=22000000
sysctl -w net.netfilter.nf_conntrack_buckets=5500416
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=3600
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_fin_wait=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_last_ack=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_recv=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_syn_sent=30
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
```
  • Enable irqbalance on this node to spread the network handling evenly: add `IRQBALANCE_ARGS="--policyscript=/etc/sysconfig/policyscript.sh -e 1"` into /etc/default/irqbalance, then run `systemctl restart irqbalance`.
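A quick sanity check on the conntrack sysctl values above: the kernel's conventional sizing is nf_conntrack_max = nf_conntrack_buckets * 4 (an average hash-chain length of ~4 entries per bucket), and the numbers we chose follow that ratio:

```go
package main

import "fmt"

func main() {
	// Values taken from the sysctl commands above.
	const nfConntrackMax = 22_000_000
	const nfConntrackBuckets = 5_500_416

	// Kernel convention: max = buckets * 4, keeping the average
	// hash-chain length around 4 entries per bucket.
	ratio := float64(nfConntrackMax) / float64(nfConntrackBuckets)
	fmt.Printf("average entries per bucket: %.2f\n", ratio)
}
```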

Vips:

```yaml
apiVersion: apps.tess.io/v1alpha3
kind: AccessPoint
metadata:
  annotations:
    gateway.network.tess.io/bind-vip: "true"
    gateway.network.tess.io/cert-provider: protego
    network.tess.io/init-clusterids: "834"
  labels:
    foo: bar
  name: fortioserver-ap-
  namespace: cilium-fortio-lnp
spec:
  accessPoints:
  - name: ap-834-
    scopeIDs:
    - "834"
    scopeType: Cluster
    strategies:
      circuitBreakerStrategy: {}
      loadBalancingStrategy:
        loadBalancerMethod: {}
      placement: {}
      rollout: {}
      slowStartStrategy: {}
      trafficShift:
        healthSelector: {}
        workloadSelector: {}
    traffic:
      gateways:
      - apiVersion: networking.istio.io/v1beta1
        kind: Gateway
        metadata:
          annotations:
            network.tess.io/lbprovider: tlb-ipvs
            network.tess.io/skip-proxy: "true"
          creationTimestamp: null
          name: fortioserver-gw-
        spec:
          selector:
            istio: ingressgateway-production-01
          servers:
          - hosts:
            - cilium-fortio-lnp-fortioserver-gw-.istio-production.svc.834.tess.io
            port:
              name: http
              number: 80
              protocol: HTTP
        status: {}
      services:
      - apiVersion: v1
        kind: Service
        metadata:
          creationTimestamp: null
          name: fortioserver-
        spec:
          ports:
          - name: http-echo
            port: 8080
            protocol: TCP
            targetPort: 0
          selector:
            app: fortioserver-
          type: ClusterIP
        status:
          loadBalancer: {}
      virtualServices:
      - apiVersion: networking.istio.io/v1beta1
        kind: VirtualService
        metadata:
          creationTimestamp: null
          name: fortioserver-vs-
        spec:
          gateways:
          - fortioserver-gw-
          hosts:
          - cilium-fortio-lnp-fortioserver-gw-.istio-production.svc.834.tess.io
          http:
          - match:
            - uri:
                prefix: /
            route:
            - destination:
                host: fortioserver-
                port:
                  number: 8080
        status: {}
  applicationInstanceRef:
    kind: ApplicationInstance
    name: cilium-fortio-lnp
    namespace: cilium-fortio-lnp
```

Fortio Clients

Key Point: Continuous high-concurrency new connection establishment

Configuration for client pods:

  • Fortio clients
  • 80 replicas
  • Best-effort pods, scheduled away from the gateway node (make sure client CPU is not the bottleneck)
  • Start 240 fortio clients in total against 3 vips across the 80 client pods.
```bash
#!/usr/bin/env bash
for ip in $(cat ./ips); do
  for pod in $(cat ./pods); do
    nohup kubectl --kubeconfig /var/lib/kube-secrets-manager/kubeconfig \
      exec -n cilium-fortio-lnp $pod -- \
      fortio load -a -c 300 -qps 1000 -t 100000s \
      -allow-initial-errors -log-errors=false http://$ip:80 > /dev/null 2>&1 &
  done
done
```

Using VIP IPs instead of the FQDN turns the client load into short connections: the FQDN is what the gateway rule matches, so a request addressed to the bare VIP IP returns 404, and the fortio command above starts a new connection for the next request.

TrafficObserver

Key Point: Continuous batch lookups via the BPF syscall that lock the hash buckets

Start up trafficobserver in file mode.

```yaml
- args:
  - /trafficobserver
  - --mode=file
  - --filepath='/dev/null'
  - --v=4
  env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: GOGC
    value: "50"
  image: hub.tess.io/yingnzhang/trafficobserver:file
```

This trafficobserver image includes the fix for file mode (https://github.corp.ebay.com/ebaytess/cloudmesh/pull/108):

```diff
diff --git a/trafficobserver/pkg/exporter/fileexporter.go b/trafficobserver/pkg/exporter/fileexporter.go
index 2b7d952..b690b20 100644
--- a/trafficobserver/pkg/exporter/fileexporter.go
+++ b/trafficobserver/pkg/exporter/fileexporter.go
@@ -17,7 +17,6 @@ type FileExporter struct {
 	ConvertToJson bool
 	FileLocation  string
 	Unique        bool
-	Interval      time.Duration
 }
 
 type Output struct {
@@ -29,7 +28,7 @@ type Output struct {
 func (e *FileExporter) Export() error {
 	// Create a ticker that ticks every five minutes
-	ticker := time.NewTicker(e.Interval)
+	ticker := time.NewTicker(DefaultInterval)
 	// Create a map to track unique lines
 	uniqueLines := make(map[string]bool)
 	var linesToWrite []string
@@ -41,6 +40,8 @@ func (e *FileExporter) Export() error {
 	// Run the command periodically
 	bpfMapReader, _ := reader.New([]string{})
+	// start the reader
+	go bpfMapReader.Run()
 	for range ticker.C {
 		linesToWrite, err = bpfMapReader.ReadCtMap()
 		if err != nil {
```

Reproduce result