Kubernetes Pod Network Bandwidth Traffic Control

Background

In hybrid-cloud scenarios, business Pods can interfere with one another. In online/offline co-location (online and offline services serving users from the same machine) and similar scenarios, beyond isolating CPU, memory, fd, inode, and PID resources, we also need to isolate and limit network bandwidth, disk read/write speed (IOPS), NBD IO, L3 cache, memory bandwidth (MBA), and so on.

This chapter therefore introduces the usage and implementation of network bandwidth limiting.

Usage and implementation in Kubernetes

CNI plugin

A container is brought up by the runtime calling the underlying CNI network plugin through the runtime interface to create the virtual network and bind it to the container. Limiting a container's network therefore depends on the CNI plugin: the plugin turns the limit settings into concrete configuration for the Linux Traffic Control (tc) subsystem. tc comprises a set of mechanisms and operations by which packets are queued for transmission/reception on a network interface; rate limiting is implemented with the Token Bucket Filter (TBF) queueing discipline.

CNI operations on Linux TC

{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0",  # the bandwidth plugin requires cniVersion 0.3.0 or later
  "plugins":
    [
      {
        "type": "calico",
        "log_level": "info",
        "datastore_type": "kubernetes",
        "nodename": "127.0.0.1",
        "ipam": { "type": "host-local", "subnet": "usePodCidr" },
        "policy": { "type": "k8s" },
        "kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
      },
      {
        "type": "bandwidth",
        "capabilities": {
          "bandwidth": true   # lets the runtime (cri-o etc.) submit the limits as JSON
        }
        /* Alternatively, static rate limits for the CNI plugin; "capabilities"
           and the four fields below are mutually exclusive:
        "ingressRate": 123,
        "ingressBurst": 456,
        "egressRate": 123,
        "egressBurst": 456
        */
      }
    ]
}

The CNI plugin supports this static configuration, and also supports runtimes such as CRI-O, containerd, and dockershim submitting the limits as a JSON configuration:
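
For reference, here is a sketch of the structures the plugin decodes from stdin, modeled on plugins/meta/bandwidth in the containernetworking/plugins repository (abbreviated; runtime-injected values take precedence over the static fields):

type BandwidthEntry struct {
	IngressRate  uint64 `json:"ingressRate"`  // bits per second
	IngressBurst uint64 `json:"ingressBurst"` // bits
	EgressRate   uint64 `json:"egressRate"`   // bits per second
	EgressBurst  uint64 `json:"egressBurst"`  // bits
}

type PluginConf struct {
	types.NetConf

	// Static limits from the CNI config file.
	*BandwidthEntry

	// Limits injected by the runtime when the "bandwidth" capability
	// is enabled.
	RuntimeConfig struct {
		Bandwidth *BandwidthEntry `json:"bandwidth,omitempty"`
	} `json:"runtimeConfig,omitempty"`
}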

func cmdAdd(args *skel.CmdArgs) error {
	// Parse the CNI configuration passed in on stdin
	conf, err := parseConfig(args.StdinData)
	if err != nil {
		return err
	}
    
	//...
	
	// Read the ingress rate and burst from the configuration
	if bandwidth.IngressRate > 0 && bandwidth.IngressBurst > 0 {
		// Create the rate-limiting rule as a TC TBF qdisc
		err = CreateIngressQdisc(bandwidth.IngressRate, bandwidth.IngressBurst, hostInterface.Name)
		if err != nil {
			return err
		}
	}
	
	// Read the egress rate and burst from the configuration
	if bandwidth.EgressRate > 0 && bandwidth.EgressBurst > 0 {
		// ...
		
		// Set the egress rule on a dedicated local device (IFB)
		err = CreateEgressQdisc(bandwidth.EgressRate, bandwidth.EgressBurst, hostInterface.Name, ifbDeviceName)
		if err != nil {
			return err
		}
	}

	return types.PrintResult(result, conf.CNIVersion)
}
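
CreateIngressQdisc boils down to attaching a TBF qdisc to the host-side veth. Below is a condensed sketch of that logic, assuming the github.com/vishvananda/netlink library the plugin itself uses; the buffer/limit math is simplified compared to the real implementation:

// Roughly: tc qdisc add dev <link> root tbf rate <rate> burst <burst>
func createTBF(rateInBits, burstInBits uint64, linkIndex int) error {
	if rateInBits == 0 || burstInBits == 0 {
		return fmt.Errorf("invalid rate or burst: %d/%d", rateInBits, burstInBits)
	}
	// TBF works in bytes; the CNI config is in bits.
	rateInBytes := rateInBits / 8
	burstInBytes := burstInBits / 8
	qdisc := &netlink.Tbf{
		QdiscAttrs: netlink.QdiscAttrs{
			LinkIndex: linkIndex,
			Handle:    netlink.MakeHandle(1, 0),
			Parent:    netlink.HANDLE_ROOT,
		},
		Rate:   rateInBytes,          // bytes per second
		Buffer: uint32(burstInBytes), // bucket size (the real code derives this from latency)
		Limit:  uint32(burstInBytes), // queue limit (simplified here)
	}
	return netlink.QdiscAdd(qdisc)
}

// Egress: TBF can only shape traffic *leaving* an interface, so
// CreateEgressQdisc first creates an IFB device, redirects everything
// arriving on the host veth to it (tc ingress qdisc + mirred filter),
// and then attaches the same kind of TBF qdisc to the IFB device:
//
//   pod eth0 -> host veth (ingress + mirred redirect) -> ifb device (tbf)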

Pod rate-limit configuration

Configured through Pod annotations:

apiVersion: v1
kind: Pod
metadata:
  name: iperf-slow
  annotations:
    kubernetes.io/ingress-bandwidth: 10M
    kubernetes.io/egress-bandwidth: 10M
...

Kubernetes parses and consumes these pod annotations in code.

The kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth values only support the range 1k to 1P; values above 32G additionally require kernel parameter tuning.

// Configured values must lie between 1k and 1P
var minRsrc = resource.MustParse("1k")  
var maxRsrc = resource.MustParse("1P")

// Extract the bandwidth settings from the pod annotations; the values are later passed on to the runtime
func ExtractPodBandwidthResources(podAnnotations map[string]string) (ingress, egress *resource.Quantity, err error) {
	if podAnnotations == nil {
		return nil, nil, nil
	}
	str, found := podAnnotations["kubernetes.io/ingress-bandwidth"]
	if found {
		ingressValue, err := resource.ParseQuantity(str)
		if err != nil {
			return nil, nil, err
		}
		ingress = &ingressValue
		if err := validateBandwidthIsReasonable(ingress); err != nil {
			return nil, nil, err
		}
	}
	str, found = podAnnotations["kubernetes.io/egress-bandwidth"]
	if found {
		egressValue, err := resource.ParseQuantity(str)
		if err != nil {
			return nil, nil, err
		}
		egress = &egressValue
		if err := validateBandwidthIsReasonable(egress); err != nil {
			return nil, nil, err
		}
	}
	return ingress, egress, nil
}
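
The validateBandwidthIsReasonable helper referenced above enforces the 1k-1P range; this matches pkg/util/bandwidth in the Kubernetes source, lightly abbreviated:

func validateBandwidthIsReasonable(rsrc *resource.Quantity) error {
	if rsrc.Value() < minRsrc.Value() {
		return fmt.Errorf("resource is unreasonably small (< 1kbit)")
	}
	if rsrc.Value() > maxRsrc.Value() {
		return fmt.Errorf("resource is unreasonably large (> 1Pbit)")
	}
	return nil
}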

Taking containerd as an example: after the kubelet obtains the pod YAML, it hands the information to the containerd runtime, which passes it on to the CNI plugin.

func cniNamespaceOpts(id string, config *runtime.PodSandboxConfig) ([]cni.NamespaceOpts, error) {
	opts := []cni.NamespaceOpts{
		cni.WithLabels(toCNILabels(id, config)),
		cni.WithCapability(annotations.PodAnnotations, config.Annotations),
	}

	portMappings := toCNIPortMappings(config.GetPortMappings())
	if len(portMappings) > 0 {
		opts = append(opts, cni.WithCapabilityPortMap(portMappings))
	}

	// Read the bandwidth settings from the pod annotations and pass them on to CNI
	bandWidth, err := toCNIBandWidth(config.Annotations)
	if err != nil {
		return nil, err
	}
	if bandWidth != nil {
		opts = append(opts, cni.WithCapabilityBandWidth(*bandWidth))
	}
	// ...
}
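
The toCNIBandWidth helper converts the annotation quantities into the CNI bandwidth capability; the following sketch is based on containerd's CRI plugin, with error wrapping abbreviated:

func toCNIBandWidth(annotations map[string]string) (*cni.BandWidth, error) {
	ingress, egress, err := bandwidth.ExtractPodBandwidthResources(annotations)
	if err != nil {
		return nil, fmt.Errorf("reading pod bandwidth annotations: %w", err)
	}
	if ingress == nil && egress == nil {
		return nil, nil
	}
	bandWidth := &cni.BandWidth{}
	if ingress != nil {
		// Annotation value is bits/sec; the burst is left effectively unlimited.
		bandWidth.IngressRate = uint64(ingress.Value())
		bandWidth.IngressBurst = math.MaxUint32
	}
	if egress != nil {
		bandWidth.EgressRate = uint64(egress.Value())
		bandWidth.EgressBurst = math.MaxUint32
	}
	return bandWidth, nil
}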

Verification and testing

**Rate limiting depends on the Linux TC subsystem; currently only Linux Kubernetes clusters are supported.**

apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-server-deployment
  labels:
    app: iperf-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-server
  template:
    metadata:
      labels:
        app: iperf-server
      # add the bandwidth annotations
      annotations:
        kubernetes.io/ingress-bandwidth: 1M
        kubernetes.io/egress-bandwidth: 1M
    spec:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
      - name: iperf3-server
        image: dongjiang1989/iperf
        args: ['-s', '-p', '5001']
        ports:
        - containerPort: 5001
          name: server
      terminationGracePeriodSeconds: 0

---
    
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-client
  labels:
    app: iperf-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-client
  template:
    metadata:
      labels:
        app: iperf-client
    spec:
      containers:
      - name: iperf-client
        image: dongjiang1989/iperf
        command: ['/bin/sh', '-c', 'sleep 1d']
      terminationGracePeriodSeconds: 0

Without the rate-limit annotations:

$ kubectl get pod | grep iperf 
iperf-client-7874c47d95-t7hph              1/1     Running   0               5m58s
iperf-server-deployment-74d94bdd59-dzdl4   1/1     Running   0               5m58s
$ kubectl exec iperf-client-7874c47d95-t7hph -- iperf -c 10.1.0.173 -p 5001 -i 10 -t 100
------------------------------------------------------------
Client connecting to 10.1.0.173, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  1] local 10.1.0.172 port 56296 connected with 10.1.0.173 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.00 sec  19.7 GBytes  16.9 Gbits/sec
[  1] 10.00-20.00 sec  18.9 GBytes  16.2 Gbits/sec
[  1] 20.00-30.00 sec  20.0 GBytes  17.2 Gbits/sec
[  1] 30.00-40.00 sec  20.4 GBytes  17.5 Gbits/sec
[  1] 40.00-50.00 sec  18.5 GBytes  15.9 Gbits/sec
[  1] 50.00-60.00 sec  19.3 GBytes  16.5 Gbits/sec
[  1] 60.00-70.00 sec  17.6 GBytes  15.1 Gbits/sec
[  1] 70.00-80.00 sec  17.1 GBytes  14.7 Gbits/sec
[  1] 80.00-90.00 sec  18.4 GBytes  15.8 Gbits/sec
[  1] 90.00-100.00 sec  15.1 GBytes  13.0 Gbits/sec
[  1] 0.00-100.00 sec   185 GBytes  15.9 Gbits/sec

With no rate limiting in place, bandwidth reaches 15.9 Gbits/sec.

With the rate-limit annotations:

$ kubectl get pod | grep iperf
iperf-clients-rcsh6                        1/1     Running   0          7h7m
iperf-server-deployment-59675c8f78-g52pm   1/1     Running   0          6h52m

$ kubectl exec iperf-clients-rcsh6 -- iperf -c 10.1.0.170 -p 5001 -i 10 -t 100
------------------------------------------------------------
Client connecting to 10.1.0.170, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[  1] local 10.1.0.170 port 54652 connected with 10.1.0.170 port 5001
[ ID] Interval       Transfer     Bandwidth
[  1] 0.00-10.00 sec  3.50 MBytes  2.94 Mbits/sec
[  1] 10.00-20.00 sec  2.25 MBytes  1.89 Mbits/sec
[  1] 20.00-30.00 sec  2.04 MBytes  1.71 Mbits/sec
[  1] 30.00-40.00 sec   892 KBytes   731 Kbits/sec
[  1] 40.00-50.00 sec   954 KBytes   781 Kbits/sec
[  1] 50.00-60.00 sec  1.36 MBytes  1.14 Mbits/sec
[  1] 60.00-70.00 sec  1.18 MBytes   993 Kbits/sec
[  1] 70.00-80.00 sec  87.1 KBytes  71.4 Kbits/sec
[  1] 80.00-90.00 sec  0.000 Bytes  0.000 bits/sec
[  1] 90.00-100.00 sec  2.97 MBytes  2.50 Mbits/sec
[  1] 0.00-100.69 sec  15.5 MBytes  1.29 Mbits/sec

With a 1 Mbits/sec limit, the measured throughput is 1.29 Mbits/sec.

  • Why is the measured rate slightly above the 1 Mbits/sec limit?
  • Reason: in Linux, 1M = 1024k, while the Kubernetes Resource object treats 1M = 1000k (see the snippet below).
  • A configured 1 Mbits/sec therefore behaves in Linux as 1024*1024 / (1000*1000) ≈ 1.048 Mbits/sec.
  • During the first second (0-1s), TC is not yet accurate, which inflates the averaged result.
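
The difference is easy to verify with the apimachinery resource package that Kubernetes uses for these annotations: the "M" suffix is decimal (SI), while "Mi" is the binary suffix:

q := resource.MustParse("1M")   // Kubernetes Quantity, decimal SI suffix
fmt.Println(q.Value())          // 1000000
qi := resource.MustParse("1Mi") // binary suffix
fmt.Println(qi.Value())         // 1048576
// Ratio between the two interpretations: 1048576 / 1000000 ≈ 1.048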

Summary

    1. Docker has supported passing the JSON through to the runc runtime since 1.18; with containerd as the runtime, version 1.4 or later is required.
    2. Calico requires version 2.1, Cilium requires 1.12.90, and kube-ovn requires 1.9.0; note that kube-ovn uses its own annotations:
     `ovn.kubernetes.io/ingress_rate`: rate limit for ingress traffic, in Mbits/s
     `ovn.kubernetes.io/egress_rate`: rate limit for egress traffic, in Mbits/s

    3. The bandwidth limits in the annotations cannot be updated dynamically; after changing them, the pod must be deleted and recreated.

Therefore, a webhook is needed to bring these limits in line with namespace-level LimitRange semantics, including support for filling in defaults.

Implementation approach

First, describe the namespace-level LimitRange extension with a CRD.

The design is as follows:

apiVersion: custom.xxx.com/v1
kind: CustomLimitRange
metadata:
  name: test-rangelimit
spec:
  limitrange:
    type: pod      # restricts pods; may later be extended to container, ingress, and service types
    max:           # max and min are the upper and lower bounds; if a pod's own value falls outside them, the ValidatingAdmissionWebhook rejects it
      ingress-bandwidth: "1G"
      egress-bandwidth: "1G"
    min:
      ingress-bandwidth: "10M"
      egress-bandwidth: "10M"
    default:          # if default is set and the pod annotations are empty, the MutatingAdmissionWebhook injects these values; without a default, nothing is injected
      ingress-bandwidth: "128M"
      egress-bandwidth: "128M"

A pod can set `customlimitrange.kubernetes.io/limited: disable` to opt out of the CustomLimitRange restrictions in its namespace.

Note that validation of the CustomLimitRange object itself is essential (a webhook sketch follows the list):

  • max value >= default value >= min value
  • values must lie in the range [1k, 1P], with units among Kbits/sec, Mbits/sec, Gbits/sec, Tbits/sec, and Pbits/sec
  • type must be one of the enumerated values
  • max, min, and default may each be omitted
  • internal adaptation: the kube-ovn annotations
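
A minimal sketch of the two webhook paths, using hypothetical names (CustomLimitRangeSpec mirrors the CRD above; applyDefaults backs the MutatingAdmissionWebhook, validatePod the ValidatingAdmissionWebhook, using k8s.io/api/core/v1 and apimachinery's resource package); real handlers would wrap these in AdmissionReview plumbing:

// Hypothetical Go mapping of the CRD spec shown above.
type CustomLimitRangeSpec struct {
	Max, Min, Default map[string]string
}

const (
	ingressKey = "kubernetes.io/ingress-bandwidth"
	egressKey  = "kubernetes.io/egress-bandwidth"
	optOutKey  = "customlimitrange.kubernetes.io/limited"
)

// applyDefaults (mutating path): inject defaults when the pod sets nothing.
func applyDefaults(pod *corev1.Pod, clr *CustomLimitRangeSpec) {
	if pod.Annotations[optOutKey] == "disable" || clr.Default == nil {
		return // pod opted out, or no default defined
	}
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	if _, ok := pod.Annotations[ingressKey]; !ok {
		pod.Annotations[ingressKey] = clr.Default["ingress-bandwidth"]
	}
	if _, ok := pod.Annotations[egressKey]; !ok {
		pod.Annotations[egressKey] = clr.Default["egress-bandwidth"]
	}
}

// validatePod (validating path): reject values outside [min, max].
// Assumes min/max are set; the real code must handle omitted fields.
func validatePod(pod *corev1.Pod, clr *CustomLimitRangeSpec) error {
	for annKey, field := range map[string]string{
		ingressKey: "ingress-bandwidth",
		egressKey:  "egress-bandwidth",
	} {
		val, ok := pod.Annotations[annKey]
		if !ok {
			continue
		}
		q, err := resource.ParseQuantity(val)
		if err != nil {
			return err
		}
		min, max := resource.MustParse(clr.Min[field]), resource.MustParse(clr.Max[field])
		if q.Cmp(min) < 0 || q.Cmp(max) > 0 {
			return fmt.Errorf("%s=%s is outside [%s, %s]", annKey, val, clr.Min[field], clr.Max[field])
		}
	}
	return nil
}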

Usage

    1. Add the annotations to the Pod or Deployment:
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: xxxx
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...


# Deployment
...
 spec:
  template:
    metadata:
      # add the annotations
      annotations:
        kubernetes.io/ingress-bandwidth: 1M
        kubernetes.io/egress-bandwidth: 1M
...
    2. Define a CustomLimitRange so the annotations are injected automatically, as shown above.

Next chapter

Disk I/O traffic control: blkio, device IOPS