Kubernetes Pod Network Bandwidth Traffic Control
Background
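In scenarios such as hybrid cloud, where business Pods interfere with one another directly, and online/offline co-location (online and offline services serving users from the same machine), isolating cpu, mem, fd, inode, pid, and so on is not enough. Network bandwidth, disk IOPS, NBD IO, L3 cache, and memory bandwidth (MBA) also need to be isolated and limited.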
This section therefore covers how network bandwidth limits are used and implemented.
Usage and implementation in Kubernetes
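When a container is started, the runtime calls the underlying CNI network plugin through the runtime interface to create a virtual network device and bind it to the container. Limiting a container's network therefore depends on the CNI plugin: the plugin takes the limit settings and submits the concrete configuration to the Linux traffic control (tc) subsystem. tc comprises a set of mechanisms and operations by which packets are queued on a network interface for transmission or reception (here a token bucket filter, TBF), and that queuing is what achieves traffic control.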
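To make the token bucket idea concrete, here is a minimal sketch in Go. It is a toy model and not the kernel's TBF: the `tokenBucket` type and its methods are hypothetical names, time is passed in explicitly in milliseconds, and a packet that finds too few tokens is simply refused, whereas tc would keep it queued until enough tokens accumulate.

```go
package main

import "fmt"

// tokenBucket is a toy model of a token bucket: rate is in bits per second,
// burst is the bucket capacity in bits, tokens is what is currently available.
type tokenBucket struct {
	rate   int64
	burst  int64
	tokens int64
}

// newTokenBucket returns a bucket that starts full.
func newTokenBucket(rate, burst int64) *tokenBucket {
	return &tokenBucket{rate: rate, burst: burst, tokens: burst}
}

// refill adds the tokens earned over elapsedMs milliseconds, capped at burst.
func (b *tokenBucket) refill(elapsedMs int64) {
	b.tokens += b.rate * elapsedMs / 1000
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
}

// allow reports whether a packet of the given size in bits may pass now,
// and consumes the tokens if it does.
func (b *tokenBucket) allow(bits int64) bool {
	if bits > b.tokens {
		return false
	}
	b.tokens -= bits
	return true
}

func main() {
	// 1 Mbit/s with a burst of one 1500-byte packet (12000 bits).
	b := newTokenBucket(1000000, 12000)
	const packet = 12000

	fmt.Println("first packet passes:", b.allow(packet))
	fmt.Println("second packet passes:", b.allow(packet))
	b.refill(12) // 12 ms at 1 Mbit/s earns 12000 bits
	fmt.Println("packet passes after 12 ms:", b.allow(packet))
}
```

The rate bounds the long-term average, while the burst bounds how much can be sent at once after an idle period. These correspond to the rate and burst settings that appear in the CNI configuration.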
How CNI drives Linux TC
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.0", # must be 0.3.0, currently the highest version the containernetworking plugins support
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "datastore_type": "kubernetes",
      "nodename": "127.0.0.1",
      "ipam": { "type": "host-local", "subnet": "usePodCidr" },
      "policy": { "type": "k8s" },
      "kubernetes": { "kubeconfig": "/etc/cni/net.d/calico-kubeconfig" }
    },
    {
      "type": "bandwidth",
      "capabilities": {
        "bandwidth": true # accept limits submitted as JSON by the runtime (e.g. cri-o)
      },
      /* The four settings below set the limits directly in the CNI plugin config; use either capabilities or these four settings, not both
      "ingressRate": 123,
      "ingressBurst": 456,
      "egressRate": 123,
      "egressBurst": 456
      */
    }
  ]
}
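The CNI plugin accepts this static configuration, and it also accepts configuration submitted as JSON by cri-o, containerd, dockershim, and other runtimes.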
func cmdAdd(args *skel.CmdArgs) error {
	// parse the CNI configuration
	conf, err := parseConfig(args.StdinData)
	if err != nil {
		return err
	}
	//...
	// read the ingress rate and burst from the configuration
	if bandwidth.IngressRate > 0 && bandwidth.IngressBurst > 0 {
		// create the rate-limiting rule as a TC TBF qdisc
		err = CreateIngressQdisc(bandwidth.IngressRate, bandwidth.IngressBurst, hostInterface.Name)
		if err != nil {
			return err
		}
	}
	// read the egress rate and burst from the configuration
	if bandwidth.EgressRate > 0 && bandwidth.EgressBurst > 0 {
		// ...
		// set the egress rate-limiting rule on the specific local device
		err = CreateEgressQdisc(bandwidth.EgressRate, bandwidth.EgressBurst, hostInterface.Name, ifbDeviceName)
		if err != nil {
			return err
		}
	}
	return types.PrintResult(result, conf.CNIVersion)
}
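The two ways of supplying limits described above (static settings in the config file, or JSON submitted by the runtime) can be sketched with Go's `encoding/json`. This is an illustration under assumptions, not the plugin's source. The `PluginConf` and `BandwidthEntry` types and the `parse` and `effective` helpers are hypothetical. The `runtimeConfig.bandwidth` key is how I understand runtime-supplied capability arguments to arrive. The precedence in `effective` is my own choice, since the text only says to use one source or the other.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BandwidthEntry mirrors the four settings shown in the configuration.
type BandwidthEntry struct {
	IngressRate  uint64 `json:"ingressRate"`
	IngressBurst uint64 `json:"ingressBurst"`
	EgressRate   uint64 `json:"egressRate"`
	EgressBurst  uint64 `json:"egressBurst"`
}

// PluginConf is a hypothetical, trimmed-down view of the plugin's stdin.
type PluginConf struct {
	Type          string `json:"type"`
	RuntimeConfig struct {
		Bandwidth *BandwidthEntry `json:"bandwidth"`
	} `json:"runtimeConfig"`
	BandwidthEntry // static values written directly in the config file
}

// parse decodes the JSON handed to the plugin; unknown keys are ignored.
func parse(data []byte) (*PluginConf, error) {
	var c PluginConf
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

// effective returns the runtime-supplied limits when present, otherwise the
// static ones. This precedence is an assumption of the sketch.
func effective(c *PluginConf) BandwidthEntry {
	if c.RuntimeConfig.Bandwidth != nil {
		return *c.RuntimeConfig.Bandwidth
	}
	return c.BandwidthEntry
}

func main() {
	static := []byte(`{"type": "bandwidth", "ingressRate": 123, "ingressBurst": 456, "egressRate": 123, "egressBurst": 456}`)
	fromRuntime := []byte(`{"type": "bandwidth", "capabilities": {"bandwidth": true},
		"runtimeConfig": {"bandwidth": {"ingressRate": 1000000, "ingressBurst": 2000000, "egressRate": 1000000, "egressBurst": 2000000}}}`)

	for _, data := range [][]byte{static, fromRuntime} {
		c, err := parse(data)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%+v\n", effective(c))
	}
}
```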
Traffic control configuration through the container runtime
Configure it through Pod annotations:
apiVersion: v1
kind: Pod
metadata:
  name: iperf-slow
  annotations:
    kubernetes.io/ingress-bandwidth: 10M
    kubernetes.io/egress-bandwidth: 10M
...
The Kubernetes code parses and uses the kubernetes.io/ingress-bandwidth and kubernetes.io/egress-bandwidth Pod annotations. Values are only accepted in the range 1k to 1P, and values above 32G require tuning kernel parameters.
// accepted values lie between 1k and 1P
var minRsrc = resource.MustParse("1k")
var maxRsrc = resource.MustParse("1P")

// read the limits from the Pod annotations so they can be passed on to the runtime
func ExtractPodBandwidthResources(podAnnotations map[string]string) (ingress, egress *resource.Quantity, err error) {
	if podAnnotations == nil {
		return nil, nil, nil
	}
	str, found := podAnnotations["kubernetes.io/ingress-bandwidth"]
	if found {
		ingressValue, err := resource.ParseQuantity(str)
		if err != nil {
			return nil, nil, err
		}
		ingress = &ingressValue
		if err := validateBandwidthIsReasonable(ingress); err != nil {
			return nil, nil, err
		}
	}
	str, found = podAnnotations["kubernetes.io/egress-bandwidth"]
	if found {
		egressValue, err := resource.ParseQuantity(str)
		if err != nil {
			return nil, nil, err
		}
		egress = &egressValue
		if err := validateBandwidthIsReasonable(egress); err != nil {
			return nil, nil, err
		}
	}
	return ingress, egress, nil
}
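The range check on annotation values can be illustrated without the Kubernetes libraries. The sketch below uses only the standard library. `parseBandwidth` is a hypothetical stand-in for `resource.ParseQuantity` plus the range validation, and it understands only a whole number followed by an optional decimal suffix (k, M, G, T, P), which is far less than the real quantity syntax.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Decimal SI suffixes: each step is a factor of 1000, not 1024.
var siSuffix = map[string]int64{
	"k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12, "P": 1e15,
}

const (
	minBits int64 = 1e3  // "1k"
	maxBits int64 = 1e15 // "1P"
)

// parseBandwidth converts a value such as "10M" to bits per second and
// rejects anything outside [1k, 1P]. Hypothetical helper for illustration.
func parseBandwidth(value string) (int64, error) {
	s := strings.TrimSpace(value)
	if s == "" {
		return 0, fmt.Errorf("empty bandwidth value")
	}
	mult := int64(1)
	if m, ok := siSuffix[s[len(s)-1:]]; ok {
		mult = m
		s = s[:len(s)-1]
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("invalid bandwidth %q: %w", value, err)
	}
	// Checking before multiplying also rules out integer overflow.
	if n < 0 || n > maxBits/mult {
		return 0, fmt.Errorf("bandwidth %q is negative or unreasonably large (> 1P)", value)
	}
	bits := n * mult
	if bits < minBits {
		return 0, fmt.Errorf("bandwidth %q is unreasonably small (< 1k)", value)
	}
	return bits, nil
}

func main() {
	for _, s := range []string{"10M", "1k", "999", "2P"} {
		bits, err := parseBandwidth(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s = %d bits/sec\n", s, bits)
	}
}
```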
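Taking containerd as an example: once the kubelet has obtained the Pod's YAML information, it passes it to the containerd runtime, which in turn passes it on to the CNI plugin.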
func cniNamespaceOpts(id string, config *runtime.PodSandboxConfig) ([]cni.NamespaceOpts, error) {
	opts := []cni.NamespaceOpts{
		cni.WithLabels(toCNILabels(id, config)),
		cni.WithCapability(annotations.PodAnnotations, config.Annotations),
	}
	portMappings := toCNIPortMappings(config.GetPortMappings())
	if len(portMappings) > 0 {
		opts = append(opts, cni.WithCapabilityPortMap(portMappings))
	}
	// read the limits from the Pod annotations and hand them to CNI
	bandWidth, err := toCNIBandWidth(config.Annotations)
	if err != nil {
		return nil, err
	}
	if bandWidth != nil {
		opts = append(opts, cni.WithCapabilityBandWidth(*bandWidth))
	}
	// ...
}
Verification and testing
**Traffic control depends on the Linux TC subsystem, so only Linux K8s clusters are currently supported.**
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-server-deployment
  labels:
    app: iperf-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-server
  template:
    metadata:
      labels:
        app: iperf-server
      # add the annotations
      annotations:
        kubernetes.io/ingress-bandwidth: 1M
        kubernetes.io/egress-bandwidth: 1M
    spec:
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      containers:
        - name: iperf3-server
          image: dongjiang1989/iperf
          args: ['-s', '-p', '5001']
          ports:
            - containerPort: 5001
              name: server
      terminationGracePeriodSeconds: 0
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: iperf-client
  labels:
    app: iperf-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: iperf-client
  template:
    metadata:
      labels:
        app: iperf-client
    spec:
      containers:
        - name: iperf-client
          image: dongjiang1989/iperf
          command: ['/bin/sh', '-c', 'sleep 1d']
      terminationGracePeriodSeconds: 0
Without the bandwidth-limiting annotations:
$ kubectl get pod | grep iperf
iperf-client-7874c47d95-t7hph 1/1 Running 0 5m58s
iperf-server-deployment-74d94bdd59-dzdl4 1/1 Running 0 5m58s
$ kubectl exec iperf-client-7874c47d95-t7hph -- iperf -c 10.1.0.173 -p 5001 -i 10 -t 100
------------------------------------------------------------
Client connecting to 10.1.0.173, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.1.0.172 port 56296 connected with 10.1.0.173 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.00 sec 19.7 GBytes 16.9 Gbits/sec
[ 1] 10.00-20.00 sec 18.9 GBytes 16.2 Gbits/sec
[ 1] 20.00-30.00 sec 20.0 GBytes 17.2 Gbits/sec
[ 1] 30.00-40.00 sec 20.4 GBytes 17.5 Gbits/sec
[ 1] 40.00-50.00 sec 18.5 GBytes 15.9 Gbits/sec
[ 1] 50.00-60.00 sec 19.3 GBytes 16.5 Gbits/sec
[ 1] 60.00-70.00 sec 17.6 GBytes 15.1 Gbits/sec
[ 1] 70.00-80.00 sec 17.1 GBytes 14.7 Gbits/sec
[ 1] 80.00-90.00 sec 18.4 GBytes 15.8 Gbits/sec
[ 1] 90.00-100.00 sec 15.1 GBytes 13.0 Gbits/sec
[ 1] 0.00-100.00 sec 185 GBytes 15.9 Gbits/sec
Without rate limiting, bandwidth reaches 15.9 Gbits/sec.
With the bandwidth-limiting annotations:
$ kubectl get pod | grep iperf
iperf-clients-rcsh6 1/1 Running 0 7h7m
iperf-server-deployment-59675c8f78-g52pm 1/1 Running 0 6h52m
$ kubectl exec iperf-clients-rcsh6 -- iperf -c 10.1.0.170 -p 5001 -i 10 -t 100
------------------------------------------------------------
Client connecting to 10.1.0.170, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 1] local 10.1.0.170 port 54652 connected with 10.1.0.170 port 5001
[ ID] Interval Transfer Bandwidth
[ 1] 0.00-10.00 sec 3.50 MBytes 2.94 Mbits/sec
[ 1] 10.00-20.00 sec 2.25 MBytes 1.89 Mbits/sec
[ 1] 20.00-30.00 sec 2.04 MBytes 1.71 Mbits/sec
[ 1] 30.00-40.00 sec 892 KBytes 731 Kbits/sec
[ 1] 40.00-50.00 sec 954 KBytes 781 Kbits/sec
[ 1] 50.00-60.00 sec 1.36 MBytes 1.14 Mbits/sec
[ 1] 60.00-70.00 sec 1.18 MBytes 993 Kbits/sec
[ 1] 70.00-80.00 sec 87.1 KBytes 71.4 Kbits/sec
[ 1] 80.00-90.00 sec 0.000 Bytes 0.000 bits/sec
[ 1] 90.00-100.00 sec 2.97 MBytes 2.50 Mbits/sec
[ 1] 0.00-100.69 sec 15.5 MBytes 1.29 Mbits/sec
With a limit of 1 Mbits/sec, the measured throughput is 1.29 Mbits/sec.
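- Why is the measured throughput slightly above 1 Mbits/sec when the limit is 1 Mbits/sec?
- Reason: on Linux, 1M = 1024k, whereas Kubernetes implements 1M = 1000k through its Resource object.
- So a setting of 1 Mbits/sec should actually show up on Linux as 1024*1024 (bits/sec) / (1000*1000) = 1.048 Mbits/sec.
- Within the first 0-1 s, TC control is imprecise, which inflates the average.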
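The arithmetic in the list above can be checked with a short Go program. It computes the binary-to-decimal ratio and then averages the per-interval rates copied from the limited iperf run, once over all intervals and once without the first 10 s interval. The `mean` helper and the `limitedRates` variable are names introduced here for illustration only.

```go
package main

import "fmt"

// limitedRates holds the per-interval rates from the limited run, in Mbits/sec
// (Kbits/sec figures divided by 1000).
var limitedRates = []float64{2.94, 1.89, 1.71, 0.731, 0.781, 1.14, 0.993, 0.0714, 0, 2.50}

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

func main() {
	// Ratio between a binary "M" (1024*1024) and a decimal "M" (1000*1000).
	ratio := float64(1024*1024) / float64(1000*1000)
	fmt.Printf("binary/decimal ratio:        %.6f\n", ratio)

	fmt.Printf("mean over all intervals:     %.2f Mbits/sec\n", mean(limitedRates))
	fmt.Printf("mean without the first 10 s: %.2f Mbits/sec\n", mean(limitedRates[1:]))
}
```

With these figures the mean over all intervals comes out near 1.28 Mbits/sec, and it drops to about 1.09 Mbits/sec once the first interval is left out. That is consistent with the early part of the run pulling the average up.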
Summary
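- docker 1.18 supports passing the JSON through the runc runtime; with containerd as the runtime, version 1.4 is the first that supports it.
- calico requires version 2.1; cilium requires version 1.12.90; kube-ovn requires version 1.9.0, but it needs its own annotations to be supported: `ovn.kubernetes.io/ingress_rate` is the rate limit for ingress traffic, in Mbits/s, and `ovn.kubernetes.io/egress_rate` is the rate limit for egress traffic, in Mbits/s.
- The limits in the annotations cannot be updated dynamically; after changing them, the Pod must be deleted and recreated.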
Therefore, a webhook is needed to enrich the configuration, align it with the semantics of a namespace-level LimitRange, and support filling in defaults.
Implementation
First, a CRD describes the extended LimitRange restrictions for a namespace. The design is as follows:
apiVersion: custom.xxx.com/v1
kind: CustomLimitRange
metadata:
  name: test-rangelimit
spec:
  limitrange:
    type: pod # limits the pod type; to be extended later to the container, ingress, and service types
    max: # max and min are the upper and lower bounds; if a pod's own value falls outside them, the ValidatingAdmissionWebhook rejects it
      ingress-bandwidth: "1G"
      egress-bandwidth: "1G"
    min:
      ingress-bandwidth: "10M"
      egress-bandwidth: "10M"
    default: # if default is defined and the pod annotation is empty, the MutatingAdmissionWebhook injects these values; if default is not defined, nothing is injected
      ingress-bandwidth: "128M"
      egress-bandwidth: "128M"
A Pod can set customlimitrange.kubernetes.io/limited: disable to ignore the CustomLimitRange restrictions of its namespace.
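The default-injection behaviour and the opt-out annotation can be sketched as a pure function over annotation maps. This is an assumption-laden illustration, not the webhook itself. `applyDefaults` is a hypothetical helper, the defaults are keyed by the full annotation names rather than the short CRD field names, and missing values are filled per annotation, which is one possible reading of "the pod annotation is empty".

```go
package main

import "fmt"

const (
	ingressKey = "kubernetes.io/ingress-bandwidth"
	egressKey  = "kubernetes.io/egress-bandwidth"
	limitedKey = "customlimitrange.kubernetes.io/limited"
)

// applyDefaults returns a copy of the Pod annotations with missing bandwidth
// annotations filled in from defaults. A nil defaults map means the
// CustomLimitRange defines no default, so nothing is injected.
func applyDefaults(annotations, defaults map[string]string) map[string]string {
	out := make(map[string]string, len(annotations)+2)
	for k, v := range annotations {
		out[k] = v
	}
	// A Pod that opts out is left untouched.
	if out[limitedKey] == "disable" {
		return out
	}
	for _, key := range []string{ingressKey, egressKey} {
		if _, set := out[key]; set {
			continue // the Pod's own value is kept
		}
		if def, ok := defaults[key]; ok {
			out[key] = def
		}
	}
	return out
}

func main() {
	defaults := map[string]string{ingressKey: "128M", egressKey: "128M"}

	fmt.Println(applyDefaults(nil, defaults))
	fmt.Println(applyDefaults(map[string]string{ingressKey: "10M"}, defaults))
	fmt.Println(applyDefaults(map[string]string{limitedKey: "disable"}, defaults))
}
```

Returning a copy keeps the function free of side effects, which makes it easy to compare the input and output when building the patch a mutating webhook has to return.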
Notes
Validating the CustomLimitRange object itself is essential:
- max value >= default value >= min value
- values must lie in [1k, 1P], and their units are Kbits/sec, Mbits/sec, Gbits/sec, Tbits/sec, or Pbits/sec
- type is an enum
- max, min, and default may be omitted
- internal adaptation: kube-ovn annotations
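The ordering and range rules in the list above can be sketched as a small validation function. The `limits` type, `validateLimits`, and `ptr` are hypothetical names. Values are assumed to be already converted to bits per second, and the enum check on type and the kube-ovn adaptation are left out.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	minAllowed int64 = 1e3  // 1k
	maxAllowed int64 = 1e15 // 1P
)

// limits holds the bounds for one direction in bits per second.
// A nil pointer means the field was omitted, which the rules allow.
type limits struct {
	Max, Min, Default *int64
}

// validateLimits checks the [1k, 1P] range of every field that is set and
// the ordering max >= default >= min among the fields that are set.
func validateLimits(l limits) error {
	fields := map[string]*int64{"max": l.Max, "min": l.Min, "default": l.Default}
	for name, v := range fields {
		if v != nil && (*v < minAllowed || *v > maxAllowed) {
			return fmt.Errorf("%s=%d is outside [1k, 1P]", name, *v)
		}
	}
	if l.Max != nil && l.Min != nil && *l.Max < *l.Min {
		return errors.New("max must be >= min")
	}
	if l.Default != nil && l.Max != nil && *l.Default > *l.Max {
		return errors.New("default must be <= max")
	}
	if l.Default != nil && l.Min != nil && *l.Default < *l.Min {
		return errors.New("default must be >= min")
	}
	return nil
}

// ptr returns a pointer to v, for building limits values inline.
func ptr(v int64) *int64 { return &v }

func main() {
	// The values from the design example: max 1G, min 10M, default 128M.
	ok := limits{Max: ptr(1000000000), Min: ptr(10000000), Default: ptr(128000000)}
	fmt.Println("design example:", validateLimits(ok))

	bad := limits{Max: ptr(10000000), Min: ptr(1000000000)}
	fmt.Println("max below min:", validateLimits(bad))
}
```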
Usage
- Add the annotations to the Pod or Deployment:
# Pod
apiVersion: v1
kind: Pod
metadata:
  name: xxxx
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
...
# Deployment
...
spec:
  template:
    metadata:
      # add the annotations
      annotations:
        kubernetes.io/ingress-bandwidth: 1M
        kubernetes.io/egress-bandwidth: 1M
...
- Define a CustomLimitRange so that the annotations are added automatically, as described above.