1. Pod Scheduling Overview
By default, the Node a Pod runs on is computed by the Scheduler component using its scheduling algorithms, and the process is not under manual control. In practice this does not always meet our needs: we often want certain Pods to land on certain nodes. To do that, we need to understand Kubernetes' Pod scheduling rules.
Kubernetes offers four categories of scheduling:
- Automatic scheduling: the node a Pod runs on is determined entirely by the Scheduler through a series of algorithms
- Directed scheduling: NodeName, NodeSelector
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity
- Taint (toleration) scheduling: Taints, Toleration
2. Directed Scheduling
Directed scheduling means declaring nodeName or nodeSelector on a pod to place it on a desired node.
Note that this scheduling is mandatory: even if the target Node does not exist, the pod is still assigned to it; it simply fails to run.
(1) nodeName
nodeName forcibly constrains a Pod to the Node with the given name.
This approach skips the Scheduler's logic entirely and binds the Pod directly to the named node.
[root@k8s-master ~]# vim pod-nodename.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodename
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
ports:
- name: nginx-port
      containerPort: 80
  nodeName: k8s-node01 # schedule this pod onto k8s-node01. The value must be a node name as listed by kubectl get node; nodeName is the fixed field name
[root@k8s-master ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 5d20h v1.17.4
k8s-node01 Ready <none> 5d20h v1.17.4
k8s-node02 Ready <none> 5d20h v1.17.4
[root@k8s-master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created
[root@k8s-master ~]# kubectl get pod pod-nodename -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodename 1/1 Running 0 17s 10.244.1.36 k8s-node01 <none> <none>
[root@k8s-master ~]# kubectl delete -f pod-nodename.yaml
pod "pod-nodename" deleted
# Point the pod at a node that is not part of the cluster
[root@k8s-master ~]# vim pod-nodename.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodename
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
ports:
- name: nginx-port
containerPort: 80
nodeName: node100
[root@k8s-master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created
[root@k8s-master ~]# kubectl get pod pod-nodename -n test -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodename 0/1 Pending 0 29s <none> node100 <none> <none>
pod-nodename 0/1 Terminating 0 56s <none> node100 <none> <none>
pod-nodename 0/1 Terminating 0 56s <none> node100 <none> <none>
# Because no such node exists, the pod cannot be scheduled, and after a while it is deleted automatically
(2) nodeSelector
nodeSelector schedules a pod onto nodes that carry the specified labels.
It relies on Kubernetes' label-selector mechanism: before the pod is created, the scheduler uses the MatchNodeSelector scheduling policy to match labels and find the target node, then schedules the pod onto it. The match is a hard constraint; if no node matches, scheduling fails.
First add labels to the nodes:
[root@k8s-master ~]# kubectl label node k8s-node01 nodeenv=pro
node/k8s-node01 labeled
[root@k8s-master ~]# kubectl label node k8s-node02 nodeenv=test
node/k8s-node02 labeled
# Verify
[root@k8s-master ~]# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-master Ready master 5d20h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-node01 Ready <none> 5d20h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node01,kubernetes.io/os=linux,nodeenv=pro
k8s-node02 Ready <none> 5d20h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node02,kubernetes.io/os=linux,nodeenv=test
Create a pod manifest that uses nodeSelector:
[root@k8s-master ~]# vim pod-nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodeselector
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
nodeSelector:
    nodeenv: pro # schedule onto a node carrying the label nodeenv=pro. The key-value pair must exist on some node, otherwise scheduling fails
# Create
[root@k8s-master ~]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created
# Scheduled onto the specified node
[root@k8s-master ~]# kubectl get pod pod-nodeselector -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeselector 1/1 Running 0 16s 10.244.1.37 k8s-node01 <none> <none>
# Delete
[root@k8s-master ~]# kubectl delete -f pod-nodeselector.yaml
pod "pod-nodeselector" deleted
# Point the pod at a node label that does not exist
[root@k8s-master ~]# vim pod-nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodeselector
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
nodeSelector:
    nodeenv: pro100 # a node label that does not exist
[root@k8s-master ~]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created
# No node carries this label, so scheduling fails
[root@k8s-master ~]# kubectl get pod pod-nodeselector -n test -o wide -w
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeselector 0/1 Pending 0 2m3s <none> <none> <none> <none>
[root@k8s-master ~]# kubectl describe pod pod-nodeselector -n test
……omitted……
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector. # none of the 3 nodes (1 master, 2 workers) carries the required label
# Although scheduling fails, the pod is not deleted automatically; delete it manually
[root@k8s-master ~]# kubectl delete -f pod-nodeselector.yaml
pod "pod-nodeselector" deleted
# Remove the node labels
[root@k8s-master ~]# kubectl label node k8s-node01 nodeenv-
node/k8s-node01 labeled
[root@k8s-master ~]# kubectl label node k8s-node02 nodeenv-
node/k8s-node02 labeled
3. Affinity Scheduling
The two directed-scheduling mechanisms above, nodeName and nodeSelector, are convenient, but they have a problem: if no node satisfies the condition, the pod is never scheduled and never runs, even when usable nodes remain in the cluster. This limits where they can be used.
To address this, Kubernetes provides affinity scheduling (Affinity). It extends NodeSelector so that, through configuration, nodes satisfying the conditions are preferred, and if none do, the pod can still be scheduled onto a node that does not satisfy them, making scheduling more flexible.
Affinity comes in three flavors:
- nodeAffinity (node affinity): targets nodes; determines which nodes a pod may be scheduled onto
- podAffinity (pod affinity): targets pods; determines which existing pods a new pod may share a topology domain with
- podAntiAffinity (pod anti-affinity): targets pods; determines which existing pods a new pod must not share a topology domain with
When to use affinity (anti-affinity):
Affinity: if two applications interact frequently, it is worth using affinity to place them as close together as possible, reducing the performance cost of network traffic between them.
Anti-affinity: when an application runs with multiple replicas, anti-affinity spreads the instances across nodes, improving the service's availability. A hedged sketch of this case follows.
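As a sketch of the anti-affinity use case (the Deployment name web, its app=web label, and the replica count are illustrative, not taken from the examples below), a Deployment can forbid two of its replicas from sharing a node:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web # hypothetical Deployment
  namespace: test
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values: ["web"]
            topologyKey: kubernetes.io/hostname # no two app=web pods on the same node
      containers:
      - name: nginx
        image: nginx:1.17.1
Note that with 3 replicas and only 2 schedulable workers, this hard rule would leave one replica Pending; a preferredDuringSchedulingIgnoredDuringExecution term would relax it.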
(1) NodeAffinity
Targets nodes; determines which nodes a pod may be scheduled onto.
Configurable fields of NodeAffinity:
[root@k8s-master ~]# kubectl explain pod.spec.affinity.nodeAffinity
KIND: Pod
VERSION: v1
RESOURCE: nodeAffinity <Object>
FIELDS:
requiredDuringSchedulingIgnoredDuringExecution: # the Node must satisfy all of the specified rules; a hard constraint
  nodeSelectorTerms: # list of node selector terms
  # matchFields # list of node selector requirements by node field
  - matchExpressions: # list of node selector requirements by node label (recommended)
    - key: # key
      values: # values
      operator: # operator; supports Exists, DoesNotExist, In, NotIn, Gt (greater than), Lt (less than)
preferredDuringSchedulingIgnoredDuringExecution: # prefer nodes that satisfy the rules; a soft constraint (preference)
- preference: # a node selector term, associated with the corresponding weight
    # matchFields # list of node selector requirements by node field
    matchExpressions: # list of node selector requirements by node label (recommended)
    - key: # key
      values: # values
      operator: # operator; supports Exists, DoesNotExist, In, NotIn, Gt, Lt
  weight: # preference weight, range 1-100
Notes on the operators:
- matchExpressions:
  - key: nodeenv
    operator: Exists # match nodes that carry a label with key nodeenv
  - key: nodeenv
    operator: In
    values: ["xxx","yyy"] # match nodes whose nodeenv label value is "xxx" or "yyy"
  - key: nodeenv
    operator: Gt
    values: "xxx" # match nodes whose nodeenv label value is greater than "xxx"
requiredDuringSchedulingIgnoredDuringExecution (hard constraint) example
# Check node labels
[root@k8s-master ~]# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-master Ready master 6d19h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-node01 Ready <none> 6d18h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node01,kubernetes.io/os=linux
k8s-node02 Ready <none> 6d18h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node02,kubernetes.io/os=linux
# Add a nodeenv label to node01 and node02
[root@k8s-master ~]# kubectl label node k8s-node01 nodeenv=pro
node/k8s-node01 labeled
[root@k8s-master ~]# kubectl label node k8s-node02 nodeenv=test
node/k8s-node02 labeled
# Labels applied
[root@k8s-master ~]# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-master Ready master 6d19h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-node01 Ready <none> 6d18h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node01,kubernetes.io/os=linux,nodeenv=pro
k8s-node02 Ready <none> 6d18h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node02,kubernetes.io/os=linux,nodeenv=test
# Write the manifest
[root@k8s-master ~]# vim pod-nodeaffinity-required.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodeaffinity-required
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
  affinity: # affinity settings
    nodeAffinity: # node affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
        nodeSelectorTerms:
        - matchExpressions: # match nodes whose nodeenv label value is "xxx" or "yyy"
- key: nodeenv
operator: In
values: ["xxx","yyy"]
# Create
[root@k8s-master ~]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created
# Check: scheduling failed, the pod stays Pending
[root@k8s-master ~]# kubectl get pod pod-nodeaffinity-required -n test
NAME READY STATUS RESTARTS AGE
pod-nodeaffinity-required 0/1 Pending 0 77s
# Details
[root@k8s-master ~]# kubectl describe pod pod-nodeaffinity-required -n test
……omitted……
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
# None of the three nodes satisfies the match condition; scheduling fails
# Delete
[root@k8s-master ~]# kubectl delete -f pod-nodeaffinity-required.yaml
pod "pod-nodeaffinity-required" deleted
# Modify the manifest
[root@k8s-master ~]# vim pod-nodeaffinity-required.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodeaffinity-required
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nodeenv
operator: In
values: ["pro","yyy"] # 匹配节点主机标签的键为nodeenv,值为"pro"或"yyy"的主机
# Create
[root@k8s-master ~]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created
# Check: running, scheduled onto k8s-node01
[root@k8s-master ~]# kubectl get pod pod-nodeaffinity-required -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-required 1/1 Running 0 35s 10.244.1.38 k8s-node01 <none> <none>
# Because k8s-node01 carries the label nodeenv=pro
[root@k8s-master ~]# kubectl get node -l nodeenv=pro --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-node01 Ready <none> 6d19h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node01,kubernetes.io/os=linux,nodeenv=pro
# Delete
[root@k8s-master ~]# kubectl delete -f pod-nodeaffinity-required.yaml
pod "pod-nodeaffinity-required" deleted
preferredDuringSchedulingIgnoredDuringExecution (soft constraint) example
# Check node labels
[root@k8s-master ~]# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-master Ready master 6d19h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
k8s-node01 Ready <none> 6d19h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node01,kubernetes.io/os=linux,nodeenv=pro
k8s-node02 Ready <none> 6d19h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8s-node02,kubernetes.io/os=linux,nodeenv=test
# Write the manifest
[root@k8s-master ~]# vim pod-nodeaffinity-preferred.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-nodeaffinity-preferred
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
imagePullPolicy: IfNotPresent
  affinity: # affinity settings
    nodeAffinity: # node affinity
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
      - weight: 1
        preference:
          matchExpressions: # match nodes whose nodeenv label value is "xxx" or "yyy" (no such node exists in this environment)
- key: nodeenv
operator: In
values: ["xxx","yyy"]
# Create
[root@k8s-master ~]# kubectl create -f pod-nodeaffinity-preferred.yaml
pod/pod-nodeaffinity-preferred created
# Check: running, scheduled onto k8s-node02. With a soft constraint the pod is scheduled even when no node matches; when a matching node exists it is preferred
[root@k8s-master ~]# kubectl get pod pod-nodeaffinity-preferred -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-preferred 1/1 Running 0 3s 10.244.2.33 k8s-node02 <none> <none>
# Delete
[root@k8s-master ~]# kubectl delete -f pod-nodeaffinity-preferred.yaml
pod "pod-nodeaffinity-preferred" deleted
Notes on NodeAffinity rules:
- If both nodeSelector and nodeAffinity are defined, a node must satisfy both for the pod to run on it (see the sketch after this list)
- If nodeAffinity specifies multiple nodeSelectorTerms, matching any one of them is sufficient
- If a nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them to match
- If a node's labels change while a pod is running on it and no longer satisfy the pod's node affinity, the change is ignored and the pod keeps running
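A minimal sketch of the first rule, assuming the nodeenv=pro label from the examples above (the pod name is hypothetical); a node must satisfy both the nodeSelector and the nodeAffinity term for the pod to land on it:
apiVersion: v1
kind: Pod
metadata:
  name: pod-both-selectors # hypothetical name
  namespace: test
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector:
    nodeenv: pro # condition 1: label match
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions: # condition 2: must hold on the same node
          - key: kubernetes.io/hostname
            operator: In
            values: ["k8s-node01"]
If either condition fails on every node, the pod stays Pending.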
(2) PodAffinity
PodAffinity uses a running pod as the reference and places the newly created pod in the same topology domain as the reference pod.
Configurable fields of PodAffinity:
[root@k8s-master ~]# kubectl explain pod.spec.affinity.podAffinity
KIND: Pod
VERSION: v1
RESOURCE: podAffinity <Object>
FIELDS:
requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
- labelSelector: # label selector
    matchExpressions: # list of selector requirements by pod label (recommended)
    - key: # key
      values: # values
      operator: # operator; supports In, NotIn, Exists, DoesNotExist
  # matchLabels: # a map form equivalent to multiple matchExpressions
  namespaces: # namespace of the reference pods
  topologyKey: # scheduling scope. kubernetes.io/hostname distinguishes by node; beta.kubernetes.io/os distinguishes by the node's operating system type
preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
- podAffinityTerm: # the affinity term
    namespaces:
    topologyKey:
    labelSelector:
      matchExpressions: # list of selector requirements by pod label (recommended)
      - key: # key
        values: # values
        operator: # operator; supports In, NotIn, Exists, DoesNotExist
    # matchLabels: # a map form equivalent to multiple matchExpressions
  weight: # preference weight, 1-100
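topologyKey decides how large the "same domain" is. A hedged fragment of pod.spec.affinity, assuming nodes carry the well-known zone label topology.kubernetes.io/zone (not set in this demo cluster): swapping kubernetes.io/hostname for the zone label relaxes co-location from "same node" to "same zone":
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: podenv
          operator: In
          values: ["pro"]
      topologyKey: topology.kubernetes.io/zone # same zone as the reference pod, not necessarily the same node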
requiredDuringSchedulingIgnoredDuringExecution (hard constraint) example
# Create the reference (target) pod
[root@k8s-master ~]# vim pod-podaffinity-target.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-podaffinity-target
namespace: test
  labels:
    podenv: pro # set a label; the podAffinity pod below uses it to reference this pod
spec:
containers:
- name: nginx
image: nginx:1.17.1
nodeName: k8s-node01
# Start the target pod
[root@k8s-master ~]# kubectl create -f pod-podaffinity-target.yaml
pod/pod-podaffinity-target created
# Check
[root@k8s-master ~]# kubectl get pod pod-podaffinity-target -n test -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-target 1/1 Running 0 2m40s 10.244.1.39 k8s-node01 <none> <none> podenv=pro
# Create a pod that uses a podAffinity hard constraint
[root@k8s-master ~]# vim pod-podaffinity-required.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-podaffinity-required
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
  affinity: # affinity settings
    podAffinity: # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match reference pods whose podenv label value is in ["xxx","yyy"]
- key: podenv
operator: In
values: ["xxx","yyy"]
        topologyKey: kubernetes.io/hostname # scope: when the rule matches, run this pod on the same node as the reference pod
# The configuration above means: the new pod must be scheduled onto the same node as a pod carrying the label podenv=xxx or podenv=yyy
[root@k8s-master ~]# kubectl create -f pod-podaffinity-required.yaml
pod/pod-podaffinity-required created
# Scheduling fails
[root@k8s-master ~]# kubectl get pod pod-podaffinity-required -n test -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-required 0/1 Pending 0 44s <none> <none> <none> <none> <none>
# Find out why
[root@k8s-master ~]# kubectl describe pod pod-podaffinity-required -n test
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity rules.
# The master node has a taint, and on the other 2 nodes no pod matches the affinity rule
# The target pod created above carries the label podenv=pro: the key matches, but the value does not; the rule requires key podenv with value xxx or yyy
# Delete the pod and fix the affinity rule in the manifest
[root@k8s-master ~]# kubectl delete -f pod-podaffinity-required.yaml
pod "pod-podaffinity-required" deleted
[root@k8s-master ~]# vim pod-podaffinity-required.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-podaffinity-required
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: podenv
operator: In
values: ["pro","yyy"] # 修改为目标pod对应的值
topologyKey: kubernetes.io/hostname
# Check: scheduled and running, on the same node as the target pod
[root@k8s-master ~]# kubectl get pod pod-podaffinity-required -n test -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-required 1/1 Running 0 21s 10.244.1.41 k8s-node01 <none> <none> <none>
[root@k8s-master ~]# kubectl get pod -n test -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-required 1/1 Running 0 26s 10.244.1.41 k8s-node01 <none> <none> <none>
pod-podaffinity-target 1/1 Running 0 5m29s 10.244.1.40 k8s-node01 <none> <none> podenv=pro
# Clean up
[root@k8s-master ~]# kubectl delete -f pod-podaffinity-required.yaml
pod "pod-podaffinity-required" deleted
[root@k8s-master ~]# kubectl delete -f pod-podaffinity-target.yaml
pod "pod-podaffinity-target" deleted
(3) PodAntiAffinity
podAntiAffinity (pod anti-affinity) uses running Pods as the reference and schedules the new pod into a different topology domain from the reference pods.
Its configurable fields are the same as PodAffinity's.
# Create the reference pod
[root@k8s-master ~]# vim pod-podaffinity-target.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-podaffinity-target
namespace: test
labels:
    podenv: pro # set a label
spec:
containers:
- name: nginx
image: nginx:1.17.1
nodeName: k8s-node01
# Run it
[root@k8s-master ~]# kubectl create -f pod-podaffinity-target.yaml
pod/pod-podaffinity-target created
# Check
[root@k8s-master ~]# kubectl get pod pod-podaffinity-target -n test -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-target 1/1 Running 0 34s 10.244.1.46 k8s-node01 <none> <none> podenv=pro
# Create a pod governed by an anti-affinity rule
[root@k8s-master ~]# vim pod-podantiaffinity-required.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-podantiaffinity-required
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
  affinity: # affinity settings
    podAntiAffinity: # pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match pods whose podenv label value is pro
- key: podenv
operator: In
values: ["pro"]
topologyKey: kubernetes.io/hostname
# This rule means: the new pod must not share a node with any pod carrying the label podenv=pro
# Create
[root@k8s-master ~]# kubectl create -f pod-podantiaffinity-required.yaml
pod/pod-podantiaffinity-required created
# Check
[root@k8s-master ~]# kubectl get pod -n test -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-target 1/1 Running 0 9m1s 10.244.1.46 k8s-node01 <none> <none> podenv=pro
pod-podantiaffinity-required 1/1 Running 0 91s 10.244.2.36 k8s-node02 <none> <none> <none>
# The new pod runs on k8s-node02, a different node from the pod that carries the label podenv=pro
4. Taints and Tolerations
(1) Taints
The scheduling approaches so far all take the Pod's point of view: properties added to the Pod decide whether it may be scheduled onto a given Node. We can also take the Node's point of view and add taint attributes to a Node to decide whether Pods are allowed to be scheduled onto it.
Once a Node is tainted, a repelling relationship exists between it and Pods: Pods are refused scheduling onto the Node, and existing Pods may even be evicted.
A taint has the format:
key=value:effect
# key and value are the taint's label; effect describes what the taint does and supports three values:
PreferNoSchedule
# Kubernetes tries to avoid scheduling Pods onto a Node with this taint, unless no other node can take them
NoSchedule
# Kubernetes will not schedule new Pods onto a Node with this taint, but Pods already on the Node are unaffected
NoExecute
# Kubernetes will not schedule new Pods onto a Node with this taint, and also evicts Pods already running on it
Setting and removing taints with kubectl:
# Set a taint
kubectl taint node <node-name> key=value:effect
# Remove a taint
kubectl taint node <node-name> key:effect-
# Remove all taints with the given key
kubectl taint node <node-name> key-
Demo
- Prepare node k8s-node01 (k8s-node02 is stopped for now so the effect is more obvious)
- Set a taint on k8s-node01: tag=test:PreferNoSchedule, then create pod1 (pod1 schedules fine)
- Change the taint on k8s-node01 to tag=test:NoSchedule, then create pod2 (pod1 keeps running, pod2 fails)
- Change the taint on k8s-node01 to tag=test:NoExecute, then create pod3 (all 3 pods fail)
[root@k8s-master ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 9d v1.17.4
k8s-node01 Ready <none> 9d v1.17.4
k8s-node02 NotReady <none> 9d v1.17.4
# Set a taint on k8s-node01 (PreferNoSchedule)
[root@k8s-master ~]# kubectl taint node k8s-node01 tag=test:PreferNoSchedule
node/k8s-node01 tainted
# Check
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: tag=test:PreferNoSchedule
# Create a pod named taint1
[root@k8s-master ~]# kubectl run taint1 --image=nginx:1.17.1 -n test
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/taint1 created
# Scheduled onto k8s-node01: k8s-node02 is unavailable and k8s-master carries the default NoSchedule taint, while k8s-node01's taint effect is only PreferNoSchedule (avoid unless no other node is available). Since neither of the other two nodes can take taint1, it can only land on k8s-node01
[root@k8s-master ~]# kubectl describe node k8s-master | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[root@k8s-master ~]# kubectl get pod -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
taint1-766c47bf55-k9zgg 1/1 Running 0 12s 10.244.1.50 k8s-node01 <none> <none>
# Change the taint on k8s-node01 (remove PreferNoSchedule, set NoSchedule)
[root@k8s-master ~]# kubectl taint node k8s-node01 tag:PreferNoSchedule-
node/k8s-node01 untainted
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: <none>
[root@k8s-master ~]# kubectl taint node k8s-node01 tag=test:NoSchedule
node/k8s-node01 tainted
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: tag=test:NoSchedule
# Check taint1 again
[root@k8s-master ~]# kubectl get pod -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
taint1-766c47bf55-k9zgg 1/1 Running 0 8m32s 10.244.1.50 k8s-node01 <none> <none>
# Although node01's taint effect is now NoSchedule, taint1 keeps running,
# because NoSchedule only blocks new pods from being scheduled onto node01; pods already running there are unaffected
# Create a pod named taint2
[root@k8s-master ~]# kubectl run taint2 --image=nginx:1.17.1 -n test
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/taint2 created
# Check taint2
[root@k8s-master ~]# kubectl get pod -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
taint1-766c47bf55-k9zgg 1/1 Running 0 17m 10.244.1.50 k8s-node01 <none> <none>
taint2-84946958cf-ttp87 0/1 Pending 0 8m50s <none> <none> <none> <none>
[root@k8s-master ~]# kubectl describe pod taint2-84946958cf-ttp87 -n test
……omitted……
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
# The master has the NoSchedule taint, node02 is unavailable, and node01's taint is also NoSchedule,
# so taint2 cannot be scheduled onto any node
# Change the taint on k8s-node01 again (remove NoSchedule, set NoExecute)
[root@k8s-master ~]# kubectl taint node k8s-node01 tag:NoSchedule-
node/k8s-node01 untainted
[root@k8s-master ~]# kubectl taint node k8s-node01 tag=test:NoExecute
node/k8s-node01 tainted
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: tag=test:NoExecute
# Now check taint1 and taint2
[root@k8s-master ~]# kubectl get pod -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
taint1-766c47bf55-z7dcs 0/1 Pending 0 85s <none> <none> <none> <none>
taint2-84946958cf-xbrc7 0/1 Pending 0 85s <none> <none> <none> <none>
# node02 is unavailable, the master is tainted, and node01's taint effect is now NoExecute, so even taint1, which was already running on node01, gets evicted
# Create a pod named taint3
[root@k8s-master ~]# kubectl run taint3 --image=nginx:1.17.1 -n test
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/taint3 created
[root@k8s-master ~]# kubectl get pod -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
taint1-766c47bf55-z7dcs 0/1 Pending 0 85s <none> <none> <none> <none>
taint2-84946958cf-xbrc7 0/1 Pending 0 85s <none> <none> <none> <none>
taint3-57d45f9d4c-nx72m 0/1 Pending 0 26s <none> <none> <none> <none>
[root@k8s-master ~]# kubectl describe pod taint3-57d45f9d4c-nx72m -n test
……omitted……
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
# Likewise, a NoExecute taint not only evicts existing pods; new pods cannot be scheduled onto the node either, even when no other node is available
# Remove the taint
[root@k8s-master ~]# kubectl taint node k8s-node01 tag-
node/k8s-node01 untainted
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: <none>
# Clusters built with kubeadm add the NoSchedule taint to the master node by default, which is why pods are not scheduled onto the master
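As a hedged sketch (not part of the demo above), a pod that must run on the master could declare a toleration for that default taint; tolerations are explained in the next subsection:
apiVersion: v1
kind: Pod
metadata:
  name: pod-on-master # hypothetical name
  namespace: test
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists" # match on the key alone, whatever the value
    effect: "NoSchedule"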
(2) Tolerations
Taints let a Node refuse pods, but if you do want a pod scheduled onto a tainted node, tolerations are the mechanism.
A taint is a refusal and a toleration is an exemption: the Node refuses pods through taints, and a pod ignores that refusal through tolerations.
Configurable fields:
[root@k8s-master ~]# kubectl explain pod.spec.tolerations
KIND: Pod
VERSION: v1
RESOURCE: tolerations <[]Object>
tolerations: # toleration rules
- key: "string" # key of the taint to tolerate; empty matches all keys
  value: "string" # value of the taint to tolerate
  operator: "string" # key-value operator; supports Equal (the default) and Exists. With Exists only the key is matched and the value is ignored
  effect: "string" # effect of the taint to match; empty matches all effects
  tolerationSeconds: integer # toleration period, effective only when effect is NoExecute: the pod may be scheduled onto (or stay on) the tainted node for this many seconds before being evicted
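A minimal sketch of tolerationSeconds, reusing the tag=test:NoExecute taint from this demo (the 60 is illustrative). With this fragment of pod.spec, the pod may run on the tainted node for 60 seconds before being evicted:
tolerations:
- key: "tag"
  operator: "Equal"
  value: "test"
  effect: "NoExecute"
  tolerationSeconds: 60 # evicted 60s after the NoExecute taint takes effect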
Example:
[root@k8s-master ~]# kubectl get node
NAME STATUS ROLES AGE VERSION
k8s-master Ready master 9d v1.17.4
k8s-node01 Ready <none> 9d v1.17.4
k8s-node02 NotReady <none> 9d v1.17.4
[root@k8s-master ~]# kubectl describe node k8s-master | grep Taints
Taints: node-role.kubernetes.io/master:NoSchedule
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: tag=test:NoExecute
# k8s-master's taint effect is NoSchedule, k8s-node02 is unavailable, and k8s-node01's taint effect is NoExecute
# Create a pod with no toleration
[root@k8s-master ~]# vim pod-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-toleration
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
[root@k8s-master ~]# kubectl create -f pod-toleration.yaml
pod/pod-toleration created
# It cannot be scheduled
[root@k8s-master ~]# kubectl get pod pod-toleration -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-toleration 0/1 Pending 0 47s <none> <none> <none> <none>
# Check k8s-node01's taint
[root@k8s-master ~]# kubectl describe node k8s-node01 | grep Taints
Taints: tag=test:NoExecute
# Modify the manifest and add a toleration to the pod
[root@k8s-master ~]# kubectl delete -f pod-toleration.yaml
pod "pod-toleration" deleted
[root@k8s-master ~]# vim pod-toleration.yaml
apiVersion: v1
kind: Pod
metadata:
name: pod-toleration
namespace: test
spec:
containers:
- name: nginx
image: nginx:1.17.1
  tolerations: # add a toleration
  - key: "tag" # key of the taint to tolerate
    operator: "Equal" # operator
    value: "test" # value of the taint to tolerate
    effect: "NoExecute" # must match the effect of the taint on the node
# This tolerates the taint tag=test:NoExecute on node01, so the pod can be scheduled onto node01
# Create
[root@k8s-master ~]# kubectl create -f pod-toleration.yaml
pod/pod-toleration created
# Now the pod schedules and runs normally
[root@k8s-master ~]# kubectl get pod pod-toleration -n test -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-toleration 1/1 Running 0 15s 10.244.1.56 k8s-node01 <none> <none>