Description
/kind feature
1. Describe IN DETAIL the feature/behavior/change you would like to see.
After following the GPU Support doc and adding the following to the Cluster spec:
containerd:
  nvidiaGPU:
    enabled: true
kOps installs the nvidia-device-plugin daemonset, but with only a single toleration:
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
There doesn't seem to be a way to add custom tolerations, or a broader tolerate-all toleration like the one below:
tolerations:
# This toleration will allow the gpu hook to run anywhere
# By default this is permissive in case you have tainted your GPU nodes.
- operator: "Exists"
Can kOps expose the ability to add custom tolerations to the nvidia-device-plugin daemonset?
2. Feel free to provide a design supporting your feature request.
We have many different node types: AMD64, ARM64, and different GPU types.
We taint the nodes with different combinations, for example:
taints:
- kubernetes.io/arch=arm64:NoSchedule
- nvidia.com/gpu:NoSchedule
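Currently this won't work: the device plugin pods only tolerate the nvidia.com/gpu taint, so they are never scheduled on nodes that also carry the kubernetes.io/arch=arm64:NoSchedule taint.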
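One possible shape for this, as a rough sketch only: an optional `tolerations` list under `nvidiaGPU` that kOps would copy onto the nvidia-device-plugin daemonset. The `tolerations` field here is hypothetical and is not part of the kOps API today; the name and placement are just a suggestion.

```yaml
containerd:
  nvidiaGPU:
    enabled: true
    # Hypothetical field: does not exist in kOps today.
    # The idea is that these would be applied to the
    # nvidia-device-plugin daemonset pods.
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: kubernetes.io/arch
      operator: Equal
      value: arm64
      effect: NoSchedule
```

With something like this, the device plugin could run on nodes carrying both taints from the example above. A single `- operator: Exists` entry would cover the tolerate-all case.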