nvidia-device-plugin Daemonset custom tolerations #17329

Open
@yuzhouliu9

/kind feature

1. Describe IN DETAIL the feature/behavior/change you would like to see.
After following the GPU Support doc and adding the following to the Cluster spec:

  containerd:
    nvidiaGPU:
      enabled: true

kOps installs the nvidia-device-plugin DaemonSet, but with only a single toleration:

      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

There doesn't seem to be a way to add custom tolerations, or a wider tolerate-all toleration like the one below:

  tolerations:
  # This toleration will allow the gpu hook to run anywhere
  #   By default this is permissive in case you have tainted your GPU nodes.
  - operator: "Exists"

Can kOps expose the ability to add custom tolerations to the nvidia-device-plugin DaemonSet?
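
To make the request concrete, here is one possible shape for it. This is only a sketch: the `tolerations` field under `nvidiaGPU` is hypothetical and does not exist in kOps today, and the pass-through behavior described in the comment is an assumption about how it could work.

```yaml
  containerd:
    nvidiaGPU:
      enabled: true
      # HYPOTHETICAL field, not part of the current kOps API.
      # Assumed behavior: copied as-is into the nvidia-device-plugin DaemonSet.
      tolerations:
      - operator: Exists
```

Any other name or location in the spec would be fine as long as the tolerations end up on the DaemonSet's pod template.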

2. Feel free to provide a design supporting your feature request.
We have many different node types: AMD64, ARM64, and different GPU types.
We taint the nodes with different combinations of taints, for example:

  taints:
  - kubernetes.io/arch=arm64:NoSchedule
  - nvidia.com/gpu:NoSchedule

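Currently this won't work: the nvidia-device-plugin DaemonSet only tolerates nvidia.com/gpu, so its pods are not scheduled onto nodes that also carry the kubernetes.io/arch=arm64:NoSchedule taint.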
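
As a rough sketch, for the example taints above the DaemonSet would need tolerations along these lines. The first entry is the one kOps already sets; the second is the one that is missing today. A pod has to tolerate every NoSchedule taint on a node before it can be scheduled there.

```yaml
      tolerations:
      # Already present in the DaemonSet installed by kOps.
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      # Additionally needed for the arm64 taint in the example above.
      - effect: NoSchedule
        key: kubernetes.io/arch
        operator: Equal
        value: arm64
```

A single `operator: Exists` toleration with no key would cover both taints as well, at the cost of tolerating every other taint too.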
