
ci: restart cilium ds after node restart #3621


Draft: wants to merge 1 commit into master
Conversation

@camrynl (Contributor) commented May 1, 2025

Reason for Change:

Update the release-testing CI: restart the Cilium DaemonSet after node restarts to clean up old states/endpoints before entering the state-file check.
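For context, a minimal sketch of the kind of restart step described above, assuming the Cilium agent runs as a DaemonSet named cilium in the kube-system namespace (names and timeout are assumptions, not taken from the actual pipeline change):

# Restart the Cilium DaemonSet only if it is installed, then wait for the rollout to complete
if kubectl get daemonset cilium -n kube-system >/dev/null 2>&1; then
  kubectl rollout restart daemonset/cilium -n kube-system
  kubectl rollout status daemonset/cilium -n kube-system --timeout=5m
fi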

Issue Fixed:

Requirements:

Notes:

@camrynl camrynl added the ci Infra or tooling. label May 1, 2025
@Copilot Copilot AI review requested due to automatic review settings May 1, 2025 22:17
@camrynl camrynl requested a review from a team as a code owner May 1, 2025 22:17
@camrynl camrynl requested a review from snguyen64 May 1, 2025 22:17
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@camrynl (Contributor, Author) commented May 1, 2025

/azp run Azure Container Networking PR


Azure Pipelines successfully started running 1 pipeline(s).

@@ -34,8 +34,17 @@ steps:
done
fi

# Restart cilium if it is installed, bpf maps and endpoint states can be stale after a node restart (versions < v1.17)
Contributor:
If the node is restarting, won't that automatically restart Cilium?

Contributor Author:

Yes, Cilium is initially restarted, but somehow the old endpoints/states are not all reset.
The validate-state scenario is failing, and the clusters have 1-2 pods marked Unknown whose Cilium endpoints are left in waiting-for-identity.

The agent logs show it seems to be stuck in a create/delete loop:

time="2025-04-30T16:54:27Z" level=warning msg="Cancelled endpoint create request due to receiving endpoint delete request" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Unable to release endpoint ID" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x cniAttachmentID="d94a49c4398b79442842dfc1d17793d6a9abf9dc5e618f769a62123fc5d9b0e3:eth0" error="Unable to release endpoint ID 219" state=waiting-for-identity subsys=endpoint-manager
time="2025-04-30T16:54:27Z" level=info msg="Removed endpoint" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint
time="2025-04-30T16:54:27Z" level=warning msg="Ignoring error while deleting endpoint" endpointID=219 error="Unable to delete key 192.168.7.61:0 from /sys/fs/bpf/tc/globals/cilium_lxc: unable to delete element 192.168.7.61:0 from map cilium_lxc: delete: key does not exist" subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Error changing endpoint identity" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 error="unable to resolve identity: exponential backoff cancelled via context: context canceled" identityLabels="k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile,k8s:io.cilium.k8s.namespace.labels.control-plane=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=metrics-server,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=metrics-server,k8s:kubernetes.azure.com/managedby=aks" ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint

There is an attempt to delete the endpoint, but the key doesn't exist in the map, so the error is ignored and the agent keeps reattempting to create the endpoint.
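One way to confirm this kind of stale state is to compare the agent's endpoint view with the datapath map. A sketch, assuming exec access to a Cilium agent pod; on newer Cilium releases the in-pod CLI is cilium-dbg rather than cilium:

# Endpoints stuck after the node restart show up with state waiting-for-identity
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
# The datapath view of the same endpoints lives in the cilium_lxc BPF map
kubectl -n kube-system exec ds/cilium -- cilium bpf lxc list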

When I delete the broken pods or restart Cilium, the issue goes away and the endpoints are restored.
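For reference, that manual workaround looks roughly like this (using the pod from the logs above as an example):

# Either delete the affected pod so its endpoint is recreated from scratch ...
kubectl -n kube-system delete pod metrics-server-6ddd769d66-tp48x
# ... or restart the Cilium DaemonSet, which is what this CI change automates
kubectl -n kube-system rollout restart daemonset/cilium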

Contributor:

Can we investigate why the pod status is Unknown? The concern is that we are now restarting CA twice just to make the test pass. There might be an underlying issue that we are missing here.

"somehow the old endpoints/states are not all reset."

Weird, looks like an issue in CA then.

Contributor Author:

Yes, will investigate further. Marking this change as a draft for now.

@camrynl camrynl marked this pull request as draft May 2, 2025 16:47