ci: restart cilium ds after node restart #3621


Draft · wants to merge 1 commit into master
@@ -34,8 +34,17 @@ steps:
done
fi

# Restart cilium if it is installed; bpf maps and endpoint states can be stale after a node restart (versions < v1.17)
Contributor:
If the node is restarting, won't that automatically restart Cilium?

Contributor Author:

Yes, Cilium is initially restarted, but somehow the old endpoints/states are not all reset.
The validate-state scenario is failing: the clusters have 1-2 pods marked Unknown, and their Cilium endpoints are left in waiting-for-identity.

The agent logs suggest it is stuck in a create/delete loop:

time="2025-04-30T16:54:27Z" level=warning msg="Cancelled endpoint create request due to receiving endpoint delete request" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Unable to release endpoint ID" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x cniAttachmentID="d94a49c4398b79442842dfc1d17793d6a9abf9dc5e618f769a62123fc5d9b0e3:eth0" error="Unable to release endpoint ID 219" state=waiting-for-identity subsys=endpoint-manager
time="2025-04-30T16:54:27Z" level=info msg="Removed endpoint" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint
time="2025-04-30T16:54:27Z" level=warning msg="Ignoring error while deleting endpoint" endpointID=219 error="Unable to delete key 192.168.7.61:0 from /sys/fs/bpf/tc/globals/cilium_lxc: unable to delete element 192.168.7.61:0 from map cilium_lxc: delete: key does not exist" subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Error changing endpoint identity" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 error="unable to resolve identity: exponential backoff cancelled via context: context canceled" identityLabels="k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile,k8s:io.cilium.k8s.namespace.labels.control-plane=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=metrics-server,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=metrics-server,k8s:kubernetes.azure.com/managedby=aks" ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint

There is an attempt to delete the endpoint, but the key doesn't exist in the map, so the error is ignored and the agent keeps reattempting to create the endpoint.

When I delete the broken pods or restart Cilium, the issue goes away and the endpoints are restored.
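
For reference, a quick way to spot the affected pods and endpoints; this is a rough sketch, assuming the CiliumEndpoint CRD is installed and that its .status.state field reports the endpoint state:

# List pods whose STATUS column is Unknown (column 4 of `kubectl get pods -A`).
kubectl get pods -A --no-headers | awk '$4 == "Unknown"'

# List CiliumEndpoints that are not in the "ready" state, e.g. waiting-for-identity.
kubectl get ciliumendpoints -A --no-headers \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATE:.status.state \
  | grep -v 'ready$'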

Contributor:

Can we investigate why the pod status is Unknown? The concern is that we are now restarting CA twice just to make the test happy. There might be an underlying issue that we are missing here.

"somehow the old endpoints/states are not all reset."

Weird, looks like an issue in CA then.

Contributor Author:

Yes, will investigate further. Marking this change as draft for now.
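
One way to dig in would be to compare the agent's endpoint view with the bpf map it failed to delete from. A rough sketch (the container name cilium-agent and the pod selection below are assumptions, not part of this change):

# Pick a cilium pod (ideally the one on the node with the broken endpoints).
CILIUM_POD=$(kubectl -n kube-system get pods -l k8s-app=cilium -o name | head -n 1)

# Endpoint state as the agent sees it (look for waiting-for-identity entries).
kubectl -n kube-system exec "$CILIUM_POD" -c cilium-agent -- cilium endpoint list

# Entries in the datapath endpoint map; a missing entry here would match the
# "delete: key does not exist" error in the logs above.
kubectl -n kube-system exec "$CILIUM_POD" -c cilium-agent -- cilium bpf endpoint list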

if [ ${{ parameters.cni }} = 'cilium' ]; then
echo "Restart Cilium and ensure it is ready and available. "
kubectl rollout restart ds -n kube-system cilium
kubectl rollout status ds -n kube-system cilium
kubectl get pods -n kube-system -l k8s-app=cilium -owide
fi

echo "Ensure Load-Test deployment pods are marked as ready"
kubectl rollout status deploy -n load-test

name: "RestartNodes"
displayName: "Restart Nodes"
