ci: restart cilium ds after node restart #3621
base: master
Conversation
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
/azp run Azure Container Networking PR
Azure Pipelines successfully started running 1 pipeline(s).
@@ -34,8 +34,17 @@ steps:
done
fi

# Restart cilium if it is installed, bpf maps and endpoint states can be stale after a node restart (versions < v1.17)
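The body of the added step is not visible in the excerpt above; as a minimal sketch, assuming the daemonset is named cilium in the kube-system namespace, the restart step could look roughly like this:

```bash
# Restart cilium if it is installed; bpf maps and endpoint states can be
# stale after a node restart (versions < v1.17).
if kubectl get daemonset cilium -n kube-system >/dev/null 2>&1; then
  kubectl -n kube-system rollout restart daemonset cilium
  # Wait for the restarted agents before continuing to the state file check.
  kubectl -n kube-system rollout status daemonset cilium --timeout=5m
fi
```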
If the node is restarting, won't that automatically restart Cilium?
Yes, Cilium is initially restarted, but somehow the old endpoints/states are not all reset.
The validate state scenario is failing and the clusters have 1-2 pods marked Unknown, with their cilium endpoints left in waiting-for-identity.
The agent logs show it stuck in a create/delete loop:
time="2025-04-30T16:54:27Z" level=warning msg="Cancelled endpoint create request due to receiving endpoint delete request" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Unable to release endpoint ID" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x cniAttachmentID="d94a49c4398b79442842dfc1d17793d6a9abf9dc5e618f769a62123fc5d9b0e3:eth0" error="Unable to release endpoint ID 219" state=waiting-for-identity subsys=endpoint-manager
time="2025-04-30T16:54:27Z" level=info msg="Removed endpoint" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint
time="2025-04-30T16:54:27Z" level=warning msg="Ignoring error while deleting endpoint" endpointID=219 error="Unable to delete key 192.168.7.61:0 from /sys/fs/bpf/tc/globals/cilium_lxc: unable to delete element 192.168.7.61:0 from map cilium_lxc: delete: key does not exist" subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Error changing endpoint identity" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 error="unable to resolve identity: exponential backoff cancelled via context: context canceled" identityLabels="k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile,k8s:io.cilium.k8s.namespace.labels.control-plane=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=metrics-server,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=metrics-server,k8s:kubernetes.azure.com/managedby=aks" ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint
There is an attempt to delete the endpoint, but it doesn't exist in the map, so the error is ignored and the agent keeps reattempting to create the endpoint.
When I delete the broken pods or restart cilium, the issue goes away and the endpoints are restored.
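For reference, a sketch of how one might inspect the stuck endpoints (the agent pod name and the cilium-agent container name are assumptions, not taken from this PR):

```bash
# List endpoint states on the cilium agent of the affected node
# (pod name below is illustrative).
kubectl -n kube-system exec cilium-abc12 -c cilium-agent -- cilium endpoint list
# Endpoints belonging to the Unknown pods show STATE "waiting-for-identity".
# Deleting those pods or restarting the cilium daemonset restores them:
kubectl -n kube-system rollout restart daemonset cilium
```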
Can we investigate why the pod status is Unknown? The concern is that we are now restarting CA two times to make the test happy; there might be an underlying issue which we are missing here.
> somehow the old endpoints/states are not all reset.

Weird, looks like an issue in CA then.
Yes, will investigate further. Marking this change as a draft for now.
Reason for Change:
Update release testing CI: restart the cilium daemonset after node restarts to clean up old states/endpoints before entering the state file check.
Issue Fixed:
Requirements:
Notes: