
ci: restart cilium ds after node restart #3621


Draft: wants to merge 1 commit into master
Conversation

@camrynl (Contributor) commented May 1, 2025

Reason for Change:

Update the release-testing CI: restart the Cilium DaemonSet after node restarts to clean up old states/endpoints before entering the state-file check.
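For context, a minimal sketch of the kind of restart step described above, assuming the Cilium agent runs as a DaemonSet named cilium in the kube-system namespace (names and timeout are assumptions, not taken from the actual pipeline change):

# Restart the Cilium DaemonSet only if it is installed, then wait for the rollout to complete
if kubectl get daemonset cilium -n kube-system >/dev/null 2>&1; then
  kubectl rollout restart daemonset/cilium -n kube-system
  kubectl rollout status daemonset/cilium -n kube-system --timeout=5m
fi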

Issue Fixed:

Requirements:

Notes:

@camrynl camrynl added the ci Infra or tooling. label May 1, 2025
@Copilot Copilot AI review requested due to automatic review settings May 1, 2025 22:17
@camrynl camrynl requested a review from a team as a code owner May 1, 2025 22:17
@camrynl camrynl requested a review from snguyen64 May 1, 2025 22:17
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@camrynl (Contributor, Author) commented May 1, 2025

/azp run Azure Container Networking PR


Azure Pipelines successfully started running 1 pipeline(s).

@@ -34,8 +34,17 @@ steps:
done
fi

# Restart cilium if it is installed, bpf maps and endpoint states can be stale after a node restart (versions < v1.17)
Contributor:
If the node is restarting, won't that automatically restart Cilium?

Contributor Author:

Yes, Cilium is initially restarted, but somehow the old endpoints/states are not all reset.
The validate-state scenario is failing, and the clusters have 1-2 pods marked Unknown whose Cilium endpoints are left in waiting-for-identity.

The agent logs show it seems to be stuck in a create/delete loop:

time="2025-04-30T16:54:27Z" level=warning msg="Cancelled endpoint create request due to receiving endpoint delete request" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Unable to release endpoint ID" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x cniAttachmentID="d94a49c4398b79442842dfc1d17793d6a9abf9dc5e618f769a62123fc5d9b0e3:eth0" error="Unable to release endpoint ID 219" state=waiting-for-identity subsys=endpoint-manager
time="2025-04-30T16:54:27Z" level=info msg="Removed endpoint" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint
time="2025-04-30T16:54:27Z" level=warning msg="Ignoring error while deleting endpoint" endpointID=219 error="Unable to delete key 192.168.7.61:0 from /sys/fs/bpf/tc/globals/cilium_lxc: unable to delete element 192.168.7.61:0 from map cilium_lxc: delete: key does not exist" subsys=daemon
time="2025-04-30T16:54:27Z" level=warning msg="Error changing endpoint identity" ciliumEndpointName=kube-system/metrics-server-6ddd769d66-tp48x containerID=d94a49c439 containerInterface= datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=219 error="unable to resolve identity: exponential backoff cancelled via context: context canceled" identityLabels="k8s:io.cilium.k8s.namespace.labels.addonmanager.kubernetes.io/mode=Reconcile,k8s:io.cilium.k8s.namespace.labels.control-plane=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.azure.com/managedby=aks,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/cluster-service=true,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=metrics-server,k8s:io.kubernetes.pod.namespace=kube-system,k8s:k8s-app=metrics-server,k8s:kubernetes.azure.com/managedby=aks" ipv4=192.168.7.61 ipv6= k8sPodName=kube-system/metrics-server-6ddd769d66-tp48x subsys=endpoint

There is an attempt to delete the endpoint, but the key doesn't exist in the map, so the error is ignored and the agent keeps reattempting to create the endpoint.
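One way to confirm this kind of stale state is to compare the agent's endpoint view with the datapath map. A sketch, assuming exec access to a Cilium agent pod; on newer Cilium releases the in-pod CLI is cilium-dbg rather than cilium:

# Endpoints stuck after the node restart show up with state waiting-for-identity
kubectl -n kube-system exec ds/cilium -- cilium endpoint list
# The datapath view of the same endpoints lives in the cilium_lxc BPF map
kubectl -n kube-system exec ds/cilium -- cilium bpf lxc list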

When I delete the broken pods or restart Cilium, the issue goes away and the endpoints are restored.
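For reference, that manual workaround looks roughly like this (using the pod from the logs above as an example):

# Either delete the affected pod so its endpoint is recreated from scratch ...
kubectl -n kube-system delete pod metrics-server-6ddd769d66-tp48x
# ... or restart the Cilium DaemonSet, which is what this CI change automates
kubectl -n kube-system rollout restart daemonset/cilium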

Contributor:

Can we investigate why the pod status is Unknown? The concern is that we are now restarting CA twice just to make the test pass. There might be an underlying issue that we are missing here.

"somehow the old endpoints/states are not all reset."

Weird, looks like an issue in CA then.

Contributor Author:

Yes, will investigate further. Marking this change as a draft for now.

@camrynl camrynl marked this pull request as draft May 2, 2025 16:47