CCM LoadBalancer flake in 5,000 node job #753
Comments
This issue is currently awaiting triage. If the repository maintainers determine this is a relevant issue, they will accept it by applying the appropriate label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
I didn't want to paste the whole thing into slack, but:
|
The test has a one-hour timeout for large clusters. It is not able to provision the load balancer for loadbalancers-9672 within that hour, so it fails. The 502 error from a cloud API at 17:42:42.656992 belongs to a different load balancer, loadbalancers-8109.
At 18:42 the contexts start to be cancelled.
I see someone internally is analysing it. At first sight it seems something got stuck in GCE, i.e. an infra issue. |
I'm checking with some people on the internal infra to see if there is anything that is happening that is out of the ordinary. |
@bowei, independently of that, can we make the controller more resilient by retrying, or make the failure more evident? A one-hour timeout seems very long for a single operation. |
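To make the suggestion above concrete, here is a minimal sketch of bounded retry with exponential backoff that fails loudly instead of waiting silently for an outer timeout. It is written in Python purely for illustration and is not the CCM's actual code. `TransientCloudError`, `ProvisioningFailed`, and `retry_with_backoff` are hypothetical names, and the delays and deadline are assumed values.

```python
import time


class TransientCloudError(Exception):
    """Hypothetical stand-in for a retryable cloud API failure (e.g. HTTP 502)."""

    def __init__(self, status_code):
        super().__init__(f"cloud API returned {status_code}")
        self.status_code = status_code


class ProvisioningFailed(Exception):
    """Hypothetical terminal error raised once the retry budget is exhausted."""


def retry_with_backoff(operation, *, deadline_seconds, initial_delay=1.0,
                       max_delay=60.0, factor=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the deadline would be exceeded.

    Only TransientCloudError is retried; any other exception propagates
    immediately. The sleep and clock parameters are injectable for testing.
    """
    start = clock()
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation()
        except TransientCloudError as err:
            elapsed = clock() - start
            # Give up explicitly rather than sleeping past the deadline.
            if elapsed + delay > deadline_seconds:
                raise ProvisioningFailed(
                    f"giving up after {attempts} attempts "
                    f"and {elapsed:.1f}s: {err}"
                ) from err
            sleep(delay)
            # Exponential backoff, capped at max_delay.
            delay = min(delay * factor, max_delay)
```

The point of the sketch is the shape: a transient error such as a 502 is retried a bounded number of times, and the terminal error carries the attempt count and the last cause, so the failure is visible well before a one-hour test timeout expires.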
The Kubernetes project currently lacks enough contributors to adequately respond to all issues, so this bot triages un-triaged issues automatically.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues, so this bot triages un-triaged issues automatically.
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten |
In a 5k-node CI job we have a test failure that seems to be related to the load balancer controller in CCM failing to handle an unexpected GCP API error (thanks @danwinship for digging into this here: https://kubernetes.slack.com/archives/CN0K3TE2C/p1723560482393589?thread_ts=1723493683.959229&cid=CN0K3TE2C).
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488
->
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/artifacts/master-and-node-logs.link.txt
->
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/
->
https://storage.googleapis.com/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/gce-scale-cluster-master/cloud-controller-manager.log
Per @danwinship :
/sig scalability
/sig cloud-provider