CCM LoadBalancer flake in 5,000 node job #753

Open
BenTheElder opened this issue Aug 13, 2024 · 7 comments
Labels
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
  • sig/cloud-provider: Categorizes an issue or PR as relevant to SIG Cloud Provider.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@BenTheElder
Member

In a 5k-node CI job we have a test failure that seems to be caused by the load balancer controller in CCM failing to handle an unexpected GCP API error. (Thanks @danwinship for digging into this here: https://kubernetes.slack.com/archives/CN0K3TE2C/p1723560482393589?thread_ts=1723493683.959229&cid=CN0K3TE2C)

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488

->
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/artifacts/master-and-node-logs.link.txt
->
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/
->
https://storage.googleapis.com/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/gce-scale-cluster-master/cloud-controller-manager.log

Per @danwinship :

The CCM log shows a 502 error from a cloud API at 17:42:42.656992, and then shows
E0812 18:42:37.300236 11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-9672, affinity-lb-esipp-transition, a1b2bc4622d1041aeabe57d2c40cd9bd, us-east1), err: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded
an hour later (not clear if that's triggered by the e2e test doing cleanup or a separate identical timeout)
So this looks like cloud-provider-gcp failing to handle an unexpected google cloud api error

/sig scalability
/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 13, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danwinship
Contributor

The CCM log shows a 502 error from a cloud API at 17:42:42.656992

I didn't want to paste the whole thing into slack, but:

E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-8109, lb-finalizer, a6fd83aa051064545afde22320536931, us-east1), err: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
E0812 17:42:42.657030      11 controller.go:298] error processing service loadbalancers-8109/lb-finalizer (retrying with exponential backoff): failed to ensure load balancer: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
I0812 17:42:42.657074      11 event.go:389] "Event occurred" object="loadbalancers-8109/lb-finalizer" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
	Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
	<html lang=en>
	  <meta charset=utf-8>
	  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
	  <title>Error 502 (Server Error)!!1</title>
	  <style>
	    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
	  </style>
	  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
	  <p><b>502.</b> <ins>That’s an error.</ins>
	  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
 >

@aojea
Member

aojea commented Aug 16, 2024

The test has a one-hour timeout for large clusters. It was not able to provision the load balancer within that hour, so it failed (loadbalancers-9672).

The 502 error from a cloud API at 17:42:42.656992 is for a different load balancer, loadbalancers-8109:

E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-8109, lb-finalizer, a6fd83aa051064545afde22320536931, us-east1), err: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>

At 18:42 the context starts to be cancelled:

E0812 18:42:37.300236      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-9672, affinity-lb-esipp-transition, a1b2bc4622d1041aeabe57d2c40cd9bd, us-east1), err: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded
E0812 18:42:37.300274      11 controller.go:298] error processing service loadbalancers-9672/affinity-lb-esipp-transition (retrying with exponential backoff): failed to ensure load balancer: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded
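
To check how far apart the two failures are, the klog timestamps can be compared directly. The following is a minimal sketch, not part of the CCM code: `klogTime` and `gapBetween` are hypothetical helper names, and it assumes both lines come from the same day and use the standard klog header layout (`Lmmdd hh:mm:ss.uuuuuu`).

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// klogLayout matches the "mmdd hh:mm:ss.uuuuuu" part of a klog header.
// The header carries no year, so parsed times are only comparable to each other.
const klogLayout = "0102 15:04:05.000000"

// klogTime (hypothetical helper) extracts the timestamp from a klog line such as
// "E0812 17:42:42.656992      11 gce_loadbalancer.go:206] ...".
func klogTime(line string) (time.Time, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 || len(fields[0]) != 5 {
		return time.Time{}, fmt.Errorf("not a klog line: %q", line)
	}
	// fields[0] is the severity letter followed by mmdd; drop the letter.
	return time.Parse(klogLayout, fields[0][1:]+" "+fields[1])
}

// gapBetween (hypothetical helper) returns the time elapsed from the first
// klog line to the second.
func gapBetween(first, second string) (time.Duration, error) {
	t1, err := klogTime(first)
	if err != nil {
		return 0, err
	}
	t2, err := klogTime(second)
	if err != nil {
		return 0, err
	}
	return t2.Sub(t1), nil
}

func main() {
	// Headers copied from the two log lines quoted in this thread; bodies elided.
	first := "E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer"
	second := "E0812 18:42:37.300236      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer"

	gap, err := gapBetween(first, second)
	if err != nil {
		panic(err)
	}
	fmt.Println("gap between the 502 and the deadline error:", gap)
}
```

The gap is just under one hour, but since the two lines refer to different services (loadbalancers-8109 and loadbalancers-9672), the timing alone does not show that the 502 caused the later timeout.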

I see someone internally is analysing it. At first sight it seems something got stuck in GCE … == infra issue

@bowei
Member

bowei commented Aug 17, 2024

I'm checking with some people on the internal infra to see if there is anything that is happening that is out of the ordinary.

@aojea
Member

aojea commented Aug 31, 2024

@bowei, independently of that, can we make the controller more resilient by retrying, or make the failure more evident?

A one-hour timeout seems very long for a single operation.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 29, 2024