CCM LoadBalancer flake in 5,000 node job #753

BenTheElder · 2024-08-13T16:08:03Z

In a 5k node CI job we have a test failure that seems to be related to the loadbalancer controller in CCM failing to handle an unexpected GCP API error (thanks @danwinship for digging into this here: https://kubernetes.slack.com/archives/CN0K3TE2C/p1723560482393589?thread_ts=1723493683.959229&cid=CN0K3TE2C)

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488

->
https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/artifacts/master-and-node-logs.link.txt
->
https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/
->
https://storage.googleapis.com/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-correctness/1823042209046335488/gce-scale-cluster-master/cloud-controller-manager.log

Per @danwinship :

The CCM log shows a 502 error from a cloud API at 17:42:42.656992, and then shows
E0812 18:42:37.300236 11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-9672, affinity-lb-esipp-transition, a1b2bc4622d1041aeabe57d2c40cd9bd, us-east1), err: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded
an hour later (not clear if that's triggered by the e2e test doing cleanup or a separate identical timeout)
So this looks like cloud-provider-gcp failing to handle an unexpected google cloud api error
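
For anyone wanting to check the "an hour later" gap between the two timestamps quoted above, here is a minimal Go sketch. It assumes the standard klog header layout (`Lmmdd hh:mm:ss.uuuuuu`); `parseKlogTime` is a hypothetical helper written for this illustration, not something from cloud-provider-gcp.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// klogLayout matches the "mmdd hh:mm:ss.uuuuuu" part of a klog header.
// klog omits the year, so parsed times are only meaningful relative to
// each other within the same year.
const klogLayout = "0102 15:04:05.000000"

// parseKlogTime (hypothetical helper) extracts the timestamp from a klog
// line such as "E0812 17:42:42.656992      11 gce_loadbalancer.go:206] ...".
func parseKlogTime(line string) (time.Time, error) {
	fields := strings.Fields(line)
	if len(fields) < 2 || len(fields[0]) != 5 {
		return time.Time{}, fmt.Errorf("not a klog line: %q", line)
	}
	// fields[0] is severity + mmdd (for example "E0812"); drop the severity.
	return time.Parse(klogLayout, fields[0][1:]+" "+fields[1])
}

func main() {
	// Only the headers of the two log lines are reproduced here.
	first, err := parseKlogTime("E0812 17:42:42.656992      11 gce_loadbalancer.go:206]")
	if err != nil {
		panic(err)
	}
	second, err := parseKlogTime("E0812 18:42:37.300236 11 gce_loadbalancer.go:206]")
	if err != nil {
		panic(err)
	}
	fmt.Println(second.Sub(first)) // prints 59m54.643244s
}
```

The gap is just under one hour. That is suggestive of a one-hour deadline, but the sketch only measures the interval; it says nothing about what triggered the second error.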

/sig scalability
/sig cloud-provider

k8s-ci-robot · 2024-08-13T16:08:11Z

This issue is currently awaiting triage.

If the repository maintainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

danwinship · 2024-08-13T23:15:19Z

The CCM log shows a 502 error from a cloud API at 17:42:42.656992

I didn't want to paste the whole thing into slack, but:

E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-8109, lb-finalizer, a6fd83aa051064545afde22320536931, us-east1), err: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
E0812 17:42:42.657030      11 controller.go:298] error processing service loadbalancers-8109/lb-finalizer (retrying with exponential backoff): failed to ensure load balancer: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
I0812 17:42:42.657074      11 event.go:389] "Event occurred" object="loadbalancers-8109/lb-finalizer" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message=<
	Error syncing load balancer: failed to ensure load balancer: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>
	<html lang=en>
	  <meta charset=utf-8>
	  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
	  <title>Error 502 (Server Error)!!1</title>
	  <style>
	    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
	  </style>
	  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
	  <p><b>502.</b> <ins>That’s an error.</ins>
	  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>
 >

aojea · 2024-08-16T16:50:44Z

The test has a one-hour timeout for large clusters. It was not able to provision the load balancer within one hour, so loadbalancers-9672 failed.

The 502 error from a cloud API at 17:42:42.656992 is from a different load balancer, loadbalancers-8109:

E0812 17:42:42.656992      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-8109, lb-finalizer, a6fd83aa051064545afde22320536931, us-east1), err: failed to create forwarding rule for load balancer (a6fd83aa051064545afde22320536931(loadbalancers-8109/lb-finalizer)): googleapi: got HTTP response code 502 with body: <!DOCTYPE html>

At 18:42 the contexts start to be cancelled:

E0812 18:42:37.300236      11 gce_loadbalancer.go:206] Failed to EnsureLoadBalancer(gce-scale-cluster, loadbalancers-9672, affinity-lb-esipp-transition, a1b2bc4622d1041aeabe57d2c40cd9bd, us-east1), err: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded
E0812 18:42:37.300274      11 controller.go:298] error processing service loadbalancers-9672/affinity-lb-esipp-transition (retrying with exponential backoff): failed to ensure load balancer: failed to create forwarding rule for load balancer (a1b2bc4622d1041aeabe57d2c40cd9bd(loadbalancers-9672/affinity-lb-esipp-transition)): context deadline exceeded

I see someone internally is analysing it. At first sight it seems something got stuck in GCE, i.e. an infra issue.

bowei · 2024-08-17T00:39:59Z

I'm checking with some people on the internal infra to see if there is anything that is happening that is out of the ordinary.

aojea · 2024-08-31T11:30:22Z

@bowei, independently, can we make the controller more resilient by retrying, or make the failure more evident?

A 1-hour timeout seems very long for a single operation.

k8s-triage-robot · 2024-11-29T11:56:13Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-12-29T12:31:59Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle rotten
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the sig/scalability, sig/cloud-provider, and needs-triage labels on Aug 13, 2024

k8s-ci-robot added the lifecycle/stale label on Nov 29, 2024

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Dec 29, 2024
