Should retry some storage errors. #986

Open
gfr10598 opened this issue Apr 13, 2021 · 3 comments

Comments

@gfr10598

We are currently seeing a low rate of GCS storage errors:

2021/04/13 04:54:19 rowwriter.go:119: googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-staging ndt/ndt7/2020/08/27/20200827T170704.505210Z-ndt7-mlab3-lhr05-ndt.tgz.json

These would likely succeed on retry.

@autolabel autolabel bot added the review/triage Team should review and assign priority label Apr 13, 2021
@gfr10598

gfr10598 commented Apr 14, 2021

[Image: Write failure errors]

After adding a retry with a 2-second delay, we are still seeing the same write errors.

@gfr10598

It looks like very little retrying is happening in the library. If I add a 20-second delay and retry, the initial attempt takes between 0 and 5 seconds, which leaves little time for any internal retries. The Write retries then fail every 20 seconds and never succeed.

There is then a later failed retry, with fewer rows, likely driven by the Flush prior to Close at the end of the archive. Not clear what happened in between. Will investigate further.

2021/04/15 04:06:48 rowwriter.go:122: Retrying after 347234 of 385862 bytes 10m3s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:07:08 rowwriter.go:122: Retrying after 0 of 965 bytes 20s googleapi: got HTTP response code 503 with body: Service Unavailable etl-mlab-sandbox/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz.json
2021/04/15 04:16:48 task.go:179: Processed 4401 files, 0 nil data, 4359 rows committed, 42 failed, from gs://archive-measurement-lab/ndt/annotation/2020/11/23/20201123T104213.627630Z-annotation-mlab1-ham02-ndt.tgz into annotation

@gfr10598

I tried running many retries, 20 seconds apart. Across half a dozen failures, none of the retries ever succeeded. The Close also fails.

Checking GCS shows that the corresponding file still exists from a previous parsing, and has not been replaced.

Likely we should abandon the partially written file, probably by cancelling the context that was used to create the object handle?

@laiyi-ohlsen laiyi-ohlsen added pipeline bug and removed review/triage Team should review and assign priority labels Jun 14, 2021