two blocks persistently failed to migrate to phys03 #115

Open
belforte opened this issue Nov 27, 2024 · 10 comments

@belforte
Member

blocks

/JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#699b90af-18b1-4ccc-a9d3-3426b2822b15
/JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#ff7e1305-3415-4bd7-8be3-14ead7bf0906

when trying to migrate them in CRAB Publisher, the migration request always fails (status=9).
In such cases the Publisher deletes the migration and submits a new one, which solves rare issues related to e.g. server restarts while migrations were in progress.
But for those two blocks every new request keeps failing, and has been for more than ten days.
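
The delete-and-resubmit cycle described above can be sketched roughly as follows. This is only an illustration with a cap on retries, not the actual CRAB Publisher code: `FakeMigrationService`, `publish_iteration`, `max_failures` and the `COMPLETED` value are hypothetical names introduced here, and only the status value 9 comes from the log in this issue.

```python
TERMINALLY_FAILED = 9  # status value seen in the Publisher log in this thread
COMPLETED = 2          # assumed value, used only by the fake service below


class FakeMigrationService:
    """Hypothetical in-memory stand-in for the migration server (not the DBS API)."""

    def __init__(self, always_failing):
        self.always_failing = set(always_failing)
        self.requests = {}  # block name -> last known status

    def submit(self, block):
        # Like the real server in the log, refuse a request that already exists.
        if block in self.requests:
            raise RuntimeError(f"migration request {block} already exists")
        failed = block in self.always_failing
        self.requests[block] = TERMINALLY_FAILED if failed else COMPLETED

    def status(self, block):
        return self.requests.get(block)

    def remove(self, block):
        self.requests.pop(block, None)


def publish_iteration(service, blocks, failures, max_failures=3):
    """Run one iteration; return the blocks skipped after too many failures."""
    skipped = []
    for block in blocks:
        if failures.get(block, 0) >= max_failures:
            skipped.append(block)  # give up instead of retrying forever
            continue
        try:
            service.submit(block)
        except RuntimeError:
            # Request already exists: if it is terminally failed,
            # delete it so the next iteration can submit a fresh one.
            if service.status(block) == TERMINALLY_FAILED:
                service.remove(block)
                failures[block] = failures.get(block, 0) + 1
    return skipped


service = FakeMigrationService(always_failing={"bad-block"})
failures = {}
for _ in range(7):
    skipped = publish_iteration(service, ["good-block", "bad-block"], failures)
print(skipped)
```

Counting failures per block is one possible way to keep a persistently failing block from being retried indefinitely; the threshold of 3 is arbitrary.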

Can you look in logs from your side ?

Latest example:

2024-11-27 14:50:26,509:INFO: Submitting 2 block migration requests to DBS3 ...
2024-11-27 14:50:26,509:INFO: Submitting migration request for block /JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#ff7e1305-3415-4bd7-8be3-14ead7bf0906 ...
2024-11-27 14:50:26,965:ERROR: Migration request refused by server.
2024-11-27 14:50:26,965:ERROR: Migration refusal reason: migration request /JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#ff7e1305-3415-4bd7-8be3-14ead7bf0906 is already exist in DB with id=4889789
2024-11-27 14:50:27,043:INFO: Existing migration id=4889789 is terminally failed (status=9)Delete it and try again at next iteration
2024-11-27 14:50:27,439:INFO: Submitting migration request for block /JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#699b90af-18b1-4ccc-a9d3-3426b2822b15 ...
2024-11-27 14:50:27,859:ERROR: Migration request refused by server.
2024-11-27 14:50:27,859:ERROR: Migration refusal reason: migration request /JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#699b90af-18b1-4ccc-a9d3-3426b2822b15 is already exist in DB with id=4889788
2024-11-27 14:50:27,931:INFO: Existing migration id=4889788 is terminally failed (status=9)Delete it and try again at next iteration
2024-11-27 14:50:28,399:INFO: 2 block migration requests failed to be submitted.
@belforte
Member Author

belforte commented Nov 27, 2024

I looked in the logs myself and found this for the second migration ID in the sample above: 4889788

[2024-11-27 13:23:14.967700536 +0000 UTC m=+1377063.147416859] migration_server.go:72: process {MIGRATION_REQUEST_ID:4889788 MIGRATION_URL:https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader MIGRATION_INPUT:/JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#699b90af-18b1-4ccc-a9d3-3426b2822b15 MIGRATION_STATUS:1 MIGRATION_SERVER:dbs2go-phys03-migration-89b8c48d6-6xvxl CREATE_BY:/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch CREATION_DATE:1732713686 LAST_MODIFIED_BY:/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch LAST_MODIFICATION_DATE:1732713686 RETRY_COUNT:3}
[2024-11-27 13:23:14.968042727 +0000 UTC m=+1377063.147759050] migrate.go:979: process migration request 4889788
[2024-11-27 13:23:14.968236299 +0000 UTC m=+1377063.147952623] migration_requests.go:133: process migration request 4889788
[2024-11-27 13:23:14.969727133 +0000 UTC m=+1377063.149443444] migrate.go:985: {MIGRATION_REQUEST_ID:4889788 MIGRATION_URL:https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader MIGRATION_INPUT:/JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#699b90af-18b1-4ccc-a9d3-3426b2822b15 MIGRATION_STATUS:1 MIGRATION_SERVER:dbs2go-phys03-migration-89b8c48d6-6xvxl CREATE_BY:/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch CREATION_DATE:1732713686 LAST_MODIFIED_BY:/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch LAST_MODIFICATION_DATE:1732713686 RETRY_COUNT:3}
[2024-11-27 13:23:14.970099696 +0000 UTC m=+1377063.149816020] migrate.go:1378: update migration request 4889788 to status 1
[2024-11-27 13:23:15.509599491 +0000 UTC m=+1377063.689315803] migrate.go:1116: insert bulkblocks for mid 4889788 error DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/mc/Run3Winter24Reco/JPsiToMuMu_PT-0to100_pythia8-gun/AODSIM/KeepSi_133X_mcRun3_2024_realistic_v8-v2/2540000/deab1d58-e1f5-4498-8642-da601a52cb51.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set
[2024-11-27 13:23:15.510600838 +0000 UTC m=+1377063.690317150] migrate.go:1378: update migration request 4889788 to status 3
[2024-11-27 13:23:15.580274568 +0000 UTC m=+1377063.759990882] migrate.go:1131: updated migration request 4889788 with status 3
[2024-11-27 13:23:15.580662635 +0000 UTC m=+1377063.760378948] migration_server.go:93: migration process map[migration_request_id:4889788 migration_request_url:https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader] finished in 612.607899ms
[2024-11-27 13:23:16.900570515 +0000 UTC m=+4410873.072706557] migration_server.go:72: process {MIGRATION_REQUEST_ID:4889788 MIGRATION_URL:https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader MIGRATION_INPUT:/JPsiToMuMu_PT-0to100_pythia8-gun/Run3Winter24MiniAOD-KeepSi_133X_mcRun3_2024_realistic_v8-v2/MINIAODSIM#699b90af-18b1-4ccc-a9d3-3426b2822b15 MIGRATION_STATUS:1 MIGRATION_SERVER:dbs2go-phys03-migration-56cbc7d65c-56qxx CREATE_BY:/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch CREATION_DATE:1732713686 LAST_MODIFIED_BY:/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch LAST_MODIFICATION_DATE:1732713686 RETRY_COUNT:4}
[2024-11-27 13:23:16.900893794 +0000 UTC m=+4410873.073029839] migrate.go:1378: update migration request 4889788 to status 9
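
To follow one request through a log like the one above, a small filter on the "update migration request N to status S" lines may help. This is a minimal sketch assuming the log format stays as shown here; `status_history` is a hypothetical helper, and the sample lines are copied from the excerpt above.

```python
import re

# Matches e.g. "update migration request 4889788 to status 9"
STATUS_UPDATE = re.compile(r"update migration request (\d+) to status (\d+)")


def status_history(log_lines, request_id):
    """Return the sequence of statuses a migration request was moved to."""
    history = []
    for line in log_lines:
        match = STATUS_UPDATE.search(line)
        if match and int(match.group(1)) == request_id:
            history.append(int(match.group(2)))
    return history


sample = [
    "[2024-11-27 13:23:14.970099696 +0000 UTC m=+1377063.149816020] migrate.go:1378: update migration request 4889788 to status 1",
    "[2024-11-27 13:23:15.510600838 +0000 UTC m=+1377063.690317150] migrate.go:1378: update migration request 4889788 to status 3",
    "[2024-11-27 13:23:15.580274568 +0000 UTC m=+1377063.759990882] migrate.go:1131: updated migration request 4889788 with status 3",
    "[2024-11-27 13:23:16.900893794 +0000 UTC m=+4410873.072706557] migrate.go:1378: update migration request 4889788 to status 9",
]
history = status_history(sample, 4889788)
print(history)
```

For the excerpt above this gives the transitions 1, 3, 9: the request fails with status 3 and is then marked terminally failed (status 9).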

@belforte
Member Author

culprit seems to be this

Message:unable to find parent lfn /store/mc/Run3Winter24Reco/JPsiToMuMu_PT-0to100_pythia8-gun/AODSIM/KeepSi_133X_mcRun3_2024_realistic_v8-v2/2540000/deab1d58-e1f5-4498-8642-da601a52cb51.root

@belforte
Member Author

belforte commented Nov 27, 2024

that makes little sense since that file exists in DBS with valid status

https://cmsweb.cern.ch/das/request?instance=prod/global&input=block+file%3D%2Fstore%2Fmc%2FRun3Winter24Reco%2FJPsiToMuMu_PT-0to100_pythia8-gun%2FAODSIM%2FKeepSi_133X_mcRun3_2024_realistic_v8-v2%2F2540000%2Fdeab1d58-e1f5-4498-8642-da601a52cb51.root

or in short

belforte@lxplus826/~> dasgoclient --query 'file file=/store/mc/Run3Winter24Reco/JPsiToMuMu_PT-0to100_pythia8-gun/AODSIM/KeepSi_133X_mcRun3_2024_realistic_v8-v2/2540000/deab1d58-e1f5-4498-8642-da601a52cb51.root | grep file.is_file_valid'
1  
belforte@lxplus826/~> 
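
For checking other parent files the same way, the DAS link above can be rebuilt from the LFN with standard URL encoding. A minimal sketch: `das_block_url` is a hypothetical helper, and the URL layout is simply taken from the link quoted above, not from any DAS documentation.

```python
from urllib.parse import quote_plus

DAS_BASE = "https://cmsweb.cern.ch/das/request"


def das_block_url(lfn, instance="prod/global"):
    """Build a DAS web query URL asking which block contains the given file."""
    query = quote_plus(f"block file={lfn}")  # space -> '+', '/' -> %2F, '=' -> %3D
    return f"{DAS_BASE}?instance={instance}&input={query}"


lfn = (
    "/store/mc/Run3Winter24Reco/JPsiToMuMu_PT-0to100_pythia8-gun/AODSIM/"
    "KeepSi_133X_mcRun3_2024_realistic_v8-v2/2540000/"
    "deab1d58-e1f5-4498-8642-da601a52cb51.root"
)
url = das_block_url(lfn)
print(url)
```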

something wrong in DBS code ?

full error message seems to complain about SQL but it is possibly a consequence of the file not found

[2024-11-27 13:23:15.509599491 +0000 UTC m=+1377063.689315803] migrate.go:1116: insert bulkblocks for mid 4889788 error DBSError Code:101 Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently Message:unable to find parent lfn /store/mc/Run3Winter24Reco/JPsiToMuMu_PT-0to100_pythia8-gun/AODSIM/KeepSi_133X_mcRun3_2024_realistic_v8-v2/2540000/deab1d58-e1f5-4498-8642-da601a52cb51.root Error: nested DBSError Code:103 Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID Message: Error: sql: no rows in result set

that line is in vocms0755 : /cephfs/product/dbs-logs/dbs2go-phys03-migration-56cbc7d65c-56qxx.log-20241127
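
To scan a whole log file for this kind of failure, the missing parent LFN and the error codes can be pulled out of the message with a couple of regular expressions. This is a sketch assuming the DBSError text keeps the layout shown above; `parse_dbs_error` is a hypothetical helper.

```python
import re

PARENT_LFN = re.compile(r"unable to find parent lfn (\S+)")
ERROR_CODE = re.compile(r"DBSError Code:(\d+)")


def parse_dbs_error(line):
    """Return (missing parent LFN or None, list of DBSError codes, outermost first)."""
    lfn_match = PARENT_LFN.search(line)
    lfn = lfn_match.group(1) if lfn_match else None
    codes = [int(code) for code in ERROR_CODE.findall(line)]
    return lfn, codes


line = (
    "migrate.go:1116: insert bulkblocks for mid 4889788 error DBSError Code:101 "
    "Description:DBS DB error Function:dbs.bulkblocks.InsertBulkBlocksConcurrently "
    "Message:unable to find parent lfn /store/mc/Run3Winter24Reco/"
    "JPsiToMuMu_PT-0to100_pythia8-gun/AODSIM/KeepSi_133X_mcRun3_2024_realistic_v8-v2/"
    "2540000/deab1d58-e1f5-4498-8642-da601a52cb51.root Error: nested DBSError Code:103 "
    "Description:DBS DB query error, e.g. mailformed SQL statement Function:dbs.GetID "
    "Message: Error: sql: no rows in result set"
)
missing_lfn, codes = parse_dbs_error(line)
print(missing_lfn, codes)
```

The nested code 103 with "no rows in result set" is consistent with the reading above: the lookup of the parent file returned nothing, rather than the SQL itself being malformed.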

@todor-ivanov

hi @belforte looking into it

@belforte
Member Author

belforte commented Dec 6, 2024

hi @todor-ivanov
any hint on what may be going on? Should I conclude that those migrations are impossible and stop CRAB Publisher from trying? Did you try a "simple" restart of the migration server? All in all, this never happened after the initial shaking/debugging. It is a "once a year" thing, we can live with it.

@todor-ivanov

todor-ivanov commented Dec 6, 2024

hi @belforte

I was not able to spot anything more than what you had already reported. The parent lfn is indeed in DBS. As a last resort, I restarted the service a few minutes ago, as you suggested. Please let me know if the error persists. I am also watching the logs on the dbs-phys03-migration server side.

@belforte
Member Author

belforte commented Dec 6, 2024

thanks Todor, I will let you know how it goes in next iteration

@belforte
Member Author

belforte commented Dec 9, 2024

still failing. I will force Publisher to ignore the corresponding requests

@todor-ivanov

Hi @belforte how many were those?

@belforte
Member Author

belforte commented Dec 9, 2024

two tasks, retrying since the beginning of November. Publication in DBS is the default for CRAB, so in this case chances are that the user does not even care. Yet it is the first time that we have a "permanently failed" migration in all of DBS history. Until now those could always be recovered.
