Integrate IA Harvesting into Goobi #453

Open
mgeerdsen opened this issue May 30, 2023 · 1 comment

Comments

@mgeerdsen
Contributor

mgeerdsen commented May 30, 2023

The components involved in IA harvesting pre-date the cloud migration, and we should re-evaluate how they work.

Some ideas:

  • possible integration into Goobi workflow
  • better/more transparent error reporting
  • fewer steps, or at least a more continuous design from harvesting to process creation in Goobi workflow
  • there are currently two different ways to query the IA; we should look into using just one

Possible connections to #350 and #451.

@aray-wellcome

Yeah this sounds like something we should look at.

I was wondering if we should move everything to the IA CLI. It's a bit flaky: it drops the connection on a handful of items from time to time (I think because IA servers go down), but those items are usually easy to rerun. The benefit of only having to support one harvest method would probably outweigh this.
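To make the "easy to rerun" part concrete, here is a minimal sketch of a retry loop around a single harvest method. It is not the existing harvester: `harvest_with_retries`, `max_attempts` and `delay_seconds` are hypothetical names, and `download` stands in for whatever ends up wrapping the IA CLI for one identifier. The sketch assumes a dropped connection surfaces as an `OSError` (Python's `ConnectionError` is a subclass of it).

```python
import time
from typing import Callable, Dict, Iterable, List


def harvest_with_retries(
    identifiers: Iterable[str],
    download: Callable[[str], None],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Dict[str, List[str]]:
    """Try each identifier up to max_attempts times and report the outcome.

    `download` is a hypothetical callable that fetches one item and raises
    OSError (e.g. ConnectionError) when the connection drops.
    """
    report: Dict[str, List[str]] = {"succeeded": [], "failed": []}
    for identifier in identifiers:
        for attempt in range(1, max_attempts + 1):
            try:
                download(identifier)
            except OSError:
                if attempt == max_attempts:
                    # Out of attempts: record the item so it can be rerun later.
                    report["failed"].append(identifier)
                else:
                    time.sleep(delay_seconds)
            else:
                report["succeeded"].append(identifier)
                break
    return report
```

The returned `failed` list is the set of items that would still need a manual rerun, which could also feed the more transparent error reporting mentioned above.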

The errors look like this, just for the record:

stdout:
2023-05-31 07:30:56,470 - internetarchive.session - DEBUG - no metadata provided for "b32743725", retrieving now.
2023-05-31 07:30:56,472 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:30:58,919 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/b32743725 HTTP/1.1" 200 None
2023-05-31 07:30:58,923 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:31:00,562 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_hocr.html HTTP/1.1" 302 None
2023-05-31 07:31:00,563 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ia902600.us.archive.org:443
2023-05-31 07:31:02,188 - urllib3.connectionpool - DEBUG - https://ia902600.us.archive.org:443 "GET /7/items/b32743725/b32743725_hocr.html HTTP/1.1" 200 None
2023-05-31 07:31:03,408 - internetarchive.files - INFO - downloaded b32743725/b32743725_hocr.html to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_hocr.html
2023-05-31 07:31:03,410 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:31:05,123 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_jp2.zip HTTP/1.1" 302 None
2023-05-31 07:31:05,125 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ia802600.us.archive.org:443
2023-05-31 07:31:06,045 - urllib3.connectionpool - DEBUG - https://ia802600.us.archive.org:443 "GET /7/items/b32743725/b32743725_jp2.zip HTTP/1.1" 200 151951512
2023-05-31 07:32:29,953 - internetarchive.files - INFO - downloaded b32743725/b32743725_jp2.zip to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_jp2.zip
2023-05-31 07:32:29,956 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:32:31,452 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_meta.xml HTTP/1.1" 302 None
2023-05-31 07:32:31,453 - urllib3.connectionpool - DEBUG - Resetting dropped connection: ia802600.us.archive.org
2023-05-31 07:32:32,060 - urllib3.connectionpool - DEBUG - https://ia802600.us.archive.org:443 "GET /7/items/b32743725/b32743725_meta.xml HTTP/1.1" 200 None
2023-05-31 07:32:32,061 - internetarchive.files - INFO - downloaded b32743725/b32743725_meta.xml to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_meta.xml
2023-05-31 07:32:32,063 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:32:33,650 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b32743725/b32743725_scandata.xml HTTP/1.1" 302 None
2023-05-31 07:32:33,652 - urllib3.connectionpool - DEBUG - Resetting dropped connection: ia902600.us.archive.org
2023-05-31 07:32:34,960 - urllib3.connectionpool - DEBUG - https://ia902600.us.archive.org:443 "GET /7/items/b32743725/b32743725_scandata.xml HTTP/1.1" 200 None
2023-05-31 07:32:34,962 - internetarchive.files - INFO - downloaded b32743725/b32743725_scandata.xml to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/84e4af94-ca4c-4de6-824a-047ec8244648/b32743725_scandata.xml
2023-05-31 07:32:42,576 - internetarchive.session - DEBUG - no metadata provided for "b3283441x", retrieving now.
2023-05-31 07:32:42,577 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:32:44,987 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /metadata/b3283441x HTTP/1.1" 200 None
2023-05-31 07:32:44,991 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:32:46,578 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b3283441x/b3283441x_hocr.html HTTP/1.1" 302 None
2023-05-31 07:32:46,580 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): ia802600.us.archive.org:443
2023-05-31 07:32:47,464 - urllib3.connectionpool - DEBUG - https://ia802600.us.archive.org:443 "GET /34/items/b3283441x/b3283441x_hocr.html HTTP/1.1" 200 None
2023-05-31 07:32:48,211 - internetarchive.files - INFO - downloaded b3283441x/b3283441x_hocr.html to /var/scratch/ip-10-50-5-17.eu-west-1.compute.internal/ae739ce4-2f46-42b5-a38d-648b451ef200/b3283441x_hocr.html
2023-05-31 07:32:48,213 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): archive.org:443
2023-05-31 07:32:49,558 - urllib3.connectionpool - DEBUG - https://archive.org:443 "GET /download/b3283441x/b3283441x_jp2.zip HTTP/1.1" 302 None
2023-05-31 07:32:49,559 - urllib3.connectionpool - DEBUG - Resetting dropped connection: ia802600.us.archive.org
2023-05-31 07:32:50,227 - url
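For the error-reporting idea, a rough sketch of how output like the above could be condensed into a per-item summary. It assumes the harvester's stdout is available as bytes and that the two message shapes seen in this log ("downloaded <item>/<file> to ..." and "Resetting dropped connection: <host>") stay stable; `summarise_harvest_log` and the regex names are made up for illustration. Note that a "Resetting dropped connection" line is only counted here, not treated as a failure, since in the log above the following request still returns 200.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Message shapes taken from the log in this comment; other shapes are ignored.
DOWNLOADED = re.compile(
    r"internetarchive\.files - INFO - downloaded (?P<item>[^/\s]+)/(?P<file>\S+) to "
)
DROPPED = re.compile(r"Resetting dropped connection: (?P<host>\S+)")


def summarise_harvest_log(stdout: bytes) -> Dict[str, object]:
    """Collect downloaded files per item and the hosts whose connection was reset."""
    text = stdout.decode("utf-8", errors="replace")
    downloaded: Dict[str, List[str]] = defaultdict(list)
    dropped: List[str] = []
    for line in text.splitlines():
        match = DOWNLOADED.search(line)
        if match:
            downloaded[match.group("item")].append(match.group("file"))
            continue
        match = DROPPED.search(line)
        if match:
            dropped.append(match.group("host"))
    return {"downloaded": dict(downloaded), "dropped_connections": dropped}
```

An item whose expected files (hocr, jp2, meta, scandata) do not all show up under `downloaded` would be a candidate for a rerun.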

mgeerdsen changed the title from "Improving IA Harvesting" to "Integrate IA Harvesting into Goobi" on Mar 26, 2024
github-project-automation bot moved this to "To do" in 2024 on Aug 29, 2024