A collection of Python and SQL scripts for data cleanup, management, and other tasks related to UGA's ArchivesSpace implementation.
Template for writing scripts that work with the ArchivesSpace API, particularly through the ArchivesSnake library (available on GitHub)
A test script with different functions testing how to start up a series of background import jobs and monitor their progress
Runs a range of cleanup functions against ArchivesSpace after UGA's migration from Archivists' Toolkit
Checks all archival objects in ArchivesSpace to see which objects are listed as collections, and updates ms3789 to change its objects from 'collection' to 'file'
Checks ArchivesSpace against exported barcodes from our container management system to determine which barcodes do not exist in ArchivesSpace. Returns a csv of barcodes not found in ArchivesSpace
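The comparison described above can be sketched as a set lookup followed by a CSV write. This is a minimal illustration, not the script itself: the function names, the `barcode` column header, and the sample barcode values are all hypothetical, and it assumes both barcode lists have already been loaded into memory.

```python
import csv
import io


def find_missing_barcodes(exported_barcodes, aspace_barcodes):
    """Return exported barcodes that are absent from ArchivesSpace, in input order."""
    known = {barcode.strip() for barcode in aspace_barcodes}
    missing = []
    seen = set()
    for barcode in exported_barcodes:
        barcode = barcode.strip()
        # Skip blanks and duplicates so each missing barcode is reported once.
        if barcode and barcode not in known and barcode not in seen:
            missing.append(barcode)
            seen.add(barcode)
    return missing


def write_missing_csv(missing, handle):
    """Write one barcode per row under a single header (header name is an assumption)."""
    writer = csv.writer(handle)
    writer.writerow(["barcode"])
    for barcode in missing:
        writer.writerow([barcode])


# Made-up sample values, for illustration only.
exported = ["A001", "A002", "A003", "A001"]
in_aspace = ["A002"]

missing = find_missing_barcodes(exported, in_aspace)
buffer = io.StringIO()
write_missing_csv(missing, buffer)
```

In practice the `handle` would be a file opened with `newline=""`, as the `csv` module recommends.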
Checks how many children a parent object has in ArchivesSpace and if the number of children is equal to or greater than 1000, logs them in a .csv file
Checks a standard list of controlled vocabulary lists and updates ArchivesSpace by deleting or merging values
Checks for parent files (archival objects with children) that do not have instances or top containers. It's confusing.
Exports all published resources from ArchivesSpace as EAD XML files, looks for any URLs in those files, checks each URL's HTTP status, and logs any that return an error (anything but 200) in a spreadsheet. This was later adapted to make the Check URLs custom report plugin for ArchivesSpace.
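The URL check can be sketched roughly as follows. This is not the original script: it swaps in the standard library's `urllib` so the example has no third-party dependencies (the requirements list `requests` for this purpose), and the regular expression, function names, and sample EAD snippet are assumptions made for illustration. Note that `urlopen` follows redirects, so a redirected URL reports the status of its final destination.

```python
import re
import urllib.error
import urllib.request

# Assumed pattern: any http(s) URL up to whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(ead_text):
    """Return the unique URLs found in EAD text, in order of first appearance."""
    return list(dict.fromkeys(URL_PATTERN.findall(ead_text)))


def http_status(url, timeout=10):
    """Return the HTTP status code for a URL, or an error description string."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
    except (OSError, ValueError) as error:
        return str(error)


def find_broken_urls(ead_text, get_status=http_status):
    """Return (url, status) pairs for every URL whose status is not 200."""
    broken = []
    for url in extract_urls(ead_text):
        status = get_status(url)
        if status != 200:
            broken.append((url, status))
    return broken


# Illustrative input and canned statuses, so the sketch runs without a network.
sample_ead = (
    '<ead><extref href="https://example.org/a">A</extref>'
    '<dao href="https://example.org/b"/><p>See https://example.org/a</p></ead>'
)
fake_statuses = {"https://example.org/a": 200, "https://example.org/b": 404}
broken = find_broken_urls(sample_ead, get_status=fake_statuses.get)
```

Passing the status lookup in as `get_status` keeps the extraction logic testable offline; calling `find_broken_urls(ead_text)` without it performs real HEAD requests.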
Quick and dirty method for comparing agents from our ArchivesSpace staging environment (v3.1.1) with those in our production environment (v2.8.1). In run(), first uncomment the first 4 lines and run get_agents() on prod and staging, then run edited_agents() to generate EDTAGT_DATA.json with all the agents that lost their dates of existence during the upgrade to staging (3.1.1). Make a backup copy of that file, then run update_does() AFTER UPGRADING prod to 3.1.1 to add the dates of existence back to those agents. See update_agent_does.py for a more user-friendly script
Checks archival objects for specific resources in ArchivesSpace and if they have an 'unknown_container' as the value of their indicator, then it deletes that object
Exports all published resources for every repository in an ArchivesSpace instance and assigns a concatenated version of the identifier as the filename
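One way to build a filename from a concatenated identifier is sketched below. ArchivesSpace stores a resource identifier in up to four parts (`id_0` through `id_3`); the hyphen separator, the `.xml` extension, the character clean-up rule, and the function name are assumptions, and the script itself may join the parts differently.

```python
import re


def export_filename(resource, separator="-", extension=".xml"):
    """Join the non-empty id_0..id_3 parts of a resource into a safe filename."""
    parts = [resource.get(f"id_{index}") for index in range(4)]
    identifier = separator.join(part.strip() for part in parts if part and part.strip())
    # Replace anything other than word characters, dots, and hyphens (assumed rule).
    safe = re.sub(r"[^\w.-]+", "_", identifier)
    return f"{safe}{extension}"


# Illustrative records; only the id_* keys matter for this sketch.
single_part = export_filename({"id_0": "ms3789"})
multi_part = export_filename({"id_0": "RBRL", "id_1": "001", "id_2": "ABC", "id_3": None})
with_space = export_filename({"id_0": "UA 97", "id_1": "100"})
```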
Finds electronic record accessions and their linked resources
Finds parent archival objects that have 'unknown container' in their identifier and logs them in a spreadsheet
Writes all subjects and agents to a spreadsheet with their title and URI. This script was made as the first step in a project to clean up subjects and agents in ArchivesSpace. The final step is outlined in update_subjects_agents.py
This script gets all the archival objects for ms3000_2e
Retrieves digital object URIs from ArchivesSpace using a preformatted spreadsheet input with titles and dates of archival objects to match to digital objects
Runs several cleanup operations on our data before migration from Archivists' Toolkit to ArchivesSpace. It updates digital objects to change their repository based on their METS Identifier, deletes component unique identifier (aka subdivisionIdentifier) content in that field, replaces leftover XML namespaces in note contents, sets the resource level to file when it is collection, and updates specific digital objects to University Archives (repository 6)
Grabs subjects from the ArchivesSpace database and their links to all resources and generates a spreadsheet with that info
Gets all archival objects in collections ms3000_1a, ms3000_1b, ms3000_2a, and ms3000_2b that don't have folders in their container instances
Grabs published archival objects that have multiple top container instances attached to them
Attempts to run SQL updates on Archivists' Toolkit databases for cleanup before migrating to ArchivesSpace
Publishes all digital object file versions
Publishes all digital objects
Checks all resources in an ArchivesSpace instance and makes an Excel spreadsheet with those without creators
Grabs all top containers with instance types that are either "moving_images" or "audio" for the Russell repository
Runs several cleanup operations on our data for the Russell repository before migration from Archivists' Toolkit to ArchivesSpace. It updates FileVersions Use Statements to Audio-Streaming, changes the Digital Object Show attribute to onLoad instead of replace, replaces leftover XML namespaces in note contents, and sets instance and extent types for some stubborn items.
Tests various API endpoints for ArchivesSpace
Tests the various endpoints for the archival_object.rb controller file for ArchivesSpace
Tests the exports from the ArchivesSpace API
Tests resource.rb endpoints for ArchivesSpace
Tests suppression.rb endpoints for ArchivesSpace
Grabs all top containers that have no barcodes and are associated with published archival objects and resources
Unpublishes resources in ArchivesSpace when they have [CLOSED] in the resource id
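The selection step can be sketched as a filter over resource records. This assumes each record is a dictionary shaped like ArchivesSpace resource JSON (`uri`, `publish`, and `id_0` through `id_3`); the function name and sample records are hypothetical, and sending the updated record back to the API is not shown.

```python
def resources_to_unpublish(resources, marker="[CLOSED]"):
    """Return URIs of published resources whose identifier contains the marker."""
    flagged = []
    for resource in resources:
        # Look across all four identifier parts, treating missing parts as empty.
        identifier = " ".join(str(resource.get(f"id_{index}") or "") for index in range(4))
        if marker in identifier and resource.get("publish"):
            flagged.append(resource["uri"])
    return flagged


# Made-up records for illustration.
sample_resources = [
    {"uri": "/repositories/2/resources/1", "id_0": "ms100 [CLOSED]", "publish": True},
    {"uri": "/repositories/2/resources/2", "id_0": "ms200", "publish": True},
    {"uri": "/repositories/2/resources/3", "id_0": "ms300", "id_1": "[CLOSED]", "publish": False},
]
flagged = resources_to_unpublish(sample_resources)
```

Records that are already unpublished are skipped, so rerunning the check only reports resources that still need a change.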
Takes info from the Dates of Existence and puts them in the Dates field to display when exporting
Provides a command line user interface for comparing agents from our ArchivesSpace staging environment (v3.1.1) with those in our production environment (v2.8.1). First, run the command compare agents. It generates 2 JSON files: AGENTS_CACHE.json stores all agents in both environments that have dates of existence, and EDTAGT_DATA.json stores all the agents that lost their dates of existence when upgrading from 2.8.1 to 3.1.1. When you run the update does command, the script goes through all the agents in EDTAGT_DATA.json and adds dates of existence back to the now updated production environment (3.1.1).
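The comparison at the heart of this workflow can be sketched as follows. The layout of the cached data is an assumption (a dictionary keyed by agent URI, each value holding the agent's `dates_of_existence` list as it appears in ArchivesSpace agent JSON), and the function name is hypothetical; the actual structure of AGENTS_CACHE.json and EDTAGT_DATA.json may differ.

```python
def agents_missing_dates(prod_agents, staging_agents):
    """Return {uri: dates} for agents with dates in production but none in staging."""
    lost = {}
    for uri, prod_agent in prod_agents.items():
        if uri not in staging_agents:
            # An agent absent from staging cannot be compared, so it is skipped.
            continue
        prod_dates = prod_agent.get("dates_of_existence") or []
        staging_dates = staging_agents[uri].get("dates_of_existence") or []
        if prod_dates and not staging_dates:
            lost[uri] = prod_dates
    return lost


# Made-up agents for illustration.
prod = {
    "/agents/people/1": {"dates_of_existence": [{"expression": "1900-1980"}]},
    "/agents/people/2": {"dates_of_existence": []},
    "/agents/people/3": {"dates_of_existence": [{"expression": "1850-1910"}]},
    "/agents/people/4": {"dates_of_existence": [{"expression": "1920-2000"}]},
}
staging = {
    "/agents/people/1": {"dates_of_existence": []},
    "/agents/people/2": {"dates_of_existence": []},
    "/agents/people/3": {"dates_of_existence": [{"expression": "1850-1910"}]},
}
lost = agents_missing_dates(prod, staging)
```

The returned dictionary holds what would need to be written back to each agent after the upgrade.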
Updates ArchivesSpace containers from a spreadsheet
Updates all HCTC container indicators to match the ####.###.###abcdetc. format for those beginning with four numbers, aka year
Transfers a series of archival objects from ms3000_2e to ms3000_2f at the top level
This script removes all hyphens (-) from Russell resource IDs
Deletes and merges subjects from a spreadsheet
Not every script requires every package as listed in the requirements.txt file. If you need to use a script, check the import statements at the top to see which specific packages are needed.
- ArchivesSnake - Library used for interacting with the ArchivesSpace API
- lxml - Used to parse XML files for evaluating any XML syntax errors and parsing data from downloaded XML files
- mysql - Used to import mysql-connector
- mysqlclient - Used to connect to the ArchivesSpace MySQL database
- mysql-connector-python - Used to connect to the ArchivesSpace MySQL database and detect any connection errors
- openpyxl - Used to create and write an Excel spreadsheet to document the data audit report
- PySimpleGUI - The package used to build GUIs
- requests - Used to check URLs and get their status codes
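To see which packages a particular script needs without reading it by hand, its import statements can be listed with the standard library's `ast` module. This is an optional helper sketch, not part of the repository; the function name and the sample source string are hypothetical.

```python
import ast


def imported_packages(source):
    """Return the sorted top-level module names imported by Python source code."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 excludes relative imports such as "from . import x".
            packages.add(node.module.split(".")[0])
    return sorted(packages)


# Illustrative script text; a real run would read a script file from disk.
example_script = (
    "import csv\n"
    "from asnake.client import ASnakeClient\n"
    "from secrets import *\n"
)
needed = imported_packages(example_script)
```

Import names do not always match the name used with pip: `asnake` is installed as ArchivesSnake, and `csv` ships with Python.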
- Download the repository by cloning it in your local IDE or by using GitHub's Code button and selecting Download ZIP
- Run
pip install -r requirements.txt
or check the import statements for the script you want to run and install only those packages
- Create a secrets.py file with the following information:
- An ArchivesSpace admin username (as_un = ""), password (as_pw = "")
- The URLs to your ArchivesSpace staging (as_api_stag = "") and production (as_api = "") API instances
- Variables with their values set to user emails you want to send the report to:
- sendfrom_email = "<send_from_email>"
- sendto_emails = ["<send_to_email>", "<send_to_email>", "<send_to_email>"]
- senderror_emails = ["<send_to_email>", "<send_to_email>"]
- The email server from which you send your email report (email_server = "")
- Your ArchivesSpace's staging database credentials, including username (as_dbstag_un = ""), password (as_dbstag_pw = ""), hostname (as_dbstag_host = ""), database name (as_dbstag_database = ""), and port (as_dbstag_port = "")
- Run the script as
python3 ASpace_Data_Audit.py
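The secrets.py file described in the installation steps might look like the sketch below. The variable names come from the list above; the empty strings and angle-bracket placeholders are to be replaced with your own values, and the comments are added here for orientation.

```python
# secrets.py (sketch): fill in your own values and keep this file out of version control.

# ArchivesSpace admin credentials
as_un = ""
as_pw = ""

# ArchivesSpace API URLs
as_api_stag = ""  # staging
as_api = ""       # production

# Email settings for sending reports
sendfrom_email = "<send_from_email>"
sendto_emails = ["<send_to_email>", "<send_to_email>", "<send_to_email>"]
senderror_emails = ["<send_to_email>", "<send_to_email>"]
email_server = ""

# ArchivesSpace staging database credentials
as_dbstag_un = ""
as_dbstag_pw = ""
as_dbstag_host = ""
as_dbstag_database = ""
as_dbstag_port = ""
```

Because the file is named secrets.py, it takes precedence over Python's standard-library `secrets` module when scripts are run from the same directory.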
Each script has its own parameters, and most do not require any arguments to run. However, you will want to take time to adjust the script to meet your own needs. For instance, you may want to set up a 'data' and/or 'reports' folder in your code's directory to store exported CSVs, Excel spreadsheets, or any other outputs that are generated from the script. See the Workflow section for more info on what each script does.
- Select which script you would like to run
- Run the script with the following command for Python scripts:
python3 <name_of_script.py>
- If there are arguments, make sure to fill out those arguments after the Python script name. Most scripts just need the information listed in the secrets.py file created in the installation step above.
- If the script is not a python script, but an SQL statement, you can either download the SQL file or copy the code to your local SQL developer environment and run it there.
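The 'data' and 'reports' folders suggested above can be created from Python before a script writes its output. This is a minimal sketch: the function name and the sample report filename are hypothetical, and it runs in a temporary directory here so the example leaves nothing behind.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(base_dir, names=("data", "reports")):
    """Create the named folders under base_dir if needed and return their paths."""
    base = Path(base_dir)
    folders = {}
    for name in names:
        folder = base / name
        # exist_ok=True makes repeated runs safe.
        folder.mkdir(parents=True, exist_ok=True)
        folders[name] = folder
    return folders


# Demonstration in a throwaway directory; a real script would pass its own folder.
with tempfile.TemporaryDirectory() as tmp:
    folders = ensure_output_dirs(tmp)
    report_path = folders["reports"] / "example_report.csv"
    report_path.write_text("barcode\n", encoding="utf-8")
    report_exists = report_path.exists()
```

In a script from this repository, `ensure_output_dirs(Path(__file__).parent)` would place both folders next to the script.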
- Corey Schmidt - Project Management Librarian/Archivist at the University of Georgia Libraries
- ArchivesSpace community
- Kevin Cottrell - GALILEO/Library Infrastructure Systems Architect at the University of Georgia Libraries
- PySimpleGUI