inDexDa - Natural Language Processing of academic papers for dataset identification and indexing.
An Initiative for human-centered Innovation in the Knowledge Sphere of the ETH Library Lab.
This portion of the project is configured through the web-scraping section of the `args.json` config file. The section has two parts: the archives the user wants to scrape for papers, and the information inDexDa needs for each supported archive.
The user can specify one or more online repositories to scrape by adding entries under the archive tag. Each entry uses the following syntax:

```json
{"id": "0x", "archive": "name"}
```

Here `x` is an integer between 1 and 9, and `name` is the name of the online paper repository, all lowercase with no spaces.
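For instance, scraping both arXiv and ScienceDirect could be configured with two entries; this is a sketch based on the syntax above, and the exact surrounding structure of `args.json` may differ:

```json
"archive": [
    {"id": "01", "archive": "arxiv"},
    {"id": "02", "archive": "sciencedirect"}
]
```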
This section holds the information required for each repository's API. The fields must be standardized across all archives, so if a field is not needed for a particular repository, leave it blank. A sketch of these entries follows the list below.
- All archives require a search query, which is used to find papers relating to that term.
- ScienceDirect additionally requires an API key the user must register for themselves (see below), as well as a range of years to search over.
- Other added archives may require more information, so fields may need to be added and the scraping code modified.
- Queries for arXiv should only be a single word.
To register for a ScienceDirect API key, follow this link: https://dev.elsevier.com/apikey/manage
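As a rough sketch, the archive-information entries might look like the following. The exact field names (`query`, `apikey`, `start_year`, `end_year`) are assumptions for illustration, so check the shipped `args.json` for the real keys; note how unused fields are left blank for arXiv:

```json
"arxiv": {
    "query": "radar",
    "apikey": "",
    "start_year": "",
    "end_year": ""
},
"sciencedirect": {
    "query": "radar",
    "apikey": "YOUR_ELSEVIER_API_KEY",
    "start_year": "2010",
    "end_year": "2019"
}
```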
inDexDa also allows users to add online or local repositories which are not natively supported. To do this, the following steps are required:
- Create a scraper class in the `PaperScraper/lib/` folder. The file containing the scraper class should be named `paper_scrape_name.py`, where `name` is the name of the archive with no spaces or punctuation between words, all letters lowercase.
- Name the class `PaperScraperName`, where `Name` is the name of the repository.
- The class should output a `papers.json` file containing a list of dicts, each dict holding the title, abstract, authors, category (if available), and date of publication (if available) of a paper. The `papers.json` file should be saved to the `PaperScraper/data/archiveName` folder.
- Upon being initialized, the new scraper class should scrape the repository and compile the list of dicts for the papers. This list should then be set to the class variable `self.papers` (a sketch follows below).
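A minimal skeleton for such a class might look like the following. The archive name `Example`, the dict keys, and the stubbed scrape logic are all hypothetical, since the real implementation depends on the repository's API:

```python
import json
import os


class PaperScraperExample:
    """Hypothetical scraper for an archive called 'example'.

    On initialization it scrapes the repository, stores the compiled
    list of paper dicts in self.papers, and writes papers.json to
    PaperScraper/data/example/.
    """

    def __init__(self, query):
        self.query = query
        self.papers = self.scrape()  # required class variable
        self.save()

    def scrape(self):
        # Replace this stub with real API calls or HTML parsing for the
        # repository being added. Each paper becomes one dict.
        return [
            {
                "title": "An Example Paper",
                "abstract": "Abstract text goes here.",
                "authors": ["A. Author", "B. Author"],
                "category": "cs.LG",   # if available
                "date": "2019-01-01",  # if available
            }
        ]

    def save(self):
        # Write the compiled list to PaperScraper/data/example/papers.json
        out_dir = os.path.join("PaperScraper", "data", "example")
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "papers.json"), "w") as f:
            json.dump(self.papers, f, indent=4)
```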
- From the new file, import the class (`PaperScraperName`) into `scrape.py`.
- In `scrape.py`, update the `databases` variable in the `scrape_databases` function to include the new scraper: add a dictionary entry whose key is the name of the repository (all lowercase, no spaces or punctuation) and whose value is the scraping class, as in the snippet below.