03. Data Infrastructure

Galileu Kim edited this page Aug 2, 2023 · 4 revisions

Overview:

The data pipeline currently lives in the code in this repo. It uses the bookdown package both to execute the pipeline and to generate its documentation. The main file is therefore index.Rmd, which runs the numbered .Rmd files; each file corresponds to one step of our data ETL process.

To run the pipeline and render the documentation, open the index.Rmd file. From there, either use the Build Book command in RStudio or enter bookdown::render_book() in the console.
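For the console route, a minimal sketch is shown here. The wrapper name render_pipeline and its repo_dir argument are hypothetical and not part of the repo; the sketch assumes index.Rmd sits at the repo root and that bookdown is installed.

```r
# Hypothetical helper (not part of the repo): render the book from the repo root.
render_pipeline <- function(repo_dir = ".") {
  index_file <- file.path(repo_dir, "index.Rmd")
  if (!file.exists(index_file)) {
    stop("index.Rmd not found in ", repo_dir)
  }
  if (!requireNamespace("bookdown", quietly = TRUE)) {
    stop("The bookdown package is required; try install.packages('bookdown').")
  }
  # bookdown resolves the numbered .Rmd files relative to the working directory,
  # so switch to the repo root for the duration of the call.
  old_wd <- setwd(repo_dir)
  on.exit(setwd(old_wd), add = TRUE)
  bookdown::render_book("index.Rmd")
}

# Usage, from the repo root:
# render_pipeline()
```

Failing early on a missing index.Rmd gives a clearer message than a bookdown error when the working directory is wrong.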

If instead you plan to test particular sections of the pipeline, run the 00-setup.Rmd file to load all required packages, then run the individual sections you want to test.
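Outside RStudio, one possible way to run the setup chunks from the console is to extract the R code with knitr::purl() and source it. This is a sketch under assumptions: the helper name run_rmd_chunks is hypothetical, knitr must be installed, and it assumes the chunks in 00-setup.Rmd can run on their own.

```r
# Hypothetical helper (not part of the repo): run every R chunk of an .Rmd file.
run_rmd_chunks <- function(rmd_file, envir = globalenv()) {
  stopifnot(file.exists(rmd_file))
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script), add = TRUE)
  # purl() extracts the code chunks of the .Rmd file into a plain R script.
  knitr::purl(rmd_file, output = script, quiet = TRUE)
  # Evaluate the extracted code in the requested environment.
  source(script, local = envir)
  invisible(envir)
}

# Usage, from the repo root:
# run_rmd_chunks("00-setup.Rmd")
```

Sourcing into the global environment by default keeps the loaded packages and any setup objects available for the sections tested afterwards.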

Data Requirements:

The data required to start the data ETL is located in the data/raw folder. Per World Bank guidelines, this data is not stored on GitHub, but it is accessible to collaborators upon request.

The data/raw folder contains the following files:

  1. db_variables.xlsx: master file with all indicators, their definitions and families.
  2. merged_for_residuals-v2.rds: original data, imported from a .dta file processed prior to the ETL.
  3. CBIData_Romelli2022.dta: contains CBI data.
  4. 20211118_new_additions_notGov360.dta: GTMI data and other non-Gov360 data.
  5. 20211118_new_additions_notGov360_PMR.dta: ?
  6. group_list.csv: list of countries produced by team.
  7. CLASS.xlsx: income group classification of countries, produced by the World Bank.
  8. WB_countries_Admin0_lowres.geojson: administrative boundaries for countries, produced by the World Bank.
  9. WB_disputed_areas_Admin0_10m.geojson: disputed areas, produced by the World Bank.
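
Because the raw data is not in the repo, it can help to confirm that every file listed above is in place before running the pipeline. A minimal sketch, assuming the working directory is the repo root; the names required_raw_files and missing_raw_files are hypothetical and not part of the pipeline.

```r
# File names copied from the list above.
required_raw_files <- c(
  "db_variables.xlsx",
  "merged_for_residuals-v2.rds",
  "CBIData_Romelli2022.dta",
  "20211118_new_additions_notGov360.dta",
  "20211118_new_additions_notGov360_PMR.dta",
  "group_list.csv",
  "CLASS.xlsx",
  "WB_countries_Admin0_lowres.geojson",
  "WB_disputed_areas_Admin0_10m.geojson"
)

# Hypothetical helper: return the required files that are absent from raw_dir.
missing_raw_files <- function(raw_dir = file.path("data", "raw")) {
  present <- file.exists(file.path(raw_dir, required_raw_files))
  required_raw_files[!present]
}

missing <- missing_raw_files()
if (length(missing) > 0) {
  message("Missing from data/raw: ", paste(missing, collapse = ", "))
}
```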