03. Data Infrastructure

Galileu Kim edited this page Jul 17, 2023 · 4 revisions

Overview:

The data pipeline currently lives in the code of this repo. It uses the [bookdown](https://bookdown.org/yihui/bookdown/html.html) package both to execute the pipeline and to generate documentation on it. The main file is therefore index.Rmd, which executes the numbered .Rmd files; each numbered file corresponds to one step in our data ETL process.

Data Requirements:

The data required to start the ETL is located in the data/raw folder. Per World Bank guidelines, this data is not stored on GitHub, but it is available to collaborators upon request.

The data/raw folder contains the following files:

  1. db_variables.xlsx: master file with all indicators, their definitions and families.
  2. merged_for_residuals-v2.rds: original data, imported from a .dta file processed prior to the ETL.
  3. CBIData_Romelli2022.dta: contains CBI data.
  4. 20211118_new_additions_notGov360.dta: GTMI data and other non-Gov360 data.
  5. 20211118_new_additions_notGov360_PMR.dta: ?
  6. group_list.csv: list of countries produced by team.
  7. CLASS.xlsx: income group classification of countries, produced by the World Bank.
  8. WB_countries_Admin0_lowres.geojson: administrative boundaries for countries, produced by the World Bank.
  9. WB_disputed_areas_Admin0_10m.geojson: disputed areas, produced by the World Bank.
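
Because the raw data is shared outside GitHub, a fresh clone will not contain these files. Before running the pipeline, it can help to confirm that everything has been copied into place. The sketch below is not part of the repo: it is a standalone Python helper (the pipeline itself is written in R), and the names `REQUIRED_RAW_FILES` and `missing_raw_files` are hypothetical. It assumes the script is run from the repository root, so that `data/raw` resolves correctly.

```python
from pathlib import Path

# File names copied from the list above.
REQUIRED_RAW_FILES = [
    "db_variables.xlsx",
    "merged_for_residuals-v2.rds",
    "CBIData_Romelli2022.dta",
    "20211118_new_additions_notGov360.dta",
    "20211118_new_additions_notGov360_PMR.dta",
    "group_list.csv",
    "CLASS.xlsx",
    "WB_countries_Admin0_lowres.geojson",
    "WB_disputed_areas_Admin0_10m.geojson",
]


def missing_raw_files(raw_dir="data/raw"):
    """Return the required file names that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in REQUIRED_RAW_FILES if not (raw_path / name).is_file()]


if __name__ == "__main__":
    # Assumption: the working directory is the repository root.
    missing = missing_raw_files()
    if missing:
        print("Missing from data/raw:")
        for name in missing:
            print(f"  - {name}")
    else:
        print("All required raw files are present.")
```

The check only looks at file names; it does not verify that the contents are the expected versions.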