03. Data Infrastructure
The data pipeline is currently contained in the code within the repo. It uses the bookdown package both to execute the pipeline and to generate documentation on it. The main file is therefore index.Rmd, which executes the numbered .Rmd files, each corresponding to a particular step in our data ETL process.
To run the pipeline and render the documentation, open the index.Rmd file. There, you can either use the Build Book command in RStudio or enter the bookdown::render_book() command in the console.
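As a minimal sketch, the console route might look like the following. It assumes that the bookdown package is installed and that index.Rmd sits at the root of the repo; the helper name run_pipeline is hypothetical and not part of the repo.

```r
# Hypothetical helper: render the book, which also executes the pipeline.
# Assumes `repo_dir` is the folder that contains index.Rmd.
run_pipeline <- function(repo_dir = ".") {
  index_file <- file.path(repo_dir, "index.Rmd")
  if (!file.exists(index_file)) {
    stop("index.Rmd not found in: ", normalizePath(repo_dir, mustWork = FALSE))
  }
  # render_book() resolves its input relative to the working directory,
  # so switch to the repo root for the duration of the call.
  old_wd <- setwd(repo_dir)
  on.exit(setwd(old_wd), add = TRUE)
  bookdown::render_book("index.Rmd")
}
```

From the repo root, calling run_pipeline() should be equivalent to typing bookdown::render_book("index.Rmd") directly; the wrapper only adds an early, readable error when the working directory is wrong.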
If instead you plan to test particular sections of the pipeline, you may simply run the 00-setup.Rmd file to load all required packages, and then test individual sections.
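One way to run the setup code without a full build is sketched below. It assumes that 00-setup.Rmd is in the working directory and that its chunks can run on their own. Extracting the chunks with knitr::purl() is a choice made here for illustration, not necessarily how the team does it, and load_setup is a hypothetical name.

```r
# Hypothetical helper: run only the R chunks of the setup file.
load_setup <- function(setup_file = "00-setup.Rmd") {
  if (!file.exists(setup_file)) {
    stop("Setup file not found: ", setup_file)
  }
  # Extract the R code chunks into a temporary script ...
  script <- tempfile(fileext = ".R")
  knitr::purl(setup_file, output = script, quiet = TRUE)
  # ... and evaluate them in the global environment, so that loaded
  # packages and any objects they define are available interactively.
  source(script, local = globalenv())
  invisible(script)
}
```

After load_setup() has run, individual chunks from the other numbered .Rmd files can be executed interactively in the same session.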
The data required to initiate the data ETL is located in the data/raw folder. Per World Bank guidelines, this data is not stored on GitHub, but is accessible to collaborators upon request.
The data/raw folder contains the following files:
- db_variables.xlsx: master file with all indicators, their definitions and families.
- merged_for_residuals-v2.rds: original files, imported from a dta file processed prior to the ETL.
- CBIData_Romelli2022.dta: contains CBI data.
- 20211118_new_additions_notGov360.dta: GTMI data and other non-Gov360 data.
- 20211118_new_additions_notGov360_PMR.dta: ?
- group_list.csv: list of countries produced by team.
- CLASS.xlsx: income group classification of countries, produced by the World Bank.
- WB_countries_Admin0_lowres.geojson: administrative boundaries for countries, produced by the World Bank.
- WB_disputed_areas_Admin0_10m.geojson: disputed areas, produced by the World Bank.
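Because the raw data is not stored on GitHub, it can help to confirm that all of the files listed above are present before starting the ETL. The check below is only a sketch: the file names come from the list above, while the helper name missing_raw_files and the default relative path are assumptions.

```r
# File names as listed above.
required_raw_files <- c(
  "db_variables.xlsx",
  "merged_for_residuals-v2.rds",
  "CBIData_Romelli2022.dta",
  "20211118_new_additions_notGov360.dta",
  "20211118_new_additions_notGov360_PMR.dta",
  "group_list.csv",
  "CLASS.xlsx",
  "WB_countries_Admin0_lowres.geojson",
  "WB_disputed_areas_Admin0_10m.geojson"
)

# Hypothetical helper: return the names of required files that are
# missing from `raw_dir` (assumed to be data/raw under the repo root).
missing_raw_files <- function(raw_dir = file.path("data", "raw")) {
  present <- file.exists(file.path(raw_dir, required_raw_files))
  required_raw_files[!present]
}
```

Called from the repo root, missing_raw_files() returns an empty character vector when every file is in place; any names it does return are the ones to request from the team.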