This repository provides an easy way to load UK Biobank data. It consists of a pre-processing script, which converts the UK Biobank data into Parquet files that are easier to read, and a library that provides methods to access the data.
To install this package, simply run
pip install ukbiobank-loaders
Please note that Python 3.7 or newer is required.
We will now describe how to use this library. Data can be read from both local directories and AWS S3 directories.
The pre-processing needs the following UK Biobank files, all saved in the same directory <DATA_FOLDER>:
death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt
The withdrawn consent file is also needed:
withdrawn_consent.txt
From the terminal, run
update_data.py --raw_dir <DATA_FOLDER> --withdrawn_file <WITHDRAWN_CONSENT_FILE_PATH> --out_dir <OUTPUT_DIR_FOLDER>
The processed data will be saved in a folder named <OUTPUT_DIR_FOLDER>/final.
In our tests, this process took about 14 minutes on a pod with 4 CPUs and 32 GB of RAM. If the process is killed, there may not be enough RAM available.
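Before launching the pre-processing, it can help to confirm that every expected raw file is present. The snippet below is a minimal sketch and not part of this package: the helper name `find_missing_files` is hypothetical, it assumes a local directory rather than S3, and it does not check the withdrawn consent file, which is passed separately.

```python
from pathlib import Path

# Raw UK Biobank files that the pre-processing expects in <DATA_FOLDER>.
REQUIRED_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def find_missing_files(raw_dir: str) -> list:
    """Return the required file names that are absent from raw_dir."""
    folder = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list returned for `<DATA_FOLDER>` means all seven files were found.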
This is a simple example of how to use the library. Specific documentation of the methods is given below.
>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "<OUTPUT_DIR_FOLDER>/final")
>>> dl.get_hospital_data("icd10")
    date_of_visit source feature  value
eid
68     1986-04-22  icd10    N181      1
68     1945-05-03  icd10    N181      1
68     1950-04-03  icd10    N181      1
68     1966-08-07  icd10    N181      1
67     1991-03-12  icd10    N181      1
..            ...    ...     ...    ...
73            NaT  icd10    N181      1
48     1997-06-20  icd10    N181      1
48     1945-03-05  icd10    N181      1
48     1956-02-25  icd10    N181      1
48     1981-04-08  icd10    N181      1
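Because the result is an ordinary long-format pandas DataFrame indexed by eid, standard pandas operations apply to it. The following is a small sketch on a hand-made toy frame with the same columns (the values are made up, and no library call is involved); it selects one code and finds each patient's earliest dated visit.

```python
import pandas as pd

# Toy frame mimicking the long format shown above (values are made up).
records = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1945-05-03", "1991-03-12", None]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "N181", "N181", "N181"],
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([68, 68, 67, 73], name="eid"),
)

# Keep rows for one code, then take each patient's earliest dated visit.
# Patients whose only dates are missing (NaT) keep NaT in the result.
n181 = records[records["feature"] == "N181"]
first_visit = n181.groupby(level="eid")["date_of_visit"].min()
```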
def load_lookup(lookup_name: str) -> pd.DataFrame
Loads lookup table.
Arguments:
lookup_name
str - The name of the lookup table to be loaded.
Returns:
pd.DataFrame - The lookup table of interest.
Example: Load lookup of ICD10 diagnosis codes:
load_lookup("ehr_diagnosis_icd10")
def load_mapper(mapper_name: str) -> pd.DataFrame
Loads ontology mapper.
Arguments:
mapper_name
str - The name of the mapper to be loaded.
Returns:
pd.DataFrame - The mapper of interest.
Example: Load mapping from ICD10 codes to Phecodes:
load_mapper("icd10_to_phecodes")
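One plausible way to apply a mapper is a left join between diagnosis records and the mapper table. The sketch below uses hand-made toy frames only: the mapper column names `icd10` and `phecode` and all code values are placeholders chosen for illustration, so check the columns of the frame actually returned by `load_mapper` before adapting it.

```python
import pandas as pd

# Hypothetical mapper: the real column names and codes may differ.
mapper = pd.DataFrame(
    {"icd10": ["CODE_A", "CODE_B"], "phecode": ["PHE_1", "PHE_2"]}
)

# Toy diagnosis records keyed by patient ID (values are placeholders).
diagnoses = pd.DataFrame(
    {"feature": ["CODE_A", "CODE_B", "CODE_C"]},
    index=pd.Index([1, 1, 2], name="eid"),
)

# Left-join on the code; unmapped codes get a missing phecode.
mapped = (
    diagnoses.reset_index()
    .merge(mapper, how="left", left_on="feature", right_on="icd10")
    .set_index("eid")
)
```

A left join keeps every diagnosis row, so codes without a mapping remain visible instead of being silently dropped.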
Loaders for versioned UKBB data.
class DataLoader()
def __init__(data_dir: str)
Class for loading UKBB data.
Arguments:
data_dir
str - The path to the directory containing the processed data. Note that on Windows the path must use forward slashes, e.g. "C:/Users/john/Documents/data_dir".
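If a Windows path is only available with backslashes, the standard library can convert it before it is passed as `data_dir`. This is a minimal sketch using `pathlib`; the variable names are illustrative.

```python
from pathlib import PureWindowsPath

# A Windows-style path with backslashes (example path from the text above).
windows_path = r"C:\Users\john\Documents\data_dir"

# as_posix() renders the same path with forward slashes.
data_dir = PureWindowsPath(windows_path).as_posix()
print(data_dir)  # C:/Users/john/Documents/data_dir
```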
def get_hospital_data(source: Union[str, List[str]],
level=None,
patient_list: np.ndarray = None) -> pd.DataFrame
Method that fetches hospital data for the UKBB population.
Arguments:
source
str or list - The coding/representation/source we would like to fetch. It needs to be one or more of:
icd10 - for fetching all icd10 related diagnoses.
icd9 - for fetching all icd9 related diagnoses.
opcs3 - for fetching all opcs3 related operation codes.
opcs4 - for fetching all opcs4 related operation codes.
level
list or string - The level/significance of diagnoses we would like to fetch. It needs to be one or more of:
primary - for fetching only the primary code related to one diagnosis.
secondary - for fetching all the secondary (complementary) codes for one diagnosis.
external - for fetching diagnosis codes from external sources.
Defaults to all of them.
patient_list
np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.
Returns:
df
pd.DataFrame - A long canonical dataframe with patients as the index and the following columns:
- date_of_visit: pandas datetime for each hospital visit
- feature: the different codes used (e.g. the different icd10 codes)
- source: the coding system the feature belongs to (e.g. icd10)
- value: the occurrence value for each row combination (initially 1)
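A long frame with these columns can be turned into a patient-by-code matrix with plain pandas. The sketch below works on a hand-made toy frame (values are made up, and nothing here is specific to this library); it counts occurrences per patient and code.

```python
import pandas as pd

# Toy long-format frame with the documented columns (values are made up).
long_df = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["2001-01-05", "2003-07-19", "2003-07-19", "2010-02-11"]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181", "N181", "I10", "I10"],
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([1, 1, 1, 2], name="eid"),
)

# Sum occurrences per patient and code; absent combinations become 0.
wide = long_df.reset_index().pivot_table(
    index="eid", columns="feature", values="value", aggfunc="sum", fill_value=0
)
```

Summing the `value` column works here because each row carries an occurrence value of 1.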
def get_death_data(level=None,
patient_list: np.ndarray = None) -> pd.DataFrame
Method that fetches death information for the UKBB population.
Arguments:
level
list or string - The level/significance of deaths we would like to fetch. It needs to be one or both of: primary (main cause of death), secondary. Defaults to both.
patient_list
np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.
Returns:
df
pd.DataFrame - A long canonical dataframe with patients as the index and all recorded death information including death date in the right format.
def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)
Method that fetches GP diagnosis information for the UKBB population.
Arguments:
source
str or list - Whether to load read_2, read_3 or both. Defaults to both.
patient_list
np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.
Returns:
df
pd.DataFrame - A long canonical dataframe with patients as the index and all recorded GP information including date in the right format.
def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame
Method that fetches GP medication data for the UKBB population.
Arguments:
patient_list
np.ndarray - The patients to fetch medication data for. If this is empty, all UKBB patients will be used.
Returns:
df
pd.DataFrame - A canonical long dataframe with patients as the index and features as columns.
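The `patient_list` argument shared by the methods above restricts the result to a set of patients. How the library does this internally is not documented here; the following is only a toy illustration of the same idea on a hand-made frame, using plain pandas and NumPy, with made-up IDs.

```python
import numpy as np
import pandas as pd

# Toy frame indexed by patient ID (eid); values are made up.
toy = pd.DataFrame(
    {"feature": ["a", "b", "c", "d"], "value": [1, 1, 1, 1]},
    index=pd.Index([48, 67, 68, 68], name="eid"),
)

# Hypothetical cohort of interest.
patient_list = np.array([68, 73])

# Keep only rows whose eid appears in patient_list.
subset = toy[toy.index.isin(patient_list)]
```

Patients in the list with no records (73 in this toy case) simply do not appear in the subset.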
This package is developed using the UK Biobank Resource under Application Number 43138.