BenevolentAI/ukbiobank-loaders

ukbiobank-loaders

This repository provides an easy way to load UK Biobank data. It is composed of a pre-processing script, which converts the UK Biobank data into parquets that are easier to read, and a library that provides different methods to access the data.

Installation

To install this package, simply run

pip install ukbiobank-loaders

Please note that Python 3.7 or newer is required.

Usage

This section describes how to use the library. Data can be read from both local directories and AWS S3 directories.

Pre-processing

The pre-processing requires the following UK Biobank files, all saved in the same directory <DATA_FOLDER>:

death.txt
death_cause.txt
gp_clinical.txt
gp_scripts.txt
hesin.txt
hesin_diag.txt
hesin_oper.txt

The withdrawn consent file is also needed:

withdrawn_consent.txt

From the terminal, run

update_data.py --raw_dir <DATA_FOLDER> --withdrawn_file <WITHDRAWN_CONSENT_FILE_PATH> --out_dir <OUTPUT_DIR_FOLDER>

The processed data will be saved in a folder named <OUTPUT_DIR_FOLDER>/final.

We found this process to take about 14 minutes in a pod with 4 CPUs and 32 GB of RAM. If the process is Killed, it might be because there is not enough RAM available.
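
Before starting the pre-processing, it can help to confirm that all the input files listed above are present. The following is a minimal sketch for a local <DATA_FOLDER> only; the helper name missing_raw_files is hypothetical and is not part of this package.

```python
from pathlib import Path

# File names taken from the list of required UK Biobank files above.
REQUIRED_FILES = [
    "death.txt",
    "death_cause.txt",
    "gp_clinical.txt",
    "gp_scripts.txt",
    "hesin.txt",
    "hesin_diag.txt",
    "hesin_oper.txt",
]


def missing_raw_files(raw_dir):
    """Return the required file names that are not found in raw_dir.

    Hypothetical helper for illustration; it only handles local directories.
    """
    raw_path = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw_path / name).is_file()]
```

An empty list means every required file was found. The withdrawn consent file is passed separately, so it is not checked here.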

Accessing the data

This is a simple example of how to use the library. Detailed documentation of the methods is given below.

>>> from ukbb_loaders.loaders import load
>>> dl = load.DataLoader(data_dir = "<OUTPUT_DIR_FOLDER>/final")
>>> dl.get_hospital_data("icd10")
    date_of_visit source feature  value
eid
68     1986-04-22  icd10    N181      1
68     1945-05-03  icd10    N181      1
68     1950-04-03  icd10    N181      1
68     1966-08-07  icd10    N181      1
67     1991-03-12  icd10    N181      1
..            ...    ...     ...    ...
73            NaT  icd10    N181      1
48     1997-06-20  icd10    N181      1
48     1945-03-05  icd10    N181      1
48     1956-02-25  icd10    N181      1
48     1981-04-08  icd10    N181      1
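
Because the result is a long dataframe indexed by eid, ordinary pandas operations apply to it. The sketch below builds a few synthetic rows in the same layout as the output above (not real UK Biobank data) and finds the earliest recorded date of a code per patient. The function name first_occurrence is hypothetical and is not part of this package.

```python
import pandas as pd

# Synthetic rows mimicking the layout shown above; these are not real records.
df = pd.DataFrame(
    {
        "date_of_visit": pd.to_datetime(
            ["1986-04-22", "1945-05-03", "1991-03-12", None]
        ),
        "source": ["icd10"] * 4,
        "feature": ["N181"] * 4,
        "value": [1, 1, 1, 1],
    },
    index=pd.Index([68, 68, 67, 73], name="eid"),
)


def first_occurrence(long_df, code):
    """Earliest date_of_visit of `code` per patient.

    Patients whose dates are all missing (NaT) are dropped from the result.
    """
    rows = long_df[long_df["feature"] == code]
    return rows.groupby(level="eid")["date_of_visit"].min().dropna()


earliest = first_occurrence(df, "N181")
```

Here earliest is a Series indexed by eid. Patient 73 is absent from it because the only date recorded for that patient is missing.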

Documentation for ukbb_loaders.loaders

ukbb_loaders.utilities.util

load_lookup

def load_lookup(lookup_name: str) -> pd.DataFrame

Loads lookup table.

Arguments:

  • lookup_name str - The name of the lookup table to be loaded.

Returns:

  • (pd.DataFrame) - The lookup table of interest.

Example: Load lookup of ICD10 diagnosis codes:

load_lookup("ehr_diagnosis_icd10")

load_mapper

def load_mapper(mapper_name: str) -> pd.DataFrame

Loads ontology mapper.

Arguments:

  • mapper_name str - The name of the mapper to be loaded.

Returns:

  • (pd.DataFrame) - The mapper of interest.

Example: Load mapping from ICD10 codes to Phecodes:

load_mapper("icd10_to_phecodes")

ukbb_loaders.loaders.load

Loaders for versioned UKBB data.

DataLoader Objects

class DataLoader()

__init__

def __init__(data_dir: str)

Class for loading UKBB data.

Arguments:

  • data_dir str - The path to the directory containing the processed data. Note that on Windows the path must have forward-slashes, e.g. "C:/Users/john/Documents/data_dir"

get_hospital_data

def get_hospital_data(source: Union[str, List[str]],
                      level=None,
                      patient_list: np.ndarray = None) -> pd.DataFrame

Method that fetches hospital data for the UKBB population.

Arguments:

  • source str or list - The coding/representation/source we would like to fetch. It needs to be one or more of:
    • icd10 - for fetching all icd10 related diagnoses.
    • icd9 - for fetching all icd9 related diagnoses.
    • opcs3 - for fetching all opcs3 related operation codes.
    • opcs4 - for fetching all opcs4 related operation codes.
  • level list or string - The level/significance of diagnoses we would like to fetch. Defaults to all of them. It needs to be one or more of:
    • primary - for fetching only the primary code related to one diagnosis.
    • secondary - for fetching all the secondary (complementary) codes for one diagnosis.
    • external - for fetching diagnosis codes from external sources.
  • patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

  • df pd.DataFrame - A long canonical dataframe with patients as the index and the following columns:
    • date_of_visit: pandas datetime for each hospital visit
    • feature: the different codes used (e.g. the different icd10 codes)
    • source: this is relevant to the source the feature is referring to (e.g. icd10)
    • value: the occurrence value for each row combination (initially 1.)
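
The arguments above can be combined in a single call. The following is a minimal sketch that requests only primary ICD-9 and ICD-10 diagnoses for a chosen set of patients. The wrapper name fetch_primary_diagnoses is hypothetical, and the loader argument is assumed to be a DataLoader built as in the example in "Accessing the data".

```python
import numpy as np


def fetch_primary_diagnoses(loader, eids):
    """Fetch primary ICD-9 and ICD-10 hospital diagnoses for the given patients.

    `loader` is assumed to be a DataLoader; `eids` is any sequence of patient ids.
    """
    return loader.get_hospital_data(
        source=["icd10", "icd9"],
        level="primary",
        patient_list=np.asarray(eids),
    )


# Usage (requires the processed data, so it is shown as a comment):
# from ukbb_loaders.loaders import load
# dl = load.DataLoader(data_dir="<OUTPUT_DIR_FOLDER>/final")
# df = fetch_primary_diagnoses(dl, [48, 67, 68])
```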

get_death_data

def get_death_data(level=None,
                   patient_list: np.ndarray = None) -> pd.DataFrame

Method that fetches death information for the UKBB population.

Arguments:

  • level list or string - The level/significance of deaths we would like to fetch. It needs to be one or both of: primary (main cause of death), secondary. Defaults to both.
  • patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

  • df pd.DataFrame - A long canonical dataframe with patients as the index and all recorded death information including death date in the right format.

get_gp_clinical_data

def get_gp_clinical_data(source=None, patient_list: np.ndarray = None)

Method that fetches GP diagnosis information for the UKBB population.

Arguments:

  • source str or list - Whether to load read_2, read_3 or both. Defaults to both.
  • patient_list np.ndarray - The patients to fetch characteristics for. If this is empty, all UKBB patients will be used.

Returns:

  • df pd.DataFrame - A long canonical dataframe with patients as the index and all recorded gp information including date in the right format.

get_gp_medication_data

def get_gp_medication_data(patient_list: np.ndarray = None) -> pd.DataFrame

Method that fetches GP medication data for the UKBB population.

Arguments:

  • patient_list np.ndarray - The patients to fetch medication data for. If this is empty, all UKBB patients will be used.

Returns:

  • df pd.DataFrame - A canonical long dataframe with patients as the index and features as columns.
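
Since all four methods accept the same patient_list argument, the records of one cohort can be gathered in a single step. The following is a sketch under that assumption; the function name load_patient_records and the dictionary keys are hypothetical choices, not part of this package.

```python
import numpy as np


def load_patient_records(loader, eids):
    """Collect hospital, death and GP dataframes for the same set of patients.

    `loader` is assumed to be a DataLoader; the level and source arguments are
    left at their defaults, so every level and every GP coding is fetched.
    """
    patients = np.asarray(eids)
    return {
        "hospital": loader.get_hospital_data(
            source=["icd10", "icd9", "opcs3", "opcs4"], patient_list=patients
        ),
        "death": loader.get_death_data(patient_list=patients),
        "gp_clinical": loader.get_gp_clinical_data(patient_list=patients),
        "gp_medication": loader.get_gp_medication_data(patient_list=patients),
    }
```

Returning a dictionary keeps the four dataframes separate, which avoids assuming that their columns line up.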

Acknowledgments

This package is developed using the UK Biobank Resource under Application Number 43138.
