FEMR_Pipeline_WIP
This document is a comprehensive guide to the pipeline we use in the Shahlab to train foundation models on structured EHR data. It also lists the major differences between FEMR version 1 and FEMR version 2. Towards the end, we explore some alternatives to FEMR and discuss their pros and cons.
Clarity: This step involves extracting patient data from the Clarity database, which is a comprehensive data warehouse used for reporting and analytics.
Done by Research IT: The extraction and initial processing of data from Clarity are performed by the Research IT team. The data is then converted into the OMOP Common Data Model (CDM) version 5.3 format, a standardized data model designed to enable the systematic analysis of disparate observational databases.
The standardized OMOP CDM data is then processed into a patient timeline, which orders each patient's events chronologically. FEMR v1 does this with FEMRDataset(); FEMR v2 does it with MEDS.
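To make the idea of a patient timeline concrete, here is a minimal sketch in plain Python. It is not how FEMRDataset() or MEDS implement it; `Event` and `build_timelines` are hypothetical names, and the input rows are an assumed simplification of OMOP tables, reduced to `(person_id, time, code)` tuples with illustrative codes.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    """One clinical event on a timeline (hypothetical structure)."""
    time: datetime
    code: str


def build_timelines(rows):
    """Group (person_id, time, code) rows by patient, then sort each
    patient's events by time (ties are broken by code)."""
    timelines = defaultdict(list)
    for person_id, time, code in rows:
        timelines[person_id].append(Event(time, code))
    return {pid: sorted(events) for pid, events in timelines.items()}


# Illustrative rows, deliberately out of chronological order.
rows = [
    (1, datetime(2021, 3, 5, 9, 0), "LOINC/4548-4"),
    (2, datetime(2020, 1, 1), "ICD10CM/I10"),
    (1, datetime(2019, 7, 1), "ICD10CM/E11.9"),
]
timelines = build_timelines(rows)
for pid, events in timelines.items():
    print(pid, [(e.time.date().isoformat(), e.code) for e in events])
```

Each patient ends up with a single list of events in chronological order, which is the shape that downstream featurizers and labelers operate on.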
| | FEMR v1 | FEMR v2 |
|---|---|---|
| ETL | Performs joins on patients using the relational tables in STARR | Performs joins on patients using the relational tables in STARR |
| | Transforms source codes to concept codes | Retains original source codes |
| | Performs 6 data transformations, such as moving billing codes to the end of each visit and moving all events to after the birth of a patient. They can be found here | |
| MEDS | No support for converting OMOP to MEDS format | Converts OMOP to MEDS data format using meds_etl_omop (found here) and adds transformations such as "moving billing codes to the end of the day" using femr_stanford_omop_fixer |
| Featurization | Count Featurizer: same as FEMR v2 | Count Featurizer: converts patient codes into a sparse vector v of length l, where v_i is the frequency of code i. It can be found here |
| | Age Featurizer: same as FEMR v2 | Age Featurizer: returns the age of the patient at each label. It can be found here |
| Labels | Uses a custom Label class that returns the prediction timestamp and label value (similar to FEMR v2) but not the patient id. It can be found here | Uses the MEDS schema to define labels, which also include the patient id. It can be found here |
| Splits | Performs train/validation/test splits using get_patient_splits_by_idx. It can be found here | Train/test split based on generate_hash_split. It can be found here |
| Ontology | Same as FEMR v2 | From the Athena ontology CONCEPT.csv, codes are generated as vocabulary_id/concept_code, followed by their textual definition stored in concept_name |
| Ease of build | Hard to build because of many complex dependencies | Many bugs, and slow performance |
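Several rows of the comparison above describe small, concrete computations: count featurization, age at label time, hash-based splitting, and ontology code strings. The sketch below illustrates each one under stated assumptions. None of these functions is FEMR's actual code: `count_featurize`, `age_in_years`, `hash_split`, and `ontology_code` are hypothetical names, the SHA-256 bucketing scheme is an assumption (FEMR's generate_hash_split may hash differently), and the codes are illustrative.

```python
import hashlib
from collections import Counter
from datetime import datetime


def count_featurize(codes, vocabulary):
    """Sparse count vector: maps index i to the frequency of vocabulary[i].
    Codes outside the vocabulary are ignored; zero entries are omitted."""
    index = {code: i for i, code in enumerate(vocabulary)}
    counts = Counter(code for code in codes if code in index)
    return {index[code]: n for code, n in counts.items()}


def age_in_years(birth, prediction_time):
    """Approximate patient age at the label's prediction time."""
    return (prediction_time - birth).days / 365.25


def hash_split(patient_ids, seed, frac_test=0.15):
    """Assign each patient to train or test by hashing (seed, id).
    The same patient always lands in the same split for a given seed."""
    train, test = [], []
    for pid in patient_ids:
        digest = hashlib.sha256(f"{seed}:{pid}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
        (test if bucket < frac_test else train).append(pid)
    return train, test


def ontology_code(vocabulary_id, concept_code):
    """Build a code string of the form vocabulary_id/concept_code."""
    return f"{vocabulary_id}/{concept_code}"


vocabulary = [ontology_code("ICD10CM", "E11.9"),
              ontology_code("ICD10CM", "I10"),
              ontology_code("LOINC", "4548-4")]
patient_codes = ["ICD10CM/E11.9", "LOINC/4548-4", "ICD10CM/E11.9"]

print(count_featurize(patient_codes, vocabulary))  # {0: 2, 2: 1}
print(age_in_years(datetime(2000, 1, 1), datetime(2010, 1, 1)))
train_ids, test_ids = hash_split(range(1000), seed=97)
print(len(train_ids), len(test_ids))
```

Hashing the patient id, rather than shuffling indices, keeps a patient's split assignment stable when the cohort is re-extracted or grows, which is one plausible reason to prefer a hash-based split.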
PyHealth is a Python package for training foundation models on EHR data. Michael ran some benchmarking tests last year and found it painfully slow, because it uses pandas DataFrames under the hood. Here are the preliminary results from Michael's analysis.