
FEMR_Pipeline_WIP

Suhana Bedi edited this page Jun 12, 2024 · 29 revisions

FEMR Pipeline Details

This document is a guide to the full pipeline we use in the Shahlab to train foundation models on structured EHR data. It also lists the major differences between FEMR version 1 and FEMR version 2. Towards the end, we explore some alternatives to FEMR and discuss their pros and cons.

Table of Contents

  1. Data Flow and Processes
  2. Differences Between FEMR v1 and FEMR v2
  3. Alternatives to FEMR

Data Flow and Processes

Workflow Diagram

Clarity to OMOP CDM v5.3:

Clarity: This step extracts patient data from the Clarity database, a comprehensive data warehouse used for reporting and analytics.

Done by Research IT: The Research IT team performs the extraction and initial processing of data from Clarity. The data is then converted into the OMOP Common Data Model (CDM) version 5.3 format, a standardized data model designed to enable the systematic analysis of disparate observational databases.

OMOP CDM v5.3 to Patient Timeline:

The standardized data in the OMOP CDM format is further processed to create a patient timeline, which organizes each patient's data chronologically. FEMR v1 does this with FEMRDataset(); FEMR v2 does it by converting the data to the MEDS format.
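To make the idea of a patient timeline concrete, here is a minimal sketch in plain Python. It is not the FEMRDataset() or MEDS implementation; the row layout, the `build_timelines` helper, and the example codes are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical flattened OMOP-style rows: (person_id, event time, code).
# Real OMOP data is spread over several relational tables.
rows = [
    (2, datetime(2021, 3, 1), "ICD10CM/E11.9"),
    (1, datetime(2020, 5, 4), "LOINC/4548-4"),
    (1, datetime(2019, 1, 2), "ICD10CM/E11.9"),
]

def build_timelines(rows):
    """Group events by patient and sort each patient's events by time."""
    timelines = defaultdict(list)
    for person_id, event_time, code in rows:
        timelines[person_id].append((event_time, code))
    for events in timelines.values():
        # Sort on the timestamp only; ties keep their input order.
        events.sort(key=lambda event: event[0])
    return dict(timelines)

timelines = build_timelines(rows)
for person_id, events in sorted(timelines.items()):
    print(person_id, [(t.date().isoformat(), code) for t, code in events])
```

Each patient ends up with one chronologically ordered list of events, which is the shape that later steps such as featurization and labeling consume.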

Differences Between FEMR v1 and FEMR v2

| | FEMR v1 | FEMR v2 |
|---|---|---|
| ETL | Perform joins on patients using the relational tables in STARR | Perform joins on patients using the relational tables in STARR |
| | Transforms source codes to concept codes | Retains original source codes |
| | Perform 6 data transformations like "Move billing codes to the end of each visit", "Move all events to after the birth of a patient" etc. They can be found here | |
| MEDS | There is NO support in v1 to convert OMOP to MEDS format. | Converts OMOP to MEDS data format using meds_etl_omop (found here) and adds transformations like "moving billing codes to the end of the day" using femr_stanford_omop_fixer |
| Featurization | Count Featurizer: Same as FEMR v2 | Count Featurizer: Converts patient codes into a sparse vector v of length l where v_i is the frequency of code i. It can be found here |
| | Age Featurizer: Same as FEMR v2 | Age Featurizer: Returns the age of the patient at each label. It can be found here |
| Labels | Uses a custom Label class that returns prediction timestamp and label value (similar to FEMR v2) but not the patient id. It can be found here | Uses the MEDS schema to define labels which also include the patient id. It can be found here |
| Splits | Performs Train/Validation/Test splits using get_patient_splits_by_idx. It can be found here | Train-Test split based on generate_hash_split. It can be found here |
| Ontology | Same as FEMR v2 | From the Athena ontology CONCEPT.csv, codes are generated as: vocabulary_id / concept_code followed by their textual definition stored in concept_name |
| Ease of Build | Hard to build due to lots of complex dependencies | Lots of bugs and performance issues in terms of speed |
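
The Featurization rows above can be illustrated with a small sketch. This is not FEMR's featurizer code: `count_featurize`, `age_in_years`, the vocabulary, and the example codes are hypothetical names and values chosen for illustration, and the sparse vector is represented as a plain dict from index to count.

```python
from collections import Counter
from datetime import datetime

def count_featurize(codes, vocabulary):
    """Return a sparse {index: count} vector over a fixed vocabulary.

    Entry i holds how often vocabulary[i] appears in the patient's codes.
    Codes outside the vocabulary are ignored (an assumption of this sketch).
    """
    index = {code: i for i, code in enumerate(vocabulary)}
    counts = Counter(index[code] for code in codes if code in index)
    return dict(counts)

def age_in_years(birth_time, prediction_time):
    """Approximate age at the label's prediction time, in years."""
    return (prediction_time - birth_time).days / 365.25

# Hypothetical vocabulary and patient codes.
vocabulary = ["ICD10CM/E11.9", "LOINC/4548-4", "RxNorm/860975"]
patient_codes = ["ICD10CM/E11.9", "RxNorm/860975", "ICD10CM/E11.9", "CPT4/99213"]

features = count_featurize(patient_codes, vocabulary)  # {0: 2, 2: 1}
age = age_in_years(datetime(2000, 1, 1), datetime(2010, 1, 1))
print(features, round(age, 2))
```

Storing only the non-zero entries keeps the vector small when the vocabulary is large and each patient has only a handful of distinct codes.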

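The hash-based train/test split mentioned in the Splits row can be sketched as follows. This is an assumption-laden illustration, not the body of generate_hash_split: the `hash_split` helper, the choice of SHA-256, the seed, and the test fraction are all hypothetical.

```python
import hashlib

def hash_split(patient_ids, test_frac=0.15, seed=97):
    """Assign each patient to train or test from a hash of its id.

    The assignment depends only on (seed, patient id), so it is stable
    across runs and does not change when other patients are added.
    """
    train, test = [], []
    for patient_id in patient_ids:
        digest = hashlib.sha256(f"{seed}:{patient_id}".encode()).digest()
        # Map the first 8 bytes of the digest to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_frac else train).append(patient_id)
    return train, test

patient_ids = list(range(1000))
train_ids, test_ids = hash_split(patient_ids)
print(len(train_ids), len(test_ids))
```

Compared with an index-based split, a hash-based split needs no stored assignment table: any process that knows the seed can recompute which side a patient belongs to.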
Alternatives to FEMR

There is a Python package called PyHealth for training foundation models on EHR data. Michael ran some benchmarking tests last year and found it painfully slow because it uses pandas DataFrames under the hood. Here are the preliminary results from Michael's analysis.