ScaleFEx℠ (Scalable Feature Extraction) is an open-source Python pipeline designed to extract biologically meaningful features from large high-content imaging (HCI) datasets.
Read more about it in the preprint: https://doi.org/10.1101/2023.07.06.547985
- Robust Feature Extraction: Utilizes advanced algorithms to distill complex image data into critical features that drive insights into cellular phenotypes.
- High-Content Imaging Focus: Tailored for large-scale HCI screens, addressing challenges of scale, variability, and high-dimensionality in biomedical imaging.
- Low overhead: parallelizes computation across CPU cores to make full use of the available computing resources.
- AWS implementation: scales up further in the cloud and is easy to deploy. Cheaper and faster than the state of the art.
For a full description on how to run ScaleFEx on AWS, see the Wiki page here
- Git (version control)
  - on macOS, install the Xcode command line tools by running xcode-select --install in terminal
- Anaconda (package manager)
- Access to your computer's command line (terminal on macOS, Linux; cmd on Windows)
Follow these steps to set up and run ScaleFEx via command line:
Ensure you have Conda installed, then create a new environment:
conda create --name ScaleFEx python=3.12
conda activate ScaleFEx
Clone the repository and navigate into the main folder:
git clone https://github.com/NYSCF/ScaleFEx.git
cd ScaleFEx
Install the repository package:
pip install .
These steps will set up the environment with all necessary dependencies isolated to ensure everything works smoothly.
To check that all the packages are correctly installed, run this command without modifying the parameters file:
python3 scalefex_main.py
You should be able to visualize the single cells detected in the sample data provided with the code.
Navigate to the folder where the repository was cloned and open the parameters.yaml file to edit it. Once the code is run, a copy of the used parameters will be saved for your records.
NOTE: if you leave the parameters as they are, the code will compute ScaleFEx on the sample dataset provided
- vector_type: Write 'scalefex' for the feature vector, '' if you want only the preprocessing part (specified below)
- resource: 'local' for local computation, 'AWS' for cloud computing
- n_of_workers: 60 ;int, number of workers to use in parallel. If computing on AWS, this parameter will be ignored, as it is fixed in the AWS framework
- 🟦 exp_folder: '/path/to/images/' ;path to the folder containing the images
- experiment_name: 'exp001' ;this name will be appended to the saved files
- saving_folder: '/path/to/saving/folder/' ;path to the saving folder
- 🟩 plates: ['1','2'] ;list of plates if you want to process a subset, 'all' for all of the ones found in the folder
- 🟥 plate_identifiers: ['Plate',''] ;strings used to find the plate number; they should directly precede and follow the plate name (e.g. for the default values the plate name extracted would be 1: exp_folder/Plate1/*.tiffs)
  - NOTE: The plate identifiers do not need to contain all the strings within the folder name, only the strings that are constant and can identify the plate. This allows the plate to be identified even when the folder names do not follow the same pattern. E.g. folders sometimes include time stamps, and it would be hard to query the plate without inputting all of the folder names.
  (example: exp_folder/some_strings < identifier1 > Plate < identifier2 > some_other_string/*.tiffs)
- 🟧 pattern: 'Images/<Well>f<Site>p<Plane(2)>-<Channel(3)>.' # pattern of the image file: specify all the characters that make up the filepath indicating the location (more details in the wiki)
- 🟪 file_extensions: ['tiff',] ;specify the extensions of the image files
- channel: ['ch4','ch1', 'ch5', 'ch3', 'ch2'] ;channels to be processed. NOTE: the nuclear channel should be first
- zstack: False ;Set to True if you have multi-plane images
- ROI: 150 ;Radius of the crop to be cut around the cell
- neurite_tracing: '' ;channel where to compute tracing (if any)
- RNA_channel: 'ch5' ;RNA channel (if any) #set only if you want to compute ScaleFEx
- Mito_channel: 'ch2' ;Mitochondria channel (if any) #set only if you want to compute ScaleFEx
- downsampling: 1 ;Downsampling ratio
- QC: True ;True if the user wants to compute a tile-level Quality Control step
- FFC: True ;True to compute the Flat Field Correction
- FFC_n_images: 500 ;number of images to be used to produce the background trend image for the Flat Field Correction
- csv_coordinates: '' ;'' if you don't want to use a pre-computed coordinates file, otherwise, path to the coordinate file. The columns and format of the csv file need to be as follows, which is the output of the pipeline:
  If another code is used to extract the coordinates and the information about the distance is missing, make an empty column called distance
- segmenting_function: 'Nuclei_segmentation.nuclei_location_extraction' for the thresholding method
- save_coordinates: True ;save a csv file with the coordinates for each plate
- min_cell_size: 200 ;min area of the cell, all objects with a smaller area will be removed
- max_cell_size: 100000 ;max area of the cell, all objects with a bigger area will be removed
- visualization: False ;if True, the segmentation masks of the entire field will be visualized (using matplotlib). NOTE: we suggest visualizing the masks for testing, but turning this off when processing large screens
- visualize_masks: False ;visualize the segmentation mask from each channel. NOTE: we suggest visualizing the masks for testing, but turning this off when processing large screens
- visualize_crops: False ;visualize the crop of the cell. This helps in choosing the best ROI size. NOTE: we suggest visualizing the crops for testing, but turning this off when processing large screens
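Before launching a long run, it can help to sanity-check the edited parameters. The sketch below is only an illustration: check_parameters is a hypothetical helper, not part of ScaleFEx, and its rules are inferred from the parameter descriptions above (allowed values for vector_type and resource, RNA and mitochondria channels being listed in channel, min_cell_size below max_cell_size). It assumes the parameters have already been loaded into a flat Python dict.

```python
def check_parameters(params):
    """Return a list of human-readable problems found in a parameters dict.

    Hypothetical helper: the rules below are assumptions inferred from the
    parameter descriptions, not checks performed by ScaleFEx itself.
    """
    problems = []
    channels = params.get("channel") or []

    if params.get("vector_type") not in ("scalefex", ""):
        problems.append("vector_type should be 'scalefex' or ''")
    if params.get("resource") not in ("local", "AWS"):
        problems.append("resource should be 'local' or 'AWS'")
    if not channels:
        problems.append("channel should list at least the nuclear channel")

    # Optional channels, when set, are assumed to be among the processed ones.
    for key in ("RNA_channel", "Mito_channel"):
        value = params.get(key, "")
        if value and value not in channels:
            problems.append(f"{key}={value!r} is not listed in channel")

    if params.get("min_cell_size", 0) >= params.get("max_cell_size", float("inf")):
        problems.append("min_cell_size should be smaller than max_cell_size")
    return problems


example = {
    "vector_type": "scalefex",
    "resource": "local",
    "channel": ["ch4", "ch1", "ch5", "ch3", "ch2"],
    "RNA_channel": "ch5",
    "Mito_channel": "ch2",
    "min_cell_size": 200,
    "max_cell_size": 100000,
}
print(check_parameters(example))  # an empty list means no problem was found
```

An empty list means none of the assumed rules were violated; it does not guarantee that the paths or the file pattern match your data.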
The colored parameters described above are used in parsing data and are used together to search for files in the following way:
where each plate in plates and each extension in file_extensions is substituted to match all possible combinations.
Please consult the Querying Data wiki for more information.
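As a rough illustration of the "all possible combinations" idea, the sketch below builds one glob-style search pattern per plate and file extension. It is an assumption-based example, not the pipeline's query code: build_search_patterns is a hypothetical name, the wildcard placement follows the exp_folder/Plate1/*.tiffs example given for plate_identifiers, and the pattern parameter is ignored for simplicity.

```python
from itertools import product


def build_search_patterns(exp_folder, plates, plate_identifiers, file_extensions):
    """Build one glob-style pattern per (plate, extension) combination.

    Hypothetical helper for illustration only; the real query logic is
    described in the Querying Data wiki.
    """
    before, after = plate_identifiers
    patterns = []
    for plate, ext in product(plates, file_extensions):
        # Wildcards absorb the variable parts of the folder name (e.g. time stamps).
        plate_folder = f"*{before}{plate}{after}*"
        patterns.append(f"{exp_folder.rstrip('/')}/{plate_folder}/*.{ext}")
    return patterns


for p in build_search_patterns("/path/to/images/", ["1", "2"], ["Plate", ""], ["tiff"]):
    print(p)
```

In this sketch an empty second identifier yields *Plate1*, which would also match a folder named Plate10, so setting a trailing identifier is worthwhile whenever the folder names allow it.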
AWS specific parameters
For a full description on how to run ScaleFEx on AWS, see the Wiki page here
- s3_bucket: 'your-bucket'; name of the S3 Bucket storing your images
- nb_subsets: 6; how many machines per plate you want to deploy
- subset_index: 'all'; use an int to compute only a specific subset (e.g. '2')
- region: 'us-east-1'; what region you want to deploy machines into
- instance_type: 'c5.12xlarge' ; Machine type/size
- amazon_image_id: 'ami-06c68f701d8090592' ; AMI linked to region
- ScaleFExSubnetA: 'subnet-XXXXXXXXXXXXXXXXX' ; ID of the subnet you want to use for machine deployment, empty string if you want to use the default one
- ScaleFExSubnetB: 'subnet-XXXXXXXXXXXXXXXXX' ; second subnet you want to use; if you only have one, use the same
- ScaleFExSubnetC: 'subnet-XXXXXXXXXXXXXXXXX' ; third subnet you want to use; if you only have one, use the same
- security_group_id: 'sg-XXXXXXXXXXXXXXXXX' ; security group you want to use, empty string if you want to use the default one
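To give an intuition for nb_subsets and subset_index, the sketch below splits the wells of one plate into near-equal chunks, one per machine, and optionally selects a single chunk. This is only a plausible illustration: the function names, the well labels, the contiguous splitting strategy and the zero-based indexing are all assumptions, and the actual AWS implementation may partition the work differently.

```python
def split_into_subsets(wells, nb_subsets):
    """Split a list of wells into nb_subsets contiguous, near-equal chunks."""
    base, extra = divmod(len(wells), nb_subsets)
    subsets, start = [], 0
    for i in range(nb_subsets):
        # The first `extra` chunks receive one additional well.
        size = base + (1 if i < extra else 0)
        subsets.append(wells[start:start + size])
        start += size
    return subsets


def select_subsets(wells, nb_subsets, subset_index="all"):
    """Return every subset for 'all', or only the requested one (zero-based, assumed)."""
    subsets = split_into_subsets(wells, nb_subsets)
    if subset_index == "all":
        return subsets
    return [subsets[int(subset_index)]]


wells = [f"W{i:02d}" for i in range(14)]  # hypothetical well labels
for chunk in select_subsets(wells, nb_subsets=6):
    print(chunk)
```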
Execute Analysis:
If running the code locally:
From the terminal:
After setting the parameters in the yaml file and updating the parameters.yaml file name and location within the scalefex_main.py file, navigate to the folder of your code and execute
python3 scalefex_main.py
Alternatively, you can specify the parameter file location calling the code this way:
python scalefex_main.py -p parameters_test.yaml
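For readers wrapping the pipeline in their own scripts, a -p option like the one above can be handled with the standard argparse module. This is a minimal sketch and not the parsing code of scalefex_main.py: the long option name --parameters and the default value are assumptions made for the example.

```python
import argparse


def parse_args(argv=None):
    """Parse a -p option pointing to a YAML parameter file."""
    parser = argparse.ArgumentParser(
        description="Run a pipeline with a YAML parameter file."
    )
    # The long name and the default are assumptions for this sketch.
    parser.add_argument(
        "-p", "--parameters",
        default="parameters.yaml",
        help="path to the YAML parameter file",
    )
    return parser.parse_args(argv)


args = parse_args(["-p", "parameters_test.yaml"])
print(args.parameters)
```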
If you want to run ScaleFEx℠ in a notebook, look at the example described in the Example section
If running the code on AWS: Deploy the 'ScaleFEx_main.yaml' Cloudformation template available here and set your parameters. A detailed guide is available here
An example notebook for running our pipeline on a single field is included here. To run it, make sure you have installed the required library (in the terminal, run pip install notebook)
An example of a possible analysis that can be performed on the ScaleFEx features is outlined in demos/demo_scalefex_analysis.ipynb
The dataset used to validate ScaleFEx in the publication is publicly available:
- Raw imaging data (2.3TB): https://nyscfopensource.blob.core.windows.net/nyscfopensource/scalefex/ScaleFExDataset.zip
- Raw ScaleFEx features from the same dataset, with associated metadata: https://nyscfopensource.blob.core.windows.net/nyscfopensource/scalefex/scalefex_raw_features.parquet
- Processed features (normalized, corrected and uncorrelated features) with associated metadata: https://github.com/NYSCF/ScaleFEx/edit/main/README.md#:~:text=ScaleFEx_corrected_averaged_features
To retrieve the metadata for the images, download either of the ScaleFEx feature files and merge it with the well and plate information encoded in the filename of each image.
The channel stains are:
Channel name | Stain | Cell target |
---|---|---|
ch1 | Concanavalin A | ER |
ch2 | WGA and Phalloidin | GP |
ch3 | MitoTracker | Mitochondria |
ch4 | Hoechst | Nuclei |
ch5 | SYTO14 | Cytoplasmic RNA |
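To merge features with image metadata, the well and channel first have to be read from each filename. The sketch below parses a name that follows the default pattern 'Images/<Well>f<Site>p<Plane(2)>-<Channel(3)>.' with a regular expression. It is an illustration under assumptions: parse_image_name and the example filename are hypothetical, the site is assumed to be numeric, and the plate is left out because its encoding depends on the folder layout.

```python
import re
from pathlib import PurePosixPath

# <Well>f<Site>p<Plane(2)>-<Channel(3)>. expressed as a regular expression.
# Assumption: the site is numeric; plane and channel have fixed widths of 2 and 3.
FILENAME_RE = re.compile(
    r"(?P<well>.+?)f(?P<site>\d+)p(?P<plane>.{2})-(?P<channel>.{3})\."
)


def parse_image_name(path):
    """Extract well, site, plane and channel from an image path.

    Returns a dict of strings, or None if the filename does not match.
    """
    match = FILENAME_RE.match(PurePosixPath(path).name)
    return match.groupdict() if match else None


# Hypothetical filename used only to exercise the parser.
print(parse_image_name("Images/r01c02f03p01-ch4.tiff"))
```

The resulting well and channel values can then serve as keys when joining with the feature tables.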
ScaleFEx℠ is released under the BSD-3-Clause Clear license. For more details read the LICENSE file.