We present the Mouse Kidney Atlas (MKA), a comprehensive atlas of cellular heterogeneity in the healthy mouse kidney, generated by carefully integrating data from eight publicly available studies. The datasets were integrated using scVI and scANVI. To overcome annotation inconsistencies, we learned the relationships between cell type transcriptomic profiles across datasets using scHPL. The resulting model can automatically label unseen cell populations with unprecedented resolution and accuracy. We demonstrate the significance of the atlas by obtaining robust and novel markers for poorly described cell types.
The MKA is publicly available to download, visualize and interact with at cellxgene.
For more details, refer to: A comprehensive mouse kidney atlas enables rare cell population characterization and robust marker discovery.
- `models`: files containing the trained models used in the manuscript
- `notebooks`: notebooks used to generate the figures presented in the manuscript
  - `QC_scVI_scANVI`: Figure 1
  - `scHPL_ManualReannotation`: Figures 2 and 3; Supplementary Figures 1, 2 and 3
  - `scHPL_Evaluation`: Figure 4; Supplementary Figures 4 and 5
  - `Downstream_analyses`: Figure 5; Supplementary Figure 6
- `MKA_Metamarkers.xlsx`: Excel file with the identified metamarkers for each cell type label in the MKA (a short pandas sketch for exploring it follows this list).
  - Rank: overall ranking of the gene within a cell type. The higher the ranking, the better the gene serves as a marker for the given population, accounting for batch differences and the number of datasets in which the gene is detected.
  - AUROC: area under the receiver operating characteristic curve, indicating how well the gene performs in a classification scenario. For example, Podxl has an AUROC of 0.9, meaning this gene is very good at classifying Podocytes as such.
- `functions.py`: helper functions used across the code
- `hyper_tune.py`: Ray Tune implementation to optimize scVI model hyperparameters
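The metamarker table referenced above can be loaded and filtered with pandas. This is a minimal sketch only: the column names `cell_type` and `gene` are assumptions (only the Rank and AUROC columns are described above), so adjust them to the actual sheet.

```python
import pandas as pd

# Load the metamarker table shipped with the repository
markers = pd.read_excel("MKA_Metamarkers.xlsx")

# Column names "cell_type" and "gene" are assumed; "Rank" and "AUROC" are
# described above. A higher Rank means a better marker, so sort descending
# and keep the ten best-ranked genes per cell type label.
top_markers = (
    markers.sort_values("Rank", ascending=False)
           .groupby("cell_type")
           .head(10)
)
print(top_markers[["cell_type", "gene", "Rank", "AUROC"]])
```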
If you want to use the models for your own research, you will need the HVG-filtered matrix they were trained on. You can find the AnnData object on Zenodo. Once downloaded, you can:
```python
import os

import scanpy as sc
import scvi

os.chdir("MKA")

# HVG-filtered AnnData object downloaded from Zenodo
adata = sc.read_h5ad("adata.h5ad")

# Trained scANVI model, loaded against the matching AnnData object
atlas_model = scvi.model.SCANVI.load("models/scANVI_model_full", adata=adata)
```
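Continuing from the snippet above, the loaded model can then be used, for instance, to embed the cells in the integrated latent space or to predict cell type labels. The `.obsm`/`.obs` key names below are arbitrary choices, not fixed by the atlas:

```python
# Integrated latent representation learned by scANVI,
# useful for neighbours/UMAP and other downstream analyses
adata.obsm["X_scANVI"] = atlas_model.get_latent_representation(adata)

# Cell type labels predicted by the scANVI classifier
adata.obs["predicted_cell_type"] = atlas_model.predict(adata)
```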
Ray Tune was used to train 1000 different hyperparameter and model configurations. The metrics tracked at each training epoch were `elbo_validation`, `reconstruction_loss` and `silhouette_score`. Batch and cell type silhouette scores computed on the latent space were used as objective functions to maximize during training.
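As an illustration of how such objectives can be computed from a latent embedding (a sketch only; the exact scoring in `hyper_tune.py` may differ, and the `1 - |ASW|` convention for the batch term is an assumption borrowed from common integration benchmarks):

```python
from sklearn.metrics import silhouette_score

def latent_silhouette_scores(latent, cell_types, batches):
    """Cell type and batch silhouette scores on a latent embedding.

    A high cell type ASW rewards well-separated cell types; for the batch
    term, 1 - |ASW| is high when batches are well mixed (assumed convention).
    """
    cell_type_asw = silhouette_score(latent, cell_types)
    batch_asw = 1 - abs(silhouette_score(latent, batches))
    return cell_type_asw, batch_asw
```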
The search space was defined as follows:
- model configuration
  - dropout rate: loguniform distribution between `1e-4` and `1e-1`
  - number of layers: random integer between `1` and `3`
  - number of latent dimensions: random integer between `20` and `31`
- plan configuration
  - learning rate: loguniform distribution between `1e-4` and `1e-1`
- atlas architecture
  - subset: random boolean (`True`/`False`). The purpose of this parameter is to test the effect of filtering the feature space
  - number of hvgs: random choice between `2000` and `8000` in increments of `1000`
  - continuous_covariates: random choice between `'pct_counts_mt'` and `None`
  - categorical_covariates: random choice between `'Source'` and `None`. 'Source' in this case refers to either nuclei or cells as the starting material
- number of epochs: random integer between `100` and `201`
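For reference, the search space above could be written with Ray Tune roughly as follows. This is a sketch only: the dictionary keys are illustrative and need not match the actual configuration in `hyper_tune.py`.

```python
from ray import tune

# Illustrative search space mirroring the list above; key names are assumed.
search_space = {
    # model configuration
    "dropout_rate": tune.loguniform(1e-4, 1e-1),
    "n_layers": tune.randint(1, 3),
    "n_latent": tune.randint(20, 31),
    # plan configuration
    "lr": tune.loguniform(1e-4, 1e-1),
    # atlas architecture
    "subset": tune.choice([True, False]),
    "n_hvgs": tune.choice([2000, 3000, 4000, 5000, 6000, 7000, 8000]),
    "continuous_covariates": tune.choice(["pct_counts_mt", None]),
    "categorical_covariates": tune.choice(["Source", None]),
    # training length
    "max_epochs": tune.randint(100, 201),
}
```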
The following table lists all studies included in the MKA:
Publication | Abbreviation | Accession number |
---|---|---|
Wu et al., 2019 | Wu19 | GSE119531 |
Miao et al., 2021 | Miao21 | GSE157079 |
Park et al., 2018 | Park18 | GSE107585 |
Kirita et al., 2020 | Kirita20 | GSE139107 |
Dumas et al., 2020 | Dumas20 | E-MTAB-8145 |
Conway et al., 2020 | Conway20 | GSE140023 |
Hinze et al., 2021 | Hinze21 | GSE145690 |
Janosevic et al., 2021 | Janosevic21 | GSE151658 |