basic sampling #686

eroell · 2024-04-09T16:47:07Z

PR Checklist

This comment contains a description of changes (with reason)
Referenced issue is linked. Part of Bias detection module #647
If you've fixed a bug or added code that should be tested, add tests!
Documentation in docs is updated

Description of changes
New function ehrapy.pp.sampling(...) based on imbalanced-learn

Technical details
At the moment, supports RandomUnderSampler and RandomOverSampler

Additional context
Example:

import ehrapy as ep

adata = ep.data.diabetes_130_fairlearn(columns_obs_only=["age"])
print("distribution of age groups:\n", adata.obs.age.value_counts())
adata_balanced = ep.pp.sample(adata, key="age")
print(
    "distribution of age groups after undersampling:\n",
    adata_balanced.obs.age.value_counts(),
)

distribution of age groups:
 age
'Over 60 years'          68541
'30-60 years'            30716
'30 years or younger'     2509

distribution of age groups after undersampling:
 age
'30 years or younger'    2509
'30-60 years'            2509
'Over 60 years'          2509
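What the undersampling in this example does to the group sizes can be sketched in plain Python. This is only an illustration of the idea (every group is cut down to the size of the rarest one, sampling without replacement); it is not the imbalanced-learn or ehrapy implementation, and `undersample_indices` is a hypothetical helper name.

```python
import random
from collections import Counter, defaultdict


def undersample_indices(labels, seed=0):
    """Return sorted row positions so that every label keeps as many rows as the rarest label."""
    rng = random.Random(seed)
    positions = defaultdict(list)
    for i, label in enumerate(labels):
        positions[label].append(i)
    n_min = min(len(p) for p in positions.values())
    kept = []
    for p in positions.values():
        # Sampling without replacement, so no row is selected twice
        kept.extend(rng.sample(p, n_min))
    return sorted(kept)


# Group sizes taken from the example output above
labels = (
    ["Over 60 years"] * 68541
    + ["30-60 years"] * 30716
    + ["30 years or younger"] * 2509
)
kept = undersample_indices(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts)
```

Because rows are drawn without replacement, undersampling never produces duplicate row positions; that only becomes a concern with oversampling or sampling with replacement.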

To decide

Rather call it sampling, or rather call it balancing?
Rather make new index for new AnnData, or keep the old, non-unique indices?
Rather keep all computed fields (.varm, .obsm, etc) or discard them?
If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there through our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered)

Zethson · 2024-04-10T08:07:48Z

Rather call it sampling, or rather call it balancing?

I'm probably more in favor of sampling and then we introduce several options and flavors.

Rather make new index for new AnnData, or keep the old, non-unique indices?

Hmm, non-unique indices sound like a bad idea

Rather keep all computed fields (.varm, .obsm, etc) or discard them?

We had this discussion before in a different context and we opted for discard, right?

If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there through our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered)

I'd suggest to keep user-specific information

Zethson

I'm also not confident in the naming and how to handle the other fields. Curious what @Lilly-May thinks (later)

Zethson · 2024-04-10T08:12:00Z

ehrapy/preprocessing/_sampling.py

+    # results computed from data should be recomputed if the data changes
+    del adata_sampled.obsm
+    del adata_sampled.varm
+    del adata_sampled.uns


Seems a bit nuclear...

if staying somewhat consistent with scanpy: the closest thing, sc.pp.subsample, does not delete any field...
I am somewhat leaning towards that now

Then let's do that. Keep it

@Lilly-May fine with you? (You can disagree!)

Lilly-May · 2024-04-10T12:50:24Z

Rather keep all computed fields (.varm, .obsm, etc) or discard them?

I would also delete them. Otherwise, it's likely that someone forgets to recalculate things and ends up drawing conclusions from the values calculated for the entire dataset, not the subsampled one.

If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there through our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered)

I also think ideally we would keep the user-specific variables and discard the ones calculated by ehrapy. How would we implement that though? Having a parameter where the user specifies the variables they would like to keep? Doing it the other way around (having a list of vars calculated by ehrapy and deleting those if present) seems like a challenge to maintain...
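The "parameter where the user specifies the variables they would like to keep" option could look roughly like the sketch below. It only illustrates the idea raised in this comment; `drop_computed_var_columns` and its `keep` parameter are hypothetical names, not part of ehrapy, and the toy `.var` table is made up.

```python
import pandas as pd


def drop_computed_var_columns(var, keep):
    """Return a copy of `var` restricted to the columns the user asked to keep."""
    missing = [c for c in keep if c not in var.columns]
    if missing:
        raise KeyError(f"columns not found in var: {missing}")
    return var.loc[:, list(keep)].copy()


# Toy stand-in for adata.var with one user-entered and one computed column
var = pd.DataFrame(
    {
        "unit": ["mg/dL", "years"],  # user-specific information
        "highly_variable": [True, False],  # result of a computation
    },
    index=["glucose", "age"],
)
var_kept = drop_computed_var_columns(var, keep=["unit"])
print(var_kept)
```

An allow-list like this puts the burden on the user instead of on a maintained list of ehrapy-computed columns, which is the trade-off raised in the comment.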

Lilly-May · 2024-04-10T12:54:18Z

Another thing to consider: The oversampling simply replicates data points, right? Because that will mess up downstream neighbor calculations and thus also UMAP calculations, etc. I don't think there's anything we can do about that except potentially logging a warning for the user?

eroell · 2024-04-12T15:18:54Z

The oversampling simply replicates data points, right?

Yes, exactly: the RandomOverSampler from imblearn only replicates existing data points.

I would not raise a warning every time the function is used, but I'll add a note about it to the documentation 👍
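The effect on neighbor calculations mentioned in this exchange can be illustrated with a small standard-library sketch. It assumes nothing about ehrapy or imblearn internals: the points are random toy data, and `nearest_neighbor_distance` is a hypothetical helper.

```python
import math
import random

rng = random.Random(0)
# Five distinct 2-D points standing in for patient feature vectors
points = [(rng.random(), rng.random()) for _ in range(5)]

# Random oversampling: append three verbatim copies of existing rows, drawn with replacement
oversampled = points + [rng.choice(points) for _ in range(3)]


def nearest_neighbor_distance(i, data):
    """Distance from data[i] to its closest other row."""
    return min(math.dist(data[i], data[j]) for j in range(len(data)) if j != i)


nn_before = [nearest_neighbor_distance(i, points) for i in range(len(points))]
nn_after = [nearest_neighbor_distance(i, oversampled) for i in range(len(oversampled))]
print(nn_before)
print(nn_after)
```

Every replicated row has an exact copy as its nearest neighbor at distance zero, so a neighbor graph built on the oversampled data partly reflects the replication rather than the structure of the original data.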

Zethson · 2024-04-14T12:58:35Z

@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.

eroell · 2024-04-15T11:51:11Z

OK - about duplicated indices in adata.obs after oversampling (or when sampling with replacement):
sc.pp.subsample does not modify the indices; for the sake of consistency, I'd suggest we don't either. (Subsampling in scanpy is always without replacement, so indices are never duplicated and this question never arises there.)
If needed, the user can call reset_index on adata.obs after oversampling or sampling with replacement.
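A minimal pandas sketch of that user-side step follows. The data frame is a toy stand-in for `adata.obs`, the patient IDs and the `original_index` column name are made up for illustration, and row replication is imitated with `iloc` rather than an actual sampler.

```python
import pandas as pd

# Toy stand-in for adata.obs
obs = pd.DataFrame(
    {"age": ["Over 60 years", "Over 60 years", "30 years or younger"]},
    index=["patient_0", "patient_1", "patient_2"],
)

# Oversampling the minority group by replication repeats row position 2
oversampled = obs.iloc[[0, 1, 2, 2]]
print(oversampled.index.is_unique)

# Keep the original label as a column and replace the index with a fresh, unique one
deduplicated = oversampled.rename_axis("original_index").reset_index()
# AnnData conventionally uses string observation names
deduplicated.index = deduplicated.index.astype(str)
print(deduplicated.index.is_unique)
```

Keeping the old label in a column preserves the link back to the original observation, which a plain `reset_index(drop=True)` would discard.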

eroell · 2024-04-15T11:52:53Z

@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.

@Lilly-May keeping the things for scanpy consistency reasons + having the information on duplication in the docs agreeable with you? :)

Lilly-May · 2024-04-17T13:57:37Z

@Lilly-May keeping the things for scanpy consistency reasons + having the information on duplication in the docs agreeable with you? :)

Looks good to me! I think the way it's implemented now is the most intuitive solution for users👍🏻

eroell added 2 commits April 9, 2024 18:37

basic sampling

f1d86ea

doc update, maybe not correct yet

5d44089

eroell requested a review from Zethson April 10, 2024 07:07

Zethson reviewed Apr 10, 2024

Zethson requested a review from Lilly-May April 10, 2024 08:13

eroell and others added 4 commits April 10, 2024 10:49

Update ehrapy/preprocessing/_sampling.py

14c978a

Co-authored-by: Lukas Heumos

Update ehrapy/preprocessing/_sampling.py

1a3b6c2

Co-authored-by: Lukas Heumos

Update ehrapy/preprocessing/_sampling.py

8cf1690

Co-authored-by: Lukas Heumos

no print in tests

8a94d38

Merge branch 'main' into sampling

26e150e

eroell added 4 commits April 12, 2024 17:44

review comments, more sc.pp.subsample consistent, added copy

2bd5f97

add imbalanced-learn to dependencies

f795172

fix name in doc

253a0bf

fix links?

e8a6e57

docs fixed

837bda5

review-notebook-app bot mentioned this pull request Apr 17, 2024

Add bias detection theislab/ehrapy-tutorials#19

Merged

Lilly-May approved these changes Apr 17, 2024

eroell added 2 commits April 17, 2024 16:11

missing r

a0f5011

Merge branch 'main' into sampling

ebb19b2

eroell marked this pull request as ready for review April 17, 2024 14:25

eroell merged commit 2bb86b1 into theislab:main Apr 17, 2024
12 checks passed
