basic sampling #686

Merged
merged 14 commits into from
Apr 17, 2024
Conversation

eroell
Collaborator

@eroell eroell commented Apr 9, 2024

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked. Part of Bias detection module #647
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
New function ehrapy.pp.sampling(...) based on imbalanced-learn

Technical details
At the moment, supports RandomUnderSampler and RandomOverSampler
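A minimal, standard-library sketch of what random under- and oversampling do to group sizes may help here. It is not the imbalanced-learn implementation and not the code in this PR; `balanced_sample_indices`, its parameters, and the toy labels are hypothetical names chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def balanced_sample_indices(labels, method="under", seed=0):
    """Return row indices that balance the groups in `labels`.

    Hypothetical helper for illustration only; not ehrapy or imbalanced-learn code.
    """
    rng = random.Random(seed)

    # Collect the row positions that belong to each group.
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    sizes = [len(rows) for rows in groups.values()]
    if method == "under":
        # Shrink every group to the smallest group size, drawing without replacement.
        target = min(sizes)
        picked = [i for rows in groups.values() for i in rng.sample(rows, target)]
    elif method == "over":
        # Grow every group to the largest group size by re-drawing existing rows.
        target = max(sizes)
        picked = []
        for rows in groups.values():
            picked.extend(rows)
            picked.extend(rng.choices(rows, k=target - len(rows)))
    else:
        raise ValueError(f"unknown method: {method!r}")
    return sorted(picked)


labels = ["old"] * 6 + ["mid"] * 3 + ["young"] * 2
under = balanced_sample_indices(labels, method="under")
over = balanced_sample_indices(labels, method="over")
print(Counter(labels[i] for i in under))
print(Counter(labels[i] for i in over))
```

With these toy labels, undersampling leaves two rows per group and every picked index is distinct, while oversampling yields six rows per group and necessarily repeats some indices of the smaller groups.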

Additional context
Example:

import ehrapy as ep

adata = ep.data.diabetes_130_fairlearn(columns_obs_only=["age"])
print("distribution of age groups:\n", adata.obs.age.value_counts())
adata_balanced = ep.pp.sample(adata, key="age")
print(
    "distribution of age groups after undersampling:\n",
    adata_balanced.obs.age.value_counts(),
)
distribution of age groups:
 age
'Over 60 years'          68541
'30-60 years'            30716
'30 years or younger'     2509

distribution of age groups after undersampling:
 age
'30 years or younger'    2509
'30-60 years'            2509
'Over 60 years'          2509

To decide

  • Rather call it sampling, or rather call it balancing?
  • Rather make new index for new AnnData, or keep the old, non-unique indices?
  • Rather keep all computed fields (.varm, .obsm, etc) or discard them?
  • If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there through our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered)

@eroell eroell requested a review from Zethson April 10, 2024 07:07
@Zethson
Member

Zethson commented Apr 10, 2024

Rather call it sampling, or rather call it balancing?

I'm probably more in favor of sampling and then we introduce several options and flavors.

Rather make new index for new AnnData, or keep the old, non-unique indices?

Hmm, non-unique indices sound like a bad idea.

Rather keep all computed fields (.varm, .obsm, etc) or discard them?

We had this discussion before in a different context and we opted for discard, right?

If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there through our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered)

I'd suggest keeping user-specific information.

Member

@Zethson Zethson left a comment

I'm also not confident in the naming and how to handle the other fields. Curious what @Lilly-May thinks (later)

docs/usage/usage.md
ehrapy/preprocessing/_sampling.py
# results computed from data should be recomputed if the data changes
del adata_sampled.obsm
del adata_sampled.varm
del adata_sampled.uns
Member

Seems a bit nuclear...

Collaborator Author

For consistency with scanpy: the closest equivalent, sc.pp.subsample, does not delete any field.
I am leaning towards that behavior now.

Member

Then let's do that. Keep it

Member

@Lilly-May fine with you? (You can disagree!)

tests/preprocessing/test_sampling.py
@Zethson Zethson requested a review from Lilly-May April 10, 2024 08:13
@Lilly-May
Collaborator

Rather keep all computed fields (.varm, .obsm, etc) or discard them?

I would also delete them. Otherwise, it's likely that someone forgets to recalculate things and ends up drawing conclusions from the values calculated for the entire dataset, not the subsampled one.

If discard, also discard everything in .var except var name and ehrapy type? There are some things that can end up there through our computations (e.g. highly variable feature statistics), but also user-specific information (e.g. a unit they entered)

I also think ideally we would keep the user-specific variables and discard the ones calculated by ehrapy. How would we implement that though? Having a parameter where the user specifies the variables they would like to keep? Doing it the other way around (having a list of vars calculated by ehrapy and deleting those if present) seems like a challenge to maintain...
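The "keep" parameter floated above could look roughly like the following sketch. It models `.var` as a plain dict of columns so that it runs without dependencies; `subset_var_columns`, the `keep` argument, and the column names are all hypothetical and not part of ehrapy.

```python
def subset_var_columns(var_columns, keep):
    """Return only the columns named in `keep`, in their original order.

    Hypothetical helper for illustration only; `var_columns` stands in for
    the columns of `.var` as a mapping from column name to values.
    """
    missing = [name for name in keep if name not in var_columns]
    if missing:
        # Fail loudly so a typo in `keep` does not silently drop a column.
        raise KeyError(f"columns not found: {missing}")
    return {name: values for name, values in var_columns.items() if name in keep}


var_columns = {
    "feature_type": ["numeric", "categorical"],  # assumed bookkeeping column
    "unit": ["mg/dl", None],                     # entered by the user
    "highly_variable": [True, False],            # computed from the data
}
kept = subset_var_columns(var_columns, keep=["feature_type", "unit"])
print(list(kept))
```

An explicit allow-list like this avoids maintaining a list of every column ehrapy might compute, at the cost of asking the user to name what they want to keep.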

@Lilly-May
Collaborator

Another thing to consider: The oversampling simply replicates data points, right? Because that will mess up downstream neighbor calculations and thus also UMAP calculations, etc. I don't think there's anything we can do about that except potentially logging a warning for the user?
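To illustrate the concern, here is a small standard-library sketch (toy coordinates, hypothetical helper name) showing that once a point is replicated, its nearest neighbor becomes its own copy at distance zero. It does not use the neighbor computation from scanpy or ehrapy.

```python
import math


def nearest_neighbor(points, query_idx):
    """Return (index, distance) of the closest other point by Euclidean distance."""
    best_idx, best_dist = None, math.inf
    for idx, point in enumerate(points):
        if idx == query_idx:
            continue  # a point is not its own neighbor
        dist = math.dist(points[query_idx], point)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist


original = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
# Oversampling by replication appends an exact copy of the minority point.
oversampled = original + [original[2]]

print(nearest_neighbor(original, 2))     # neighbor is a genuinely different point
print(nearest_neighbor(oversampled, 2))  # neighbor is the replicated copy
```

In a k-nearest-neighbor graph, replicated points therefore tend to fill each other's neighbor lists, which is why embeddings computed on oversampled data should be interpreted with care.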

@eroell
Collaborator Author

eroell commented Apr 12, 2024

The oversampling simply replicates data points, right?

Yes, exactly: RandomOverSampler from imblearn only replicates existing data points.

I would not raise a warning every time the function is used, but I will add that to the documentation 👍

@Zethson
Member

Zethson commented Apr 14, 2024

@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.

@eroell
Collaborator Author

eroell commented Apr 15, 2024

OK - about duplicated indices in adata.obs for the oversampling (or if sampling with replacement):
sc.pp.subsample does not modify the indices; for consistency, I'd suggest we don't either. (Subsampling in scanpy is always without replacement, so it never duplicates indices and this question never arises there.)
If needed, the user can call reset_index on adata.obs after oversampling or sampling with replacement.
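As a dependency-free sketch of what a user could do with duplicated index labels after oversampling, the hypothetical helper below appends a running counter to repeated labels. It is not ehrapy code; on a real object, resetting the index of adata.obs as suggested above, or AnnData's `obs_names_make_unique`, would be the tools to reach for.

```python
from collections import Counter


def make_unique(labels, join="-"):
    """Make labels unique by appending a counter to repeats.

    Hypothetical helper for illustration only; the first occurrence of a
    label is kept unchanged and later ones get `<label><join><n>`.
    """
    used = set()
    next_suffix = Counter()
    result = []
    for label in labels:
        candidate = label
        # Increase the suffix until the candidate has not been used yet.
        while candidate in used:
            next_suffix[label] += 1
            candidate = f"{label}{join}{next_suffix[label]}"
        used.add(candidate)
        result.append(candidate)
    return result


obs_names = ["p0", "p1", "p1", "p2", "p1"]  # "p1" was drawn three times
unique_names = make_unique(obs_names)
print(unique_names)
```

Suffixing keeps the link back to the original observation visible, whereas a plain positional index discards it; which one is preferable depends on whether downstream code needs to trace duplicates.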

@eroell
Collaborator Author

eroell commented Apr 15, 2024

@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now.

@Lilly-May is keeping the fields for scanpy consistency, plus documenting the duplication, agreeable to you? :)

ehrapy/preprocessing/_balanced_sampling.py
@Lilly-May
Collaborator

@Lilly-May is keeping the fields for scanpy consistency, plus documenting the duplication, agreeable to you? :)

Looks good to me! I think the way it's implemented now is the most intuitive solution for users👍🏻

@eroell eroell marked this pull request as ready for review April 17, 2024 14:25
@eroell eroell merged commit 2bb86b1 into theislab:main Apr 17, 2024
12 checks passed