-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
basic sampling #686
basic sampling #686
Conversation
I'm probably more in favor of sampling and then we introduce several options and flavors.
Hmm, non-unique indices sounds like a bad idea
We had this discussion before in a different context and we opted for discard, right?
I'd suggest to keep user-specific information |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also not confident in the naming and how to handle the other fields. Curious what @Lilly-May thinks (later)
ehrapy/preprocessing/_sampling.py
Outdated
# results computed from data should be recomputed if the data changes | ||
del adata_sampled.obsm | ||
del adata_sampled.varm | ||
del adata_sampled.uns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit nuclear...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if staying somewhat consistent with scanpy: the closest thing, sc.pp.subsample
, does not delete any field...
I am somewhat leaning towards that now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then let's do that. Keep it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lilly-May fine with you? (You can disagree!)
Co-authored-by: Lukas Heumos <[email protected]>
Co-authored-by: Lukas Heumos <[email protected]>
Co-authored-by: Lukas Heumos <[email protected]>
I would also delete them. Otherwise, it's likely that someone forgets to recalculate things and ends up drawing conclusions from the values calculated for the entire dataset, not the subsampled one.
I also think ideally we would keep the user-specific variables and discard the ones calculated by ehrapy. How would we implement that though? Having a parameter where the user specifies the variables they would like to keep? Doing it the other way around (having a list of vars calculated by ehrapy and deleting those if present) seems like a challenge to maintain... |
Another thing to consider: The oversampling simply replicates data points, right? Because that will mess up downstream neighbor calculations and thus also UMAP calculations, etc. I don't think there's anything we can do about that except potentially logging a warning for the user? |
yes exactly, the Would not raise a Warning everytime a function is used, but add that in the documentation 👍 |
@eroell feel free to merge it after you've resolved the comments above. I prefer API consistency (name + key behavior) with scanpy here over alternatives for now. |
OK - about duplicated indices in |
@Lilly-May keeping the things for scanpy consistency reasons + having the information on duplication in the docs agreeable with you? :) |
Looks good to me! I think the way it's implemented now is the most intuitive solution for users👍🏻 |
PR Checklist
docs
is updatedDescription of changes
New function
ehrapy.pp.sampling(...)
based on imbalanced-learnTechnical details
At the moment, supports
RandomUnderSampler
andRandomOverSampler
Additional context
Example:
To decide