-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bias detection to preprocessing #690
Conversation
Thank you so much!
If all of these are fast and don't have too many configurable parameters, it's easier to have a one-stop function. There are ways to circumvent the need to regenerate the Pandas DataFrame such as class variables, cached properties, and other ideas but let's assume that we don't need that for now.
Yeah, that sounds fair to me. Ideally, the fewer parameters the better. That's a rule for every function that we implement. Plotting functions can be exceptions.
I'd save all of them and not just the ones that we find potential biases for. We just calculate and the interpretation/usage is up to the user (we provide the tools to do so). So yes
I veto HTML reports because they're annoying to interpret and it'd be much cooler to generate whole reports for this including the cohort tracking etc in the future. This would be an entirely different subproject. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, does this function need to differentiate a bit more between the different feature types that we're also discussing?
Overall, I think that this is quite good. We'll see whether we can integrate more of the other functions into this, but I think that this function is going into the right direction for sure.
Thanks for the review @Zethson! I'll summarize the main things I'm going to change here:
One more thing to think about: Do we also want to offer plots in some way? Would the summary dataframe be the input, or should it work with the values stored in the adata without additional user input? |
I don't think that we need to offer more plots. I'd much rather consider integrating this with @eroell cohort tracking or TableOne down the road. |
Cool! Thoughts on the implementing side of things:
This function will likely be very verbose with many arguments. It is probably easier to have argument defaults consistent, and not interacting with each other? :)
Hm. A bit more involved, no? The feature importance is part of .var if I got this right, since calling a "native" ehrapy function this would be naturally accounted for indeed :) The standardized mean difference: this is quite a lot going into the direction of rank_features_groups - there, we'd even have things for categoricals/continuous stuff already, and cool test statistics, and so on. Do we want yet another one here? If smd is important, it maybe even should be in rank_features_groups, and from there be called here I think |
@eroell you are correct :) Thanks! |
# Conflicts: # ehrapy/preprocessing/__init__.py # ehrapy/tools/feature_ranking/_feature_importances.py
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely! A few minor comments. Thank you very much!
Co-authored-by: Lukas Heumos <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does what it outlines, checked out a bit also with other data than the mimicII.
Neat, does a lot of summary statistics at once really!
Co-authored-by: Eljas Roellin <[email protected]>
for more information, see https://pre-commit.ci
PR Checklist
docs
is updatedDescription of changes
This is a very drafty PR to get the discussion on the bias module started. The general idea is to have one method that calls several submethods:
I thought the usage to be something along the lines of:
I also made some necessary adjustments on the feature importances method. Firstly, I changed the default model from
regression
torf
(Random Forest) as I found this to be a lot more reliable when testing it for the bias detection, so I believe this should be the default. I also added two parameters to include the options to not log during the calculation of feature importances and to return the prediction score (R2 or accuracy).Discussion points
_feature_correlations
and_standardized_mean_differences
) public as well? If not, should I move everything into one big method since otherwise the submethods (e.g., the correlations one) are quite short, and also we would regenerate a pandas DataFrame from the AnnData several times.sensitive_features="all"
. However, I think we can't run the feature importances then because it gets too computationally expensive. Maybe we should add a parameterrun_feature_importances
that is set toFalse
by default whensensitive_features="all"
and set toTrue
by default whensensitive_features
is a list of features?generate_bias_report
?bias_detection
?n_groups_in_feature x var_n matrix
per investigated sensitive feature, so we would have one entry invarm
per sensitive feature. Do we want that? Or do we only save those for which we find potential biases?ToDos
uns
subdictcopy
parameter