Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harmonize feature type detection #701

Closed
12 tasks done
Lilly-May opened this issue Apr 20, 2024 · 1 comment · Fixed by #724
Closed
12 tasks done

Harmonize feature type detection #701

Lilly-May opened this issue Apr 20, 2024 · 1 comment · Fixed by #724
Assignees
Labels
enhancement New feature or request

Comments

@Lilly-May
Copy link
Collaborator

Lilly-May commented Apr 20, 2024

Description of feature

We currently have three issues (#637, #649, #662) dealing with how we determine feature types and the downstream problems that arise due to different approaches. I'll provide a summary of these issues and associated ToDos here, while closing the other three issues.

Problem:
We've introduced PR #697 to accurately determine feature types in ehrapy. With the new method ep.ad.infer_feature_types, feature types are guessed based on predefined rules and we prompt the user to review these annotations. Currently, feature determination occurs at multiple stages inconsistently and is saved in the adata at several places. Ideally, we would harmonize ehrapy to use exclusively use ep.ad.infer_feature_types for feature annotation, eliminating guesswork in downstream analyses. This means that the new method would be part of the standard preprocessing steps.

ToDos:

  • Update encoding: The autodetect option of encoding here relies on the adata.var[EHRAPY_TYPE_KEY] tag, which is set when (1) reading a dataframe here or (2) moving something from obs to X here. Ideally, we would get rid of the adata.var[EHRAPY_TYPE_KEY] annotation entirely and, if autodetect is set to True in the encoding method, base the identification of features to encode on the annotation from ep.ad.infer_feature_types.
  • Ensure that the feature type annotations are not stored elsewhere. For example, during encoding, we store the information in adata.uns["var_to_encoding"] and adata.uns["encoding_to_var"].
  • Update CohortTracker to be based on ep.ad.infer_feature_types.
  • Update ep.tl.rank_features_groups to be based on ep.ad.infer_feature_types.
  • Update all methods in adata_ext to be based on ep.ad.infer_feature_types.
  • Ensure functionality for both encoded and unencoded data, so it can be used when loading an encoded dataset (I don't know why this wouldn't be the case, still I want to double-check).
  • Improve date detection in ep.ad.infer_feature_types to also work with date(time)s stored as strings and update the FHIR tutorial accordingly.
  • Decide how to deal with ordinal categorical features: Should we have explicit annotations for ordinal categorical data? In downstream analyses, these could either be treated as a continuous integer-scaled feature, or as a nominal class feature, depending on the context and analysis. (see Distinguish between ordinal and nominal categorical feature types #713)
  • Look into whether we can combine this with the feature type detection of TableOne here
  • Discuss whether we want to automatically incorporate the feature type detection into the dataset-loaders (i.e. without the user specifically calling it, which would be required to maintain the encode parameter for the data loaders)
  • Take care of df_to_anndata, which also does a lot of type inference.
  • If we don't fully automate the feature type detection, it might be good to add that to the dataloaders that we maintain
@eroell
Copy link
Collaborator

eroell commented Apr 22, 2024

Nice, thanks for the summary and binging this together here!

Ideally, we would get rid of the adata.var[EHRAPY_TYPE_KEY] annotation entirely and, if autodetect is set to True in the encoding method, base the identification of features to encode on the annotation from ep.ad.infer_feature_types.

This is not what currently is suggested in #697 right?

Also, I think this could lead to some hard to resolve issues if it is not stored but repeatedly called: e.g. labels often encountered are True/False or 0/1, or yes/no: in the 0/1 case, type inference likely infers that to continuous. And 0/1 for sure sometimes would be wanted to be categorical, and sometimes to be continuous

-> users would want to switch the annotated type sometimes for sure, which wouldnt be doable with on-the-fly type inference

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants