EvoTrees consumes too much memory and crashes for sparse matrices #87
Comments
EvoTrees is effectively designed around dense matrices, so the poor performance on sparse matrices is unfortunately expected. For clarification, my understanding is that in your use case the data is sparse as a result of one-hot encoding categorical features with a large number of modalities, as opposed to being intrinsically sparse from a large number of different features that are sparsely populated. Is that correct? Although the latter case appears trickier to adapt to, I think support for categorical features would be a reasonable feature to add. It would involve passing such features as Integers (a compact "one-cold" representation as opposed to "one-hot"). A treatment similar to what CatBoost/LightGBM do would then be applied. This won't be solved by today though :) Suggestions/comments on the proper approach are welcome!
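For illustration, a minimal sketch of the compact "one-cold" representation described above (the level names and mapping helper are hypothetical; this is not the EvoTrees API):

```julia
# A categorical feature with many modalities: instead of one column per level
# ("one-hot"), keep a single integer column ("one-cold").
levels = ["red", "green", "blue"]          # hypothetical modalities
raw    = ["blue", "red", "red", "green"]   # hypothetical observations

# Map each level to a compact integer code.
code_of = Dict(lvl => i for (i, lvl) in enumerate(levels))
x_cold  = [code_of[v] for v in raw]        # [3, 1, 1, 2]: one Int per observation
```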
Can you point out a document explaining how these algorithms handle "categorical" features?
Sure, here are 2 docs about it with CatBoost:

For LightGBM: https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features, though I don't have access to the referenced 1958 paper :)

There are a couple of ways around it. One can be to simply treat the category index as a numeric feature. This obviously results in arbitrary groupings of categories, but can nonetheless be valuable. Otherwise, the main idea is to form groupings based on target-related metrics. For example, a mapping of the index into a numeric feature based on a credibility-adjusted observed target (so as to limit the overfitting otherwise associated with a straight average of the target).

So the lazy workaround of using the category encoding as a numeric input can already be used right now. As for fancier handling of the groupings, the credibility / regularized target values appear to me as the lowest-hanging fruit (similar to the CatBoost explainer), though I'm unclear whether some alternative approach has been proven worth the implementation effort.
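To make the "credibility-adjusted observed target" idea concrete, here is a rough sketch of a smoothed mean-target encoding; the prior strength `a` and the helper name are illustrative choices, not CatBoost or EvoTrees code:

```julia
using Statistics

# Smoothed (credibility-adjusted) target encoding: blend each category's target
# mean with the global mean, weighted by the category count and a prior strength `a`.
function target_encode(categories::AbstractVector, y::AbstractVector; a = 20.0)
    μ = mean(y)
    sums   = Dict{eltype(categories), Float64}()
    counts = Dict{eltype(categories), Int}()
    for (c, t) in zip(categories, y)
        sums[c]   = get(sums, c, 0.0) + t
        counts[c] = get(counts, c, 0) + 1
    end
    # (sum + a * μ) / (count + a) shrinks rare categories toward the global mean.
    enc = Dict(c => (sums[c] + a * μ) / (counts[c] + a) for c in keys(sums))
    return [enc[c] for c in categories]
end

# Usage: one numeric column replaces the whole one-hot block.
cats  = ["a", "b", "a", "c", "b", "a"]
y     = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
x_enc = target_encode(cats, y)
```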
Yes, that is correct.
I bet!
I totally agree with this approach - either integers or CategoricalArrays; the rest of the transformations should happen inside the GBM library itself, unlike in XGBoost.
More articles on the subject: https://catboost.ai/docs/concepts/educational-materials-papers.html

We could use https://contrib.scikit-learn.org/category_encoders/_modules/category_encoders/cat_boost.html#CatBoostEncoder as a reference implementation. This package
Thanks @pgagarinov. Cross-referenced here: JuliaAI/MLJModels.jl#375
Given the improvements to the memory footprint throughout the various releases since the issue was opened, and the added support for categorical data through the new Tables API since v0.16, I'll close the issue. Don't hesitate to reopen if the mentioned concerns are still a problem.
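For reference, a minimal sketch of fitting on tabular data with a categorical column; the `fit_evotree(config, df; target_name = ...)` signature and table-based prediction reflect my reading of the EvoTrees v0.16+ Tables API and should be checked against the current docs:

```julia
using EvoTrees, DataFrames, CategoricalArrays

# Small illustrative table: one numeric and one categorical feature.
df = DataFrame(
    x_num = rand(1_000),
    x_cat = categorical(rand(["a", "b", "c"], 1_000)),
    y     = rand(1_000),
)

config = EvoTreeRegressor(nrounds = 100, max_depth = 5)

# Tables-based fit and predict (assumed entry points; see the EvoTrees documentation).
model = fit_evotree(config, df; target_name = :y)
preds = model(df)
```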
Let us consider a real ML problem (I cannot show the feature names, but each row of the summary is a separate feature) with 600k observations:
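The per-feature summary from the original issue is not reproduced in this copy; the sketch below shows how such per-feature category counts could be computed on a toy stand-in (column names and sizes are placeholders):

```julia
using DataFrames

# Toy stand-in for the anonymized data: one column per categorical feature.
df = DataFrame(f1 = rand(1:1500, 600_000), f2 = rand(1:1500, 600_000))

# Number of distinct categories per feature
# (in the real data this averaged roughly 1500 per feature).
ncats = DataFrame(
    feature      = names(df),
    n_categories = [length(unique(df[!, c])) for c in names(df)],
)
```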
As we can see, there are many categorical features, with approximately 1500 categories per feature on average.
Let us approximate the real features with random one-hot encoded features represented as sparse matrices and try to apply both XGBoost and EvoTrees (see the reproduction script below):
As we can see, XGBoost handles very sparse data very well, while EvoTrees allocates too much memory and crashes with an OOM error.
The notebook is attached
julia_evotrees.zip
The code to reproduce as a plain script:
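The original script is not included in this copy of the issue. Below is a hedged reconstruction of the benchmark described above: random sparse one-hot features, then fitting XGBoost and EvoTrees. The sizes, hyperparameters, and the exact XGBoost.jl / EvoTrees.jl call signatures are assumptions, not the original code:

```julia
using SparseArrays, Random
using XGBoost, EvoTrees

Random.seed!(42)

# Assumed sizes matching the description: 600k observations, ~1500 categories per feature.
n_obs, n_feats, n_cats = 600_000, 20, 1_500

# Sparse one-hot matrix: each categorical feature contributes a block of n_cats
# columns with exactly one nonzero entry per row.
rows = repeat(1:n_obs, n_feats)
cols = vcat([(j - 1) * n_cats .+ rand(1:n_cats, n_obs) for j in 1:n_feats]...)
X = sparse(rows, cols, ones(Float32, length(rows)), n_obs, n_feats * n_cats)
y = rand(Float32, n_obs)

# XGBoost accepts the sparse input directly (call shape assumed from XGBoost.jl).
bst = xgboost((X, y); num_round = 10, max_depth = 6, eta = 0.1)

# EvoTrees is built around dense matrices; on sparse input of this size the fit
# allocates far too much memory and is eventually killed with an OOM error.
config = EvoTreeRegressor(nrounds = 10, max_depth = 6)
model = fit_evotree(config; x_train = X, y_train = y)
```

Reducing `n_obs` turns this into a quick sanity check rather than a deliberate OOM reproduction.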