
Commit 3b7dff7

Pushing the docs to 0.4/ for branch: 0.4.X, commit 41cd67ff6c06a91a79558e24f85aed057c930064
dirty-cat-ci committed Dec 11, 2024
1 parent 70daa5e commit 3b7dff7
Showing 127 changed files with 51,771 additions and 51,474 deletions.
2 changes: 1 addition & 1 deletion 0.4/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 17e96c066813edaab4b4a0687e164778
+config: 2b7368db1424b555105e09181716ffc1
 tags: 645f666f9bcd5a90fca523b33c5a78b7
219 changes: 132 additions & 87 deletions 0.4/CHANGES.html

Large diffs are not rendered by default.

119 changes: 92 additions & 27 deletions 0.4/CONTRIBUTING.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions 0.4/RELEASE_PROCESS.html
@@ -54,7 +54,7 @@
 <link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=26a4bc78f4c0ddb94549" />
 <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=26a4bc78f4c0ddb94549" />
 
-<script src="_static/documentation_options.js?v=6c02275b"></script>
+<script src="_static/documentation_options.js?v=c87aa342"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/clipboard.min.js?v=a7894cd8"></script>
@@ -64,7 +64,7 @@
 <script>
 DOCUMENTATION_OPTIONS.theme_version = '0.16.0';
 DOCUMENTATION_OPTIONS.theme_switcher_json_url = 'https://raw.githubusercontent.com/skrub-data/skrub/main/doc/version.json';
-DOCUMENTATION_OPTIONS.theme_switcher_version_match = '0.4.0';
+DOCUMENTATION_OPTIONS.theme_switcher_version_match = '0.4.1';
 DOCUMENTATION_OPTIONS.show_version_warning_banner = true;
 </script>
 <link rel="icon" href="_static/skrub.svg"/>
<link rel="icon" href="_static/skrub.svg"/>
@@ -73,7 +73,7 @@
 <link rel="search" title="Search" href="search.html" />
 <meta name="viewport" content="width=device-width, initial-scale=1"/>
 <meta name="docsearch:language" content="en"/>
-<meta name="docsearch:version" content="0.4.0" />
+<meta name="docsearch:version" content="0.4.1" />
 </head>


Binary file not shown.
Binary file not shown.
@@ -6,7 +6,7 @@
 companies or famous people), bringing new information assembled from external
 sources may be the key to improving the analysis.
 
-Embeddings, or vectorial representations of entities, are a conveniant way to
+Embeddings, or vectorial representations of entities, are a convenient way to
 capture and summarize the information on an entity.
 Relational data embeddings capture all common entities from Wikipedia. [#]_
 These will be called `KEN embeddings` in the following example.
@@ -204,7 +204,7 @@
 # The |Pipeline| can now be readily applied to the dataframe for prediction:
 from sklearn.model_selection import cross_validate
 
-# We will save the results in a dictionnary:
+# We will save the results in a dictionary:
 all_r2_scores = dict()
 all_rmse_scores = dict()
 
Binary file modified 0.4/_downloads/28079b3b8fa6a36780f883fc70c5a85b/01_encodings.zip
Binary file not shown.
Binary file not shown.
@@ -143,7 +143,7 @@
 
 ###############################################################################
 #
-# We see that our |fj| succesfully identified the countries,
+# We see that our |fj| successfully identified the countries,
 # even though some country names differ between tables.
 #
 # For instance, "Egypt" and "Egypt, Arab Rep." are correctly matched, as are
@@ -167,7 +167,7 @@
 augmented_df.sort_values("skrub_Joiner_rescaled_distance").tail(10)
 
 ###############################################################################
-# We see that some matches were unsuccesful
+# We see that some matches were unsuccessful
 # (e.g "Palestinian Territories*" and "Palau"),
 # because there is simply no match in the two tables.
 
@@ -343,7 +343,7 @@
 # many ways to clean a table as there are errors. |fj|
 # method is generalizable across all datasets.
 #
-# Data transformation is also often very costly in both time and ressources.
+# Data transformation is also often very costly in both time and resources.
 # |fj| is fast and easy-to-use.
 #
 # Now up to you, try improving our model by adding information into it and
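For context, the fuzzy join these hunks touch up can be sketched as follows. This is a minimal sketch with hypothetical toy tables and values; the `add_match_info` flag and the exact `fuzzy_join` signature are assumptions inferred from the `skrub_Joiner_rescaled_distance` column sorted in the hunk above.

```python
# Minimal sketch of the fuzzy join demonstrated in the example above.
# Toy tables and GDP values are hypothetical; ``add_match_info`` is
# assumed to emit the "skrub_Joiner_rescaled_distance" column used above.
import pandas as pd
from skrub import fuzzy_join

happiness = pd.DataFrame(
    {"Country": ["Egypt", "Lesotho*", "Palestinian Territories*"]}
)
gdp = pd.DataFrame(
    {
        "Country Name": ["Egypt, Arab Rep.", "Lesotho", "Palau"],
        "GDP per capita": [4295.0, 1166.0, 14907.0],
    }
)

# Approximate string matching pairs "Egypt" with "Egypt, Arab Rep." and
# "Lesotho*" with "Lesotho" -- matches an exact pandas.merge would miss.
augmented_df = fuzzy_join(
    happiness, gdp, left_on="Country", right_on="Country Name", add_match_info=True
)

# A large rescaled distance flags unsuccessful matches such as
# "Palestinian Territories*" vs "Palau".
print(augmented_df.sort_values("skrub_Joiner_rescaled_distance"))
```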
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Wikipedia embeddings to enrich the data\n\nWhen the data comprises common entities (cities,\ncompanies or famous people), bringing new information assembled from external\nsources may be the key to improving the analysis.\n\nEmbeddings, or vectorial representations of entities, are a conveniant way to\ncapture and summarize the information on an entity.\nRelational data embeddings capture all common entities from Wikipedia. [#]_\nThese will be called `KEN embeddings` in the following example.\n\nWe will see that these embeddings of common entities significantly\nimprove our results.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>This example requires `pyarrow` to be installed.</p></div>\n\n.. [#] https://soda-inria.github.io/ken_embeddings/\n\n\n .. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n\n .. |OneHotEncoder| replace::\n :class:`~sklearn.preprocessing.OneHotEncoder`\n\n .. |ColumnTransformer| replace::\n :class:`~sklearn.compose.ColumnTransformer`\n\n .. |MinHash| replace::\n :class:`~skrub.MinHashEncoder`\n\n .. |HGBR| replace::\n :class:`~sklearn.ensemble.HistGradientBoostingRegressor`\n"
+"\n# Wikipedia embeddings to enrich the data\n\nWhen the data comprises common entities (cities,\ncompanies or famous people), bringing new information assembled from external\nsources may be the key to improving the analysis.\n\nEmbeddings, or vectorial representations of entities, are a convenient way to\ncapture and summarize the information on an entity.\nRelational data embeddings capture all common entities from Wikipedia. [#]_\nThese will be called `KEN embeddings` in the following example.\n\nWe will see that these embeddings of common entities significantly\nimprove our results.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>This example requires `pyarrow` to be installed.</p></div>\n\n.. [#] https://soda-inria.github.io/ken_embeddings/\n\n\n .. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n\n .. |OneHotEncoder| replace::\n :class:`~sklearn.preprocessing.OneHotEncoder`\n\n .. |ColumnTransformer| replace::\n :class:`~sklearn.compose.ColumnTransformer`\n\n .. |MinHash| replace::\n :class:`~skrub.MinHashEncoder`\n\n .. |HGBR| replace::\n :class:`~sklearn.ensemble.HistGradientBoostingRegressor`\n"
 ]
 },
 {
@@ -263,7 +263,7 @@
 },
 "outputs": [],
 "source": [
-"from sklearn.model_selection import cross_validate\n\n# We will save the results in a dictionnary:\nall_r2_scores = dict()\nall_rmse_scores = dict()\n\ncv_results = cross_validate(\n pipeline, X_full, y, scoring=[\"r2\", \"neg_root_mean_squared_error\"]\n)\n\nall_r2_scores[\"Base features\"] = cv_results[\"test_r2\"]\nall_rmse_scores[\"Base features\"] = -cv_results[\"test_neg_root_mean_squared_error\"]\n\nprint(\"With base features:\")\nprint(\n f\"Mean R2 is {all_r2_scores['Base features'].mean():.2f} +-\"\n f\" {all_r2_scores['Base features'].std():.2f} and the RMSE is\"\n f\" {all_rmse_scores['Base features'].mean():.2f} +-\"\n f\" {all_rmse_scores['Base features'].std():.2f}\"\n)"
+"from sklearn.model_selection import cross_validate\n\n# We will save the results in a dictionary:\nall_r2_scores = dict()\nall_rmse_scores = dict()\n\ncv_results = cross_validate(\n pipeline, X_full, y, scoring=[\"r2\", \"neg_root_mean_squared_error\"]\n)\n\nall_r2_scores[\"Base features\"] = cv_results[\"test_r2\"]\nall_rmse_scores[\"Base features\"] = -cv_results[\"test_neg_root_mean_squared_error\"]\n\nprint(\"With base features:\")\nprint(\n f\"Mean R2 is {all_r2_scores['Base features'].mean():.2f} +-\"\n f\" {all_r2_scores['Base features'].std():.2f} and the RMSE is\"\n f\" {all_rmse_scores['Base features'].mean():.2f} +-\"\n f\" {all_rmse_scores['Base features'].std():.2f}\"\n)"
 ]
 },
 {
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -212,7 +212,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see that our |fj| succesfully identified the countries,\neven though some country names differ between tables.\n\nFor instance, \"Egypt\" and \"Egypt, Arab Rep.\" are correctly matched, as are\n\"Lesotho*\" and \"Lesotho\".\n\n.. topic:: Note:\n\n This would all be missed out if we were using other methods such as\n [pandas.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html),\n which can only find exact matches.\n In this case, to reach the best result, we would have to `manually` clean\n the data (e.g. remove the * after country name) and look\n for matching patterns in every observation.\n\nLet's do some more inspection of the merging done.\n\n"
+"We see that our |fj| successfully identified the countries,\neven though some country names differ between tables.\n\nFor instance, \"Egypt\" and \"Egypt, Arab Rep.\" are correctly matched, as are\n\"Lesotho*\" and \"Lesotho\".\n\n.. topic:: Note:\n\n This would all be missed out if we were using other methods such as\n [pandas.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html),\n which can only find exact matches.\n In this case, to reach the best result, we would have to `manually` clean\n the data (e.g. remove the * after country name) and look\n for matching patterns in every observation.\n\nLet's do some more inspection of the merging done.\n\n"
 ]
 },
 {
@@ -237,7 +237,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see that some matches were unsuccesful\n(e.g \"Palestinian Territories*\" and \"Palau\"),\nbecause there is simply no match in the two tables.\n\n"
+"We see that some matches were unsuccessful\n(e.g \"Palestinian Territories*\" and \"Palau\"),\nbecause there is simply no match in the two tables.\n\n"
 ]
 },
 {
@@ -452,7 +452,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We have a satisfying first result: an R\u00b2 of 0.63!\n\nData cleaning varies from dataset to dataset: there are as\nmany ways to clean a table as there are errors. |fj|\nmethod is generalizable across all datasets.\n\nData transformation is also often very costly in both time and ressources.\n|fj| is fast and easy-to-use.\n\nNow up to you, try improving our model by adding information into it and\nbeating our result!\n\n"
+"We have a satisfying first result: an R\u00b2 of 0.63!\n\nData cleaning varies from dataset to dataset: there are as\nmany ways to clean a table as there are errors. |fj|\nmethod is generalizable across all datasets.\n\nData transformation is also often very costly in both time and resources.\n|fj| is fast and easy-to-use.\n\nNow up to you, try improving our model by adding information into it and\nbeating our result!\n\n"
 ]
 },
 {
Binary file not shown.
@@ -14,7 +14,7 @@
 |joiner| is a scikit-learn compatible transformer that enables
 performing joins across multiple keys,
-independantly of the data type (numerical, string or mixed).
+independently of the data type (numerical, string or mixed).
 
 The following example uses US domestic flights data
 to illustrate how space and time information from a
@@ -106,7 +106,7 @@
 aux.head()
 
 ###############################################################################
-# Then we join this table with the airports so that we get all auxilliary
+# Then we join this table with the airports so that we get all auxiliary
 # tables into one.
 
 from skrub import Joiner
@@ -119,7 +119,7 @@
 
 ###############################################################################
 # Joining airports with flights data:
-# Let's instanciate another multiple key joiner on the date and the airport:
+# Let's instantiate another multiple key joiner on the date and the airport:
 
 joiner = Joiner(
     aux_augmented,
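For context, the multiple-key join instantiated in the hunk above can be sketched as follows. The toy tables and column names are hypothetical, and the `main_key`/`aux_key` parameter spellings are assumed from the Joiner call shown in the diff.

```python
# Minimal sketch of a multiple-key Joiner like the one instantiated above.
# The toy tables and column names are hypothetical.
import pandas as pd
from skrub import Joiner

flights = pd.DataFrame(
    {"Year_Month_DayofMonth": ["2008-01-03", "2008-01-04"], "Origin": ["ATL", "DEN"]}
)
aux_augmented = pd.DataFrame(
    {
        "YEAR/MONTH/DAY": ["2008-01-03", "2008-01-04"],
        "iata": ["ATL", "DEN"],
        "temperature": [10.2, -3.5],
    }
)

# Join on two keys at once -- the date and the airport code -- so that
# weather information lands on the matching flight rows.
joiner = Joiner(
    aux_augmented,
    main_key=["Year_Month_DayofMonth", "Origin"],
    aux_key=["YEAR/MONTH/DAY", "iata"],
)
flights_augmented = joiner.fit_transform(flights)
print(flights_augmented.head())
```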
Binary file not shown.
Binary file not shown.
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n\n# Spatial join for flight data: Joining across multiple columns\n\nJoining tables may be difficult if one entry on one side does not have\nan exact match on the other side.\n\nThis problem becomes even more complex when multiple columns\nare significant for the join. For instance, this is the case\nfor **spatial joins** on two columns, typically\nlongitude and latitude.\n\n|joiner| is a scikit-learn compatible transformer that enables\nperforming joins across multiple keys,\nindependantly of the data type (numerical, string or mixed).\n\nThe following example uses US domestic flights data\nto illustrate how space and time information from a\npool of tables are combined for machine learning.\n\n.. |fj| replace:: :func:`~skrub.fuzzy_join`\n\n.. |joiner| replace:: :func:`~skrub.Joiner`\n\n.. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n"
+"\n\n# Spatial join for flight data: Joining across multiple columns\n\nJoining tables may be difficult if one entry on one side does not have\nan exact match on the other side.\n\nThis problem becomes even more complex when multiple columns\nare significant for the join. For instance, this is the case\nfor **spatial joins** on two columns, typically\nlongitude and latitude.\n\n|joiner| is a scikit-learn compatible transformer that enables\nperforming joins across multiple keys,\nindependently of the data type (numerical, string or mixed).\n\nThe following example uses US domestic flights data\nto illustrate how space and time information from a\npool of tables are combined for machine learning.\n\n.. |fj| replace:: :func:`~skrub.fuzzy_join`\n\n.. |joiner| replace:: :func:`~skrub.Joiner`\n\n.. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n"
 ]
 },
 {
@@ -133,7 +133,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Then we join this table with the airports so that we get all auxilliary\ntables into one.\n\n"
+"Then we join this table with the airports so that we get all auxiliary\ntables into one.\n\n"
 ]
 },
 {
@@ -151,7 +151,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Joining airports with flights data:\nLet's instanciate another multiple key joiner on the date and the airport:\n\n"
+"Joining airports with flights data:\nLet's instantiate another multiple key joiner on the date and the airport:\n\n"
 ]
 },
 {
Binary file modified 0.4/_images/sphx_glr_01_encodings_001.png
Binary file modified 0.4/_images/sphx_glr_01_encodings_thumb.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_001.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_002.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_003.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_004.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_005.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_thumb.png
Binary file modified 0.4/_images/sphx_glr_08_join_aggregation_001.png
Binary file modified 0.4/_images/sphx_glr_08_join_aggregation_thumb.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_001.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_002.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_003.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_thumb.png
39 changes: 38 additions & 1 deletion 0.4/_sources/CHANGES.rst.txt
@@ -6,6 +6,43 @@ Release history
 
 .. currentmodule:: skrub
 
+Release 0.4.1
+=============
+
+Changes
+-------
+* A new parameter ``verbose`` has been added to the :class:`TableReport` to toggle on or off the
+  printing of progress information when a report is being generated.
+  :pr:`1182` by :user:`Priscilla Baah<priscilla-b>`.
+
+* A parameter ``verbose`` has been added to the :func:`patch_display` to toggle on or off the
+  printing of progress information when a table report is being generated.
+  :pr:`1188` by :user:`Priscilla Baah<priscilla-b>`.
+
+* :func:`tabular_learner` accepts the alias ``"regression"`` for the option
+  ``"regressor"`` and ``"classification"`` for ``"classifier"``.
+  :pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
+
+Bug fixes
+---------
+* Generating a ``TableReport`` could have an effect on the matplotlib
+  configuration which could cause plots not to display inline in jupyter
+  notebooks any more. This has been fixed in skrub in :pr:`1172` by
+  :user:`Jérôme Dockès <jeromedockes>` and the matplotlib issue can be tracked
+  `here <https://github.com/matplotlib/matplotlib/issues/25041>`_.
+
+* The labels on bar plots in the ``TableReport`` for columns of object dtypes
+  that have a repr spanning multiple lines could be unreadable. This has been
+  fixed in :pr:`1196` by :user:`Jérôme Dockès <jeromedockes>`.
+
+* Improve the performance of :func:`deduplicate` by removing some unnecessary
+  computations. :pr:`1193` by :user:`Jérôme Dockès <jeromedockes>`.
+
+Maintenance
+-----------
+* Make ``skrub`` compatible with scikit-learn 1.6.
+  :pr:`1169` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Release 0.4.0
 =============
 
@@ -466,7 +503,7 @@ Minor changes
 * :class:`TableVectorizer` never output a sparse matrix by default. This can be changed by
   increasing the `sparse_threshold` parameter. :pr:`646` by :user:`Leo Grinsztajn <LeoGrin>`
 
-* :class:`TableVectorizer` doesn't fail anymore if an infered type doesn't work during transform.
+* :class:`TableVectorizer` doesn't fail anymore if an inferred type doesn't work during transform.
   The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn <LeoGrin>`
 
 - Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter
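To make the 0.4.1 entries above concrete, here is a minimal usage sketch written from the changelog wording alone; the exact signatures are assumptions, the toy data is hypothetical, and `n_clusters` is a pre-existing `deduplicate` parameter used only to keep the example small.

```python
# Minimal sketch of the 0.4.1 additions listed in the changelog above.
# Signatures are assumed from the changelog wording; toy data is hypothetical.
import pandas as pd
from skrub import TableReport, deduplicate, patch_display, tabular_learner

df = pd.DataFrame({"city": ["Paris", "London", "Madrid"], "price": [3.0, 4.5, 5.1]})

# New ``verbose`` parameter on TableReport (PR 1182): 0 silences the
# progress information printed while the report is generated.
report = TableReport(df, verbose=0)

# The same toggle on patch_display (PR 1188), which patches the notebook
# display of dataframes to show table reports.
patch_display(verbose=0)

# tabular_learner now accepts "regression" as an alias for "regressor",
# and "classification" for "classifier" (PR 1180).
model = tabular_learner("regression")

# deduplicate (made faster in PR 1193) maps typo variants of category
# values onto canonical spellings.
typos = ["online meeting", "online meeting", "online meetings", "calendar", "calandar"]
clean = deduplicate(typos, n_clusters=2)
```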