
Commit 3b7dff7

Pushing the docs to 0.4/ for branch: 0.4.X, commit 41cd67ff6c06a91a79558e24f85aed057c930064
dirty-cat-ci committed Dec 11, 2024
1 parent 70daa5e commit 3b7dff7
Showing 127 changed files with 51,771 additions and 51,474 deletions.
2 changes: 1 addition & 1 deletion 0.4/.buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 17e96c066813edaab4b4a0687e164778
+config: 2b7368db1424b555105e09181716ffc1
 tags: 645f666f9bcd5a90fca523b33c5a78b7
219 changes: 132 additions & 87 deletions 0.4/CHANGES.html

Large diffs are not rendered by default.

119 changes: 92 additions & 27 deletions 0.4/CONTRIBUTING.html

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions 0.4/RELEASE_PROCESS.html
@@ -54,7 +54,7 @@
 <link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=26a4bc78f4c0ddb94549" />
 <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=26a4bc78f4c0ddb94549" />
 
-<script src="_static/documentation_options.js?v=6c02275b"></script>
+<script src="_static/documentation_options.js?v=c87aa342"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/clipboard.min.js?v=a7894cd8"></script>
@@ -64,7 +64,7 @@
 <script>
 DOCUMENTATION_OPTIONS.theme_version = '0.16.0';
 DOCUMENTATION_OPTIONS.theme_switcher_json_url = 'https://raw.githubusercontent.com/skrub-data/skrub/main/doc/version.json';
-DOCUMENTATION_OPTIONS.theme_switcher_version_match = '0.4.0';
+DOCUMENTATION_OPTIONS.theme_switcher_version_match = '0.4.1';
 DOCUMENTATION_OPTIONS.show_version_warning_banner = true;
 </script>
 <link rel="icon" href="_static/skrub.svg"/>
<link rel="icon" href="_static/skrub.svg"/>
@@ -73,7 +73,7 @@
 <link rel="search" title="Search" href="search.html" />
 <meta name="viewport" content="width=device-width, initial-scale=1"/>
 <meta name="docsearch:language" content="en"/>
-<meta name="docsearch:version" content="0.4.0" />
+<meta name="docsearch:version" content="0.4.1" />
 </head>


Binary file not shown.
Binary file not shown.
@@ -6,7 +6,7 @@
 companies or famous people), bringing new information assembled from external
 sources may be the key to improving the analysis.
 
-Embeddings, or vectorial representations of entities, are a conveniant way to
+Embeddings, or vectorial representations of entities, are a convenient way to
 capture and summarize the information on an entity.
 Relational data embeddings capture all common entities from Wikipedia. [#]_
 These will be called `KEN embeddings` in the following example.
@@ -204,7 +204,7 @@
 # The |Pipeline| can now be readily applied to the dataframe for prediction:
 from sklearn.model_selection import cross_validate
 
-# We will save the results in a dictionnary:
+# We will save the results in a dictionary:
 all_r2_scores = dict()
 all_rmse_scores = dict()
 
Binary file modified 0.4/_downloads/28079b3b8fa6a36780f883fc70c5a85b/01_encodings.zip
Binary file not shown.
Binary file not shown.
@@ -143,7 +143,7 @@
 
 ###############################################################################
 #
-# We see that our |fj| succesfully identified the countries,
+# We see that our |fj| successfully identified the countries,
 # even though some country names differ between tables.
 #
 # For instance, "Egypt" and "Egypt, Arab Rep." are correctly matched, as are
@@ -167,7 +167,7 @@
 augmented_df.sort_values("skrub_Joiner_rescaled_distance").tail(10)
 
 ###############################################################################
-# We see that some matches were unsuccesful
+# We see that some matches were unsuccessful
 # (e.g "Palestinian Territories*" and "Palau"),
 # because there is simply no match in the two tables.
 
@@ -343,7 +343,7 @@
 # many ways to clean a table as there are errors. |fj|
 # method is generalizable across all datasets.
 #
-# Data transformation is also often very costly in both time and ressources.
+# Data transformation is also often very costly in both time and resources.
 # |fj| is fast and easy-to-use.
 #
 # Now up to you, try improving our model by adding information into it and
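For context, the fuzzy join these hunks touch up can be sketched as follows. This is a minimal sketch with hypothetical toy tables and values; the `add_match_info` flag and the exact `fuzzy_join` signature are assumptions inferred from the `skrub_Joiner_rescaled_distance` column sorted in the hunk above.

```python
# Minimal sketch of the fuzzy join demonstrated in the example above.
# Toy tables and GDP values are hypothetical; ``add_match_info`` is
# assumed to emit the "skrub_Joiner_rescaled_distance" column used above.
import pandas as pd
from skrub import fuzzy_join

happiness = pd.DataFrame(
    {"Country": ["Egypt", "Lesotho*", "Palestinian Territories*"]}
)
gdp = pd.DataFrame(
    {
        "Country Name": ["Egypt, Arab Rep.", "Lesotho", "Palau"],
        "GDP per capita": [4295.0, 1166.0, 14907.0],
    }
)

# Approximate string matching pairs "Egypt" with "Egypt, Arab Rep." and
# "Lesotho*" with "Lesotho" -- matches an exact pandas.merge would miss.
augmented_df = fuzzy_join(
    happiness, gdp, left_on="Country", right_on="Country Name", add_match_info=True
)

# A large rescaled distance flags unsuccessful matches such as
# "Palestinian Territories*" vs "Palau".
print(augmented_df.sort_values("skrub_Joiner_rescaled_distance"))
```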
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Wikipedia embeddings to enrich the data\n\nWhen the data comprises common entities (cities,\ncompanies or famous people), bringing new information assembled from external\nsources may be the key to improving the analysis.\n\nEmbeddings, or vectorial representations of entities, are a conveniant way to\ncapture and summarize the information on an entity.\nRelational data embeddings capture all common entities from Wikipedia. [#]_\nThese will be called `KEN embeddings` in the following example.\n\nWe will see that these embeddings of common entities significantly\nimprove our results.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>This example requires `pyarrow` to be installed.</p></div>\n\n.. [#] https://soda-inria.github.io/ken_embeddings/\n\n\n .. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n\n .. |OneHotEncoder| replace::\n :class:`~sklearn.preprocessing.OneHotEncoder`\n\n .. |ColumnTransformer| replace::\n :class:`~sklearn.compose.ColumnTransformer`\n\n .. |MinHash| replace::\n :class:`~skrub.MinHashEncoder`\n\n .. |HGBR| replace::\n :class:`~sklearn.ensemble.HistGradientBoostingRegressor`\n"
+"\n# Wikipedia embeddings to enrich the data\n\nWhen the data comprises common entities (cities,\ncompanies or famous people), bringing new information assembled from external\nsources may be the key to improving the analysis.\n\nEmbeddings, or vectorial representations of entities, are a convenient way to\ncapture and summarize the information on an entity.\nRelational data embeddings capture all common entities from Wikipedia. [#]_\nThese will be called `KEN embeddings` in the following example.\n\nWe will see that these embeddings of common entities significantly\nimprove our results.\n\n<div class=\"alert alert-info\"><h4>Note</h4><p>This example requires `pyarrow` to be installed.</p></div>\n\n.. [#] https://soda-inria.github.io/ken_embeddings/\n\n\n .. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n\n .. |OneHotEncoder| replace::\n :class:`~sklearn.preprocessing.OneHotEncoder`\n\n .. |ColumnTransformer| replace::\n :class:`~sklearn.compose.ColumnTransformer`\n\n .. |MinHash| replace::\n :class:`~skrub.MinHashEncoder`\n\n .. |HGBR| replace::\n :class:`~sklearn.ensemble.HistGradientBoostingRegressor`\n"
 ]
 },
 {
@@ -263,7 +263,7 @@
 },
 "outputs": [],
 "source": [
-"from sklearn.model_selection import cross_validate\n\n# We will save the results in a dictionnary:\nall_r2_scores = dict()\nall_rmse_scores = dict()\n\ncv_results = cross_validate(\n pipeline, X_full, y, scoring=[\"r2\", \"neg_root_mean_squared_error\"]\n)\n\nall_r2_scores[\"Base features\"] = cv_results[\"test_r2\"]\nall_rmse_scores[\"Base features\"] = -cv_results[\"test_neg_root_mean_squared_error\"]\n\nprint(\"With base features:\")\nprint(\n f\"Mean R2 is {all_r2_scores['Base features'].mean():.2f} +-\"\n f\" {all_r2_scores['Base features'].std():.2f} and the RMSE is\"\n f\" {all_rmse_scores['Base features'].mean():.2f} +-\"\n f\" {all_rmse_scores['Base features'].std():.2f}\"\n)"
+"from sklearn.model_selection import cross_validate\n\n# We will save the results in a dictionary:\nall_r2_scores = dict()\nall_rmse_scores = dict()\n\ncv_results = cross_validate(\n pipeline, X_full, y, scoring=[\"r2\", \"neg_root_mean_squared_error\"]\n)\n\nall_r2_scores[\"Base features\"] = cv_results[\"test_r2\"]\nall_rmse_scores[\"Base features\"] = -cv_results[\"test_neg_root_mean_squared_error\"]\n\nprint(\"With base features:\")\nprint(\n f\"Mean R2 is {all_r2_scores['Base features'].mean():.2f} +-\"\n f\" {all_r2_scores['Base features'].std():.2f} and the RMSE is\"\n f\" {all_rmse_scores['Base features'].mean():.2f} +-\"\n f\" {all_rmse_scores['Base features'].std():.2f}\"\n)"
 ]
 },
 {
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -212,7 +212,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see that our |fj| succesfully identified the countries,\neven though some country names differ between tables.\n\nFor instance, \"Egypt\" and \"Egypt, Arab Rep.\" are correctly matched, as are\n\"Lesotho*\" and \"Lesotho\".\n\n.. topic:: Note:\n\n This would all be missed out if we were using other methods such as\n [pandas.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html),\n which can only find exact matches.\n In this case, to reach the best result, we would have to `manually` clean\n the data (e.g. remove the * after country name) and look\n for matching patterns in every observation.\n\nLet's do some more inspection of the merging done.\n\n"
+"We see that our |fj| successfully identified the countries,\neven though some country names differ between tables.\n\nFor instance, \"Egypt\" and \"Egypt, Arab Rep.\" are correctly matched, as are\n\"Lesotho*\" and \"Lesotho\".\n\n.. topic:: Note:\n\n This would all be missed out if we were using other methods such as\n [pandas.merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html),\n which can only find exact matches.\n In this case, to reach the best result, we would have to `manually` clean\n the data (e.g. remove the * after country name) and look\n for matching patterns in every observation.\n\nLet's do some more inspection of the merging done.\n\n"
 ]
 },
 {
@@ -237,7 +237,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We see that some matches were unsuccesful\n(e.g \"Palestinian Territories*\" and \"Palau\"),\nbecause there is simply no match in the two tables.\n\n"
+"We see that some matches were unsuccessful\n(e.g \"Palestinian Territories*\" and \"Palau\"),\nbecause there is simply no match in the two tables.\n\n"
 ]
 },
 {
@@ -452,7 +452,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We have a satisfying first result: an R\u00b2 of 0.63!\n\nData cleaning varies from dataset to dataset: there are as\nmany ways to clean a table as there are errors. |fj|\nmethod is generalizable across all datasets.\n\nData transformation is also often very costly in both time and ressources.\n|fj| is fast and easy-to-use.\n\nNow up to you, try improving our model by adding information into it and\nbeating our result!\n\n"
+"We have a satisfying first result: an R\u00b2 of 0.63!\n\nData cleaning varies from dataset to dataset: there are as\nmany ways to clean a table as there are errors. |fj|\nmethod is generalizable across all datasets.\n\nData transformation is also often very costly in both time and resources.\n|fj| is fast and easy-to-use.\n\nNow up to you, try improving our model by adding information into it and\nbeating our result!\n\n"
 ]
 },
 {
Binary file not shown.
@@ -14,7 +14,7 @@
 |joiner| is a scikit-learn compatible transformer that enables
 performing joins across multiple keys,
-independantly of the data type (numerical, string or mixed).
+independently of the data type (numerical, string or mixed).
 
 The following example uses US domestic flights data
 to illustrate how space and time information from a
@@ -106,7 +106,7 @@
 aux.head()
 
 ###############################################################################
-# Then we join this table with the airports so that we get all auxilliary
+# Then we join this table with the airports so that we get all auxiliary
 # tables into one.
 
 from skrub import Joiner
@@ -119,7 +119,7 @@
 
 ###############################################################################
 # Joining airports with flights data:
-# Let's instanciate another multiple key joiner on the date and the airport:
+# Let's instantiate another multiple key joiner on the date and the airport:
 
 joiner = Joiner(
     aux_augmented,
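For context, the multiple-key join instantiated in the hunk above can be sketched as follows. The toy tables and column names are hypothetical, and the `main_key`/`aux_key` parameter spellings are assumed from the Joiner call shown in the diff.

```python
# Minimal sketch of a multiple-key Joiner like the one instantiated above.
# The toy tables and column names are hypothetical.
import pandas as pd
from skrub import Joiner

flights = pd.DataFrame(
    {"Year_Month_DayofMonth": ["2008-01-03", "2008-01-04"], "Origin": ["ATL", "DEN"]}
)
aux_augmented = pd.DataFrame(
    {
        "YEAR/MONTH/DAY": ["2008-01-03", "2008-01-04"],
        "iata": ["ATL", "DEN"],
        "temperature": [10.2, -3.5],
    }
)

# Join on two keys at once -- the date and the airport code -- so that
# weather information lands on the matching flight rows.
joiner = Joiner(
    aux_augmented,
    main_key=["Year_Month_DayofMonth", "Origin"],
    aux_key=["YEAR/MONTH/DAY", "iata"],
)
flights_augmented = joiner.fit_transform(flights)
print(flights_augmented.head())
```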
Binary file not shown.
Binary file not shown.
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n\n# Spatial join for flight data: Joining across multiple columns\n\nJoining tables may be difficult if one entry on one side does not have\nan exact match on the other side.\n\nThis problem becomes even more complex when multiple columns\nare significant for the join. For instance, this is the case\nfor **spatial joins** on two columns, typically\nlongitude and latitude.\n\n|joiner| is a scikit-learn compatible transformer that enables\nperforming joins across multiple keys,\nindependantly of the data type (numerical, string or mixed).\n\nThe following example uses US domestic flights data\nto illustrate how space and time information from a\npool of tables are combined for machine learning.\n\n.. |fj| replace:: :func:`~skrub.fuzzy_join`\n\n.. |joiner| replace:: :func:`~skrub.Joiner`\n\n.. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n"
+"\n\n# Spatial join for flight data: Joining across multiple columns\n\nJoining tables may be difficult if one entry on one side does not have\nan exact match on the other side.\n\nThis problem becomes even more complex when multiple columns\nare significant for the join. For instance, this is the case\nfor **spatial joins** on two columns, typically\nlongitude and latitude.\n\n|joiner| is a scikit-learn compatible transformer that enables\nperforming joins across multiple keys,\nindependently of the data type (numerical, string or mixed).\n\nThe following example uses US domestic flights data\nto illustrate how space and time information from a\npool of tables are combined for machine learning.\n\n.. |fj| replace:: :func:`~skrub.fuzzy_join`\n\n.. |joiner| replace:: :func:`~skrub.Joiner`\n\n.. |Pipeline| replace::\n :class:`~sklearn.pipeline.Pipeline`\n"
 ]
 },
 {
@@ -133,7 +133,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Then we join this table with the airports so that we get all auxilliary\ntables into one.\n\n"
+"Then we join this table with the airports so that we get all auxiliary\ntables into one.\n\n"
 ]
 },
 {
@@ -151,7 +151,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Joining airports with flights data:\nLet's instanciate another multiple key joiner on the date and the airport:\n\n"
+"Joining airports with flights data:\nLet's instantiate another multiple key joiner on the date and the airport:\n\n"
 ]
 },
 {
Binary file modified 0.4/_images/sphx_glr_01_encodings_001.png
Binary file modified 0.4/_images/sphx_glr_01_encodings_thumb.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_001.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_002.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_003.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_004.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_005.png
Binary file modified 0.4/_images/sphx_glr_02_text_with_string_encoders_thumb.png
Binary file modified 0.4/_images/sphx_glr_08_join_aggregation_001.png
Binary file modified 0.4/_images/sphx_glr_08_join_aggregation_thumb.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_001.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_002.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_003.png
Binary file modified 0.4/_images/sphx_glr_09_interpolation_join_thumb.png
39 changes: 38 additions & 1 deletion 0.4/_sources/CHANGES.rst.txt
@@ -6,6 +6,43 @@ Release history
 
 .. currentmodule:: skrub
 
+Release 0.4.1
+=============
+
+Changes
+-------
+* A new parameter ``verbose`` has been added to the :class:`TableReport` to toggle on or off the
+  printing of progress information when a report is being generated.
+  :pr:`1182` by :user:`Priscilla Baah<priscilla-b>`.
+
+* A parameter ``verbose`` has been added to the :func:`patch_display` to toggle on or off the
+  printing of progress information when a table report is being generated.
+  :pr:`1188` by :user:`Priscilla Baah<priscilla-b>`.
+
+* :func:`tabular_learner` accepts the alias ``"regression"`` for the option
+  ``"regressor"`` and ``"classification"`` for ``"classifier"``.
+  :pr:`1180` by :user:`Mojdeh Rastgoo <mrastgoo>`.
+
+Bug fixes
+---------
+* Generating a ``TableReport`` could have an effect on the matplotlib
+  configuration which could cause plots not to display inline in jupyter
+  notebooks any more. This has been fixed in skrub in :pr:`1172` by
+  :user:`Jérôme Dockès <jeromedockes>` and the matplotlib issue can be tracked
+  `here <https://github.com/matplotlib/matplotlib/issues/25041>`_.
+
+* The labels on bar plots in the ``TableReport`` for columns of object dtypes
+  that have a repr spanning multiple lines could be unreadable. This has been
+  fixed in :pr:`1196` by :user:`Jérôme Dockès <jeromedockes>`.
+
+* Improve the performance of :func:`deduplicate` by removing some unnecessary
+  computations. :pr:`1193` by :user:`Jérôme Dockès <jeromedockes>`.
+
+Maintenance
+-----------
+* Make ``skrub`` compatible with scikit-learn 1.6.
+  :pr:`1169` by :user:`Guillaume Lemaitre <glemaitre>`.
+
 Release 0.4.0
 =============
 
@@ -466,7 +503,7 @@ Minor changes
 * :class:`TableVectorizer` never output a sparse matrix by default. This can be changed by
   increasing the `sparse_threshold` parameter. :pr:`646` by :user:`Leo Grinsztajn <LeoGrin>`
 
-* :class:`TableVectorizer` doesn't fail anymore if an infered type doesn't work during transform.
+* :class:`TableVectorizer` doesn't fail anymore if an inferred type doesn't work during transform.
   The new entries not matching the type are replaced by missing values. :pr:`666` by :user:`Leo Grinsztajn <LeoGrin>`
 
 - Dataset fetcher :func:`datasets.fetch_employee_salaries` now has a parameter
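To make the 0.4.1 entries above concrete, here is a minimal usage sketch written from the changelog wording alone; the exact signatures are assumptions, the toy data is hypothetical, and `n_clusters` is a pre-existing `deduplicate` parameter used only to keep the example small.

```python
# Minimal sketch of the 0.4.1 additions listed in the changelog above.
# Signatures are assumed from the changelog wording; toy data is hypothetical.
import pandas as pd
from skrub import TableReport, deduplicate, patch_display, tabular_learner

df = pd.DataFrame({"city": ["Paris", "London", "Madrid"], "price": [3.0, 4.5, 5.1]})

# New ``verbose`` parameter on TableReport (PR 1182): 0 silences the
# progress information printed while the report is generated.
report = TableReport(df, verbose=0)

# The same toggle on patch_display (PR 1188), which patches the notebook
# display of dataframes to show table reports.
patch_display(verbose=0)

# tabular_learner now accepts "regression" as an alias for "regressor",
# and "classification" for "classifier" (PR 1180).
model = tabular_learner("regression")

# deduplicate (made faster in PR 1193) maps typo variants of category
# values onto canonical spellings.
typos = ["online meeting", "online meeting", "online meetings", "calendar", "calandar"]
clean = deduplicate(typos, n_clusters=2)
```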