Created using Colab

louisbrulenaudet · Nov 3, 2024 · 7a1388d
1 parent d35d957
commit 7a1388d
Showing 1 changed file with 397 additions and 0 deletions.
diff --git a/notebooks/lemone_embed_notebook_tutorial.ipynb b/notebooks/lemone_embed_notebook_tutorial.ipynb
@@ -0,0 +1,397 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "toc_visible": true,
+      "authorship_tag": "ABX9TyOzdQuEjEqIX9Gcuv/hESlK",
+      "include_colab_link": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/louisbrulenaudet/lemone-embed/blob/main/notebooks/lemone_embed_notebook_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "<img src=\"https://huggingface.co/louisbrulenaudet/lemone-embed-pro/resolve/main/assets/thumbnail.webp\" width=\"800px\">\n",
+        "\n",
+        "# Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation\n",
+        "[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)\n",
+        "\n",
+        "<div class=\"not-prose bg-gradient-to-r from-gray-50-to-white text-gray-900 border\" style=\"border-radius: 8px; padding: 0.5rem 1rem;\">\n",
+        "    <p>The series comprises 7 models: 3 basic models of different sizes trained for 1 epoch, 3 models trained for 2 epochs (the Boost series), and a Pro model with a non-RoBERTa architecture.</p>\n",
+        "</div>\n",
+        "\n",
+        "This sentence-transformers model, designed specifically for French taxation, was fine-tuned on a 43-million-token dataset that blends semi-synthetic and fully synthetic data generated by GPT-4 Turbo and Llama 3.1 70B, then refined through evol-instruction tuning and manual curation.\n",
+        "\n",
+        "The model is tailored to meet the specific demands of information retrieval across large-scale tax-related corpora, supporting the implementation of production-ready Retrieval-Augmented Generation (RAG) applications. Its primary purpose is to enhance the efficiency and accuracy of legal processes in the taxation domain, with an emphasis on delivering consistent performance in real-world settings, while also contributing to advancements in legal natural language processing research.\n",
+        "\n",
+        "This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.\n",
+        "\n",
+        "If you use this code in your research, please cite it using the following BibTeX entry.\n",
+        "\n",
+        "```BibTeX\n",
+        "@misc{louisbrulenaudet2024,\n",
+        "  author =       {Louis Brulé Naudet},\n",
+        "  title =        {Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation},\n",
+        "  year =         {2024},\n",
+        "  howpublished = {\\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-embed-pro}},\n",
+        "}\n",
+        "```\n",
+        "\n",
+        "## Feedback\n",
+        "\n",
+        "If you have any feedback, please reach out at [[email protected]](mailto:[email protected])."
+      ],
+      "metadata": {
+        "id": "jus7eI3ptMg_"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Collecting and installing dependencies"
+      ],
+      "metadata": {
+        "id": "X_nanITItWoB"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip3 install chromadb polars datasets sentence-transformers huggingface_hub"
+      ],
+      "metadata": {
+        "id": "RBZN_of-tZBl"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Importing packages\n",
+        "\n",
+        "## Core Database and Data Processing\n",
+        "\n",
+        "- ChromaDB: A specialized vector database that will be used to store and query our embeddings efficiently\n",
+        "- Polars: A modern, high-performance DataFrame library chosen as an alternative to pandas for data manipulation tasks\n",
+        "\n",
+        "## Machine Learning Infrastructure\n",
+        "\n",
+        "- Datasets: Integration with Hugging Face's dataset library for streamlined data handling\n",
+        "- PyTorch CUDA: Capability check for GPU acceleration to optimize model performance\n",
+        "\n",
+        "## Utility Components\n",
+        "\n",
+        "- Hashlib: Standard-library hash functions, used here to compute a SHA-256 digest of each document's text as a content fingerprint\n",
+        "- Datetime: Date and time handling, used here to convert millisecond timestamps into formatted publication dates\n",
+        "- Type Hints: Comprehensive typing imports for enhanced code documentation and maintainability"
+      ],
+      "metadata": {
+        "id": "ujkbUgpZtcTn"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import hashlib\n",
+        "\n",
+        "from datetime import datetime\n",
+        "from typing import (\n",
+        "    IO,\n",
+        "    TYPE_CHECKING,\n",
+        "    Any,\n",
+        "    Dict,\n",
+        "    List,\n",
+        "    Type,\n",
+        "    Tuple,\n",
+        "    Union,\n",
+        "    Mapping,\n",
+        "    TypeVar,\n",
+        "    Callable,\n",
+        "    Optional,\n",
+        "    Sequence,\n",
+        ")\n",
+        "\n",
+        "import chromadb\n",
+        "import polars as pl\n",
+        "\n",
+        "from chromadb.config import Settings\n",
+        "from chromadb.utils import embedding_functions\n",
+        "from datasets import Dataset\n",
+        "from torch.cuda import is_available"
+      ],
+      "metadata": {
+        "id": "lWVZ_-Kytr-g"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Datasets registration\n",
+        "\n",
+        "This cell loads a Parquet dataset from Hugging Face's repository (lemone-docs-embeded) using Polars' efficient lazy loading method (scan_parquet), filters out any rows with null values in the 'text' column to ensure data quality, and finally materializes the data into memory with .collect() for further processing."
+      ],
+      "metadata": {
+        "id": "JXimNwAltfOk"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "dataframe = pl.scan_parquet(\n",
+        "    \"hf://datasets/louisbrulenaudet/lemone-docs-embeded/data/train-00000-of-00001.parquet\"\n",
+        ").filter(\n",
+        "    pl.col(\n",
+        "        \"text\"\n",
+        "    ).is_not_null()\n",
+        ").collect()"
+      ],
+      "metadata": {
+        "id": "J32rtjmjt4cB"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
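+        "To re-create the dataset from the source files, the snippet below can be used. It calls `sentence_transformer_ef`, which is defined in the \"Index creation\" section, so that cell must be run first."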
+      ],
+      "metadata": {
+        "id": "tolO_edV1Cme"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "bofip_dataframe = pl.scan_parquet(\n",
+        "    \"hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet\"\n",
+        ").with_columns(\n",
+        "    [\n",
+        "        (\n",
+        "            pl.lit(\"Bulletin officiel des finances publiques - impôts\").alias(\n",
+        "                \"title_main\"\n",
+        "            )\n",
+        "        ),\n",
+        "        (\n",
+        "            pl.col(\"debut_de_validite\")\n",
+        "            .str.strptime(pl.Date, format=\"%Y-%m-%d\")\n",
+        "            .dt.strftime(\"%Y-%m-%d 00:00:00\")\n",
+        "        ).alias(\"date_publication\"),\n",
+        "        (\n",
+        "            pl.col(\"contenu\")\n",
+        "            .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n",
+        "            .alias(\"hash\")\n",
+        "        )\n",
+        "    ]\n",
+        ").rename(\n",
+        "    {\n",
+        "        \"contenu\": \"text\",\n",
+        "        \"permalien\": \"url_sourcepage\",\n",
+        "        \"identifiant_juridique\": \"id_sub\",\n",
+        "    }\n",
+        ").select(\n",
+        "    [\n",
+        "        \"text\",\n",
+        "        \"title_main\",\n",
+        "        \"id_sub\",\n",
+        "        \"url_sourcepage\",\n",
+        "        \"date_publication\",\n",
+        "        \"hash\"\n",
+        "    ]\n",
+        ")\n",
+        "\n",
+        "books: List[str] = [\n",
+        "    \"hf://datasets/louisbrulenaudet/code-douanes/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/code-impots/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/code-impots-annexe-i/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/code-impots-annexe-ii/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/code-impots-annexe-iii/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/code-impots-annexe-iv/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/code-impositions-biens-services/data/train-00000-of-00001.parquet\",\n",
+        "    \"hf://datasets/louisbrulenaudet/livre-procedures-fiscales/data/train-00000-of-00001.parquet\"\n",
+        "]\n",
+        "\n",
+        "legi_dataframe = pl.concat(\n",
+        "    [\n",
+        "        pl.scan_parquet(\n",
+        "            book\n",
+        "        ) for book in books\n",
+        "    ]\n",
+        ").with_columns(\n",
+        "    [\n",
+        "        (\n",
+        "            pl.lit(\"https://www.legifrance.gouv.fr/codes/article_lc/\")\n",
+        "            .add(pl.col(\"id\"))\n",
+        "            .alias(\"url_sourcepage\")\n",
+        "        ),\n",
+        "        (\n",
+        "            pl.col(\"dateDebut\")\n",
+        "            .cast(pl.Int64)\n",
+        "            .map_elements(\n",
+        "                lambda x: datetime.fromtimestamp(x / 1000).strftime(\"%Y-%m-%d %H:%M:%S\"),\n",
+        "                return_dtype=pl.Utf8\n",
+        "            )\n",
+        "            .alias(\"date_publication\")\n",
+        "        ),\n",
+        "        (\n",
+        "            pl.col(\"texte\")\n",
+        "            .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n",
+        "            .alias(\"hash\")\n",
+        "        )\n",
+        "    ]\n",
+        ").rename(\n",
+        "    {\n",
+        "        \"texte\": \"text\",\n",
+        "        \"num\": \"id_sub\",\n",
+        "    }\n",
+        ").select(\n",
+        "    [\n",
+        "        \"text\",\n",
+        "        \"title_main\",\n",
+        "        \"id_sub\",\n",
+        "        \"url_sourcepage\",\n",
+        "        \"date_publication\",\n",
+        "        \"hash\"\n",
+        "    ]\n",
+        ")\n",
+        "\n",
+        "print(\"Starting embeddings production...\")\n",
+        "\n",
+        "dataframe = pl.concat(\n",
+        "    [\n",
+        "        bofip_dataframe,\n",
+        "        legi_dataframe\n",
+        "    ]\n",
+        ").filter(\n",
+        "    pl.col(\n",
+        "        \"text\"\n",
+        "    ).is_not_null()\n",
+        ").with_columns(\n",
+        "    pl.col(\"text\").map_elements(\n",
+        "        lambda x: sentence_transformer_ef(\n",
+        "            [x]\n",
+        "        )[0].tolist(),\n",
+        "        return_dtype=pl.List(pl.Float64)\n",
+        "    ).alias(\"lemone_pro_embeddings\")\n",
+        ").collect()"
+      ],
+      "metadata": {
+        "id": "KkOYEOeQ1Kcn"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Index creation\n",
+        "\n",
+        "This cell initializes a ChromaDB client with telemetry disabled, sets up a SentenceTransformer embedding model (using \"lemone-embed-pro\" with GPU acceleration if available), and creates or retrieves a collection named \"tax\" that will store the document embeddings using this model configuration."
+      ],
+      "metadata": {
+        "id": "PX2NybWKthV7"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "client = chromadb.Client(\n",
+        "    settings=Settings(anonymized_telemetry=False)\n",
+        ")\n",
+        "\n",
+        "sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(\n",
+        "    model_name=\"louisbrulenaudet/lemone-embed-pro\",\n",
+        "    device=\"cuda\" if is_available() else \"cpu\",\n",
+        "    trust_remote_code=True\n",
+        ")\n",
+        "\n",
+        "collection = client.get_or_create_collection(\n",
+        "    name=\"tax\",\n",
+        "    embedding_function=sentence_transformer_ef\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "T9OHkgaIt9Ki"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "This cell populates the ChromaDB collection with the precomputed embeddings from the \"lemone_pro_embeddings\" column, the corresponding text as documents, all remaining columns as metadata, and sequential string IDs, one per document.\n"
+      ],
+      "metadata": {
+        "id": "fGQHsmjCvuZW"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "collection.add(\n",
+        "    embeddings=dataframe[\"lemone_pro_embeddings\"].to_list(),\n",
+        "    documents=dataframe[\"text\"].to_list(),\n",
+        "    metadatas=dataframe.drop(\n",
+        "        [\n",
+        "            \"lemone_pro_embeddings\",\n",
+        "            \"text\"\n",
+        "        ]\n",
+        "    ).to_dicts(),\n",
+        "    ids=[\n",
+        "        str(i) for i in range(0, dataframe.shape[0])\n",
+        "    ]\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "VjC22bRauAk-"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Collection querying"
+      ],
+      "metadata": {
+        "id": "BVJWOhhW3vjW"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "collection.query(\n",
+        "    query_texts=[\"Les personnes morales de droit public ne sont pas assujetties à la taxe sur la valeur ajoutée pour l'activité de leurs services administratifs, sociaux, éducatifs, culturels et sportifs lorsque leur non-assujettissement n'entraîne pas de distorsions dans les conditions de la concurrence.\"],\n",
+        "    n_results=10,\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "-xdrJPCRuBQ4"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}