Commit
1 parent d35d957, commit 7a1388d
Showing 1 changed file with 397 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,397 @@ | ||
{ | ||
"nbformat": 4, | ||
"nbformat_minor": 0, | ||
"metadata": { | ||
"colab": { | ||
"provenance": [], | ||
"toc_visible": true, | ||
"authorship_tag": "ABX9TyOzdQuEjEqIX9Gcuv/hESlK", | ||
"include_colab_link": true | ||
}, | ||
"kernelspec": { | ||
"name": "python3", | ||
"display_name": "Python 3" | ||
}, | ||
"language_info": { | ||
"name": "python" | ||
} | ||
}, | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "view-in-github", | ||
"colab_type": "text" | ||
}, | ||
"source": [ | ||
"<a href=\"https://colab.research.google.com/github/louisbrulenaudet/lemone-embed/blob/main/notebooks/lemone_embed_notebook_tutorial.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"<img src=\"https://huggingface.co/louisbrulenaudet/lemone-embed-pro/resolve/main/assets/thumbnail.webp\" width=\"800px\">\n", | ||
"\n", | ||
"# Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation\n", | ||
"[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)\n", | ||
"\n", | ||
"<div class=\"not-prose bg-gradient-to-r from-gray-50-to-white text-gray-900 border\" style=\"border-radius: 8px; padding: 0.5rem 1rem;\">\n", | ||
" <p>This series is made up of 7 models, 3 basic models of different sizes trained on 1 epoch, 3 models trained on 2 epochs making up the Boost series and a Pro model with a non-Roberta architecture.</p>\n", | ||
"</div>\n", | ||
"\n", | ||
"This sentence transformers model, specifically designed for French taxation, has been fine-tuned on a dataset comprising 43 million tokens, integrating a blend of semi-synthetic and fully synthetic data generated by GPT-4 Turbo and Llama 3.1 70B, which have been further refined through evol-instruction tuning and manual curation.\n", | ||
"\n", | ||
"The model is tailored to meet the specific demands of information retrieval across large-scale tax-related corpora, supporting the implementation of production-ready Retrieval-Augmented Generation (RAG) applications. Its primary purpose is to enhance the efficiency and accuracy of legal processes in the taxation domain, with an emphasis on delivering consistent performance in real-world settings, while also contributing to advancements in legal natural language processing research.\n", | ||
"\n", | ||
"This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.\n", | ||
"\n", | ||
"If you use this code in your research, please use the following BibTeX entry.\n", | ||
"\n", | ||
"```BibTeX\n", | ||
"@misc{louisbrulenaudet2024,\n", | ||
" author = {Louis Brulé Naudet},\n", | ||
" title = {Lemone-Embed: A Series of Fine-Tuned Embedding Models for French Taxation},\n", | ||
" year = {2024}\n", | ||
" howpublished = {\\url{https://huggingface.co/datasets/louisbrulenaudet/lemone-embed-pro}},\n", | ||
"}\n", | ||
"```\n", | ||
"\n", | ||
"## Feedback\n", | ||
"\n", | ||
"If you have any feedback, please reach out at [[email protected]](mailto:[email protected])." | ||
], | ||
"metadata": { | ||
"id": "jus7eI3ptMg_" | ||
} | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# Collecting and installing dependencies" | ||
], | ||
"metadata": { | ||
"id": "X_nanITItWoB" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"!pip3 install chromadb polars datasets sentence-transformers huggingface_hub" | ||
], | ||
"metadata": { | ||
"id": "RBZN_of-tZBl" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# Importing packages\n", | ||
"\n", | ||
"## Core Database and Data Processing\n", | ||
"\n", | ||
"- ChromaDB: A specialized vector database that will be used to store and query our embeddings efficiently\n", | ||
"- Polars: A modern, high-performance DataFrame library chosen as an alternative to pandas for data manipulation tasks\n", | ||
"\n", | ||
"## Machine Learning Infrastructure\n", | ||
"\n", | ||
"- Datasets: Integration with Hugging Face's dataset library for streamlined data handling\n", | ||
"- PyTorch CUDA: Capability check for GPU acceleration to optimize model performance\n", | ||
"\n", | ||
"## Utility Components\n", | ||
"\n", | ||
"- Hashlib: Implementation of secure hash functions, likely used for creating unique identifiers for documents or embeddings\n", | ||
"- Datetime: Temporal data handling for tracking embedding creation and modifications\n", | ||
"- Type Hints: Comprehensive typing imports for enhanced code documentation and maintainability" | ||
], | ||
"metadata": { | ||
"id": "ujkbUgpZtcTn" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"import hashlib\n", | ||
"\n", | ||
"from datetime import datetime\n", | ||
"from typing import (\n", | ||
" IO,\n", | ||
" TYPE_CHECKING,\n", | ||
" Any,\n", | ||
" Dict,\n", | ||
" List,\n", | ||
" Type,\n", | ||
" Tuple,\n", | ||
" Union,\n", | ||
" Mapping,\n", | ||
" TypeVar,\n", | ||
" Callable,\n", | ||
" Optional,\n", | ||
" Sequence,\n", | ||
")\n", | ||
"\n", | ||
"import chromadb\n", | ||
"import polars as pl\n", | ||
"\n", | ||
"from chromadb.config import Settings\n", | ||
"from chromadb.utils import embedding_functions\n", | ||
"from datasets import Dataset\n", | ||
"from torch.cuda import is_available" | ||
], | ||
"metadata": { | ||
"id": "lWVZ_-Kytr-g" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# Datasets registration\n", | ||
"\n", | ||
"This cell loads a Parquet dataset from Hugging Face's repository (lemone-docs-embeded) using Polars' efficient lazy loading method (scan_parquet), filters out any rows with null values in the 'text' column to ensure data quality, and finally materializes the data into memory with .collect() for further processing." | ||
], | ||
"metadata": { | ||
"id": "JXimNwAltfOk" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"dataframe = pl.scan_parquet(\n", | ||
" \"hf://datasets/louisbrulenaudet/lemone-docs-embeded/data/train-00000-of-00001.parquet\"\n", | ||
").filter(\n", | ||
" pl.col(\n", | ||
" \"text\"\n", | ||
" ).is_not_null()\n", | ||
").collect()" | ||
], | ||
"metadata": { | ||
"id": "J32rtjmjt4cB" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"If you want to re-create your dataset from the source, here is a code snippet that will help you:" | ||
], | ||
"metadata": { | ||
"id": "tolO_edV1Cme" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"bofip_dataframe = pl.scan_parquet(\n", | ||
" \"hf://datasets/louisbrulenaudet/bofip/data/train-00000-of-00001.parquet\"\n", | ||
").with_columns(\n", | ||
" [\n", | ||
" (\n", | ||
" pl.lit(\"Bulletin officiel des finances publiques - impôts\").alias(\n", | ||
" \"title_main\"\n", | ||
" )\n", | ||
" ),\n", | ||
" (\n", | ||
" pl.col(\"debut_de_validite\")\n", | ||
" .str.strptime(pl.Date, format=\"%Y-%m-%d\")\n", | ||
" .dt.strftime(\"%Y-%m-%d 00:00:00\")\n", | ||
" ).alias(\"date_publication\"),\n", | ||
" (\n", | ||
" pl.col(\"contenu\")\n", | ||
" .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n", | ||
" .alias(\"hash\")\n", | ||
" )\n", | ||
" ]\n", | ||
").rename(\n", | ||
" {\n", | ||
" \"contenu\": \"text\",\n", | ||
" \"permalien\": \"url_sourcepage\",\n", | ||
" \"identifiant_juridique\": \"id_sub\",\n", | ||
" }\n", | ||
").select(\n", | ||
" [\n", | ||
" \"text\",\n", | ||
" \"title_main\",\n", | ||
" \"id_sub\",\n", | ||
" \"url_sourcepage\",\n", | ||
" \"date_publication\",\n", | ||
" \"hash\"\n", | ||
" ]\n", | ||
")\n", | ||
"\n", | ||
"books: List[str] = [\n", | ||
" \"hf://datasets/louisbrulenaudet/code-douanes/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/code-impots/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/code-impots-annexe-i/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/code-impots-annexe-ii/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/code-impots-annexe-iii/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/code-impots-annexe-iv/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/code-impositions-biens-services/data/train-00000-of-00001.parquet\",\n", | ||
" \"hf://datasets/louisbrulenaudet/livre-procedures-fiscales/data/train-00000-of-00001.parquet\"\n", | ||
"]\n", | ||
"\n", | ||
"legi_dataframe = pl.concat(\n", | ||
" [\n", | ||
" pl.scan_parquet(\n", | ||
" book\n", | ||
" ) for book in books\n", | ||
" ]\n", | ||
").with_columns(\n", | ||
" [\n", | ||
" (\n", | ||
" pl.lit(\"https://www.legifrance.gouv.fr/codes/article_lc/\")\n", | ||
" .add(pl.col(\"id\"))\n", | ||
" .alias(\"url_sourcepage\")\n", | ||
" ),\n", | ||
" (\n", | ||
" pl.col(\"dateDebut\")\n", | ||
" .cast(pl.Int64)\n", | ||
" .map_elements(\n", | ||
" lambda x: datetime.fromtimestamp(x / 1000).strftime(\"%Y-%m-%d %H:%M:%S\"),\n", | ||
" return_dtype=pl.Utf8\n", | ||
" )\n", | ||
" .alias(\"date_publication\")\n", | ||
" ),\n", | ||
" (\n", | ||
" pl.col(\"texte\")\n", | ||
" .map_elements(lambda x: hashlib.sha256(str(x).encode()).hexdigest(), return_dtype=pl.Utf8)\n", | ||
" .alias(\"hash\")\n", | ||
" )\n", | ||
" ]\n", | ||
").rename(\n", | ||
" {\n", | ||
" \"texte\": \"text\",\n", | ||
" \"num\": \"id_sub\",\n", | ||
" }\n", | ||
").select(\n", | ||
" [\n", | ||
" \"text\",\n", | ||
" \"title_main\",\n", | ||
" \"id_sub\",\n", | ||
" \"url_sourcepage\",\n", | ||
" \"date_publication\",\n", | ||
" \"hash\"\n", | ||
" ]\n", | ||
")\n", | ||
"\n", | ||
"print(\"Starting embeddings production...\")\n", | ||
"\n", | ||
"dataframe = pl.concat(\n", | ||
" [\n", | ||
" bofip_dataframe,\n", | ||
" legi_dataframe\n", | ||
" ]\n", | ||
").filter(\n", | ||
" pl.col(\n", | ||
" \"text\"\n", | ||
" ).is_not_null()\n", | ||
").with_columns(\n", | ||
" pl.col(\"text\").map_elements(\n", | ||
" lambda x: sentence_transformer_ef(\n", | ||
" [x]\n", | ||
" )[0].tolist(),\n", | ||
" return_dtype=pl.List(pl.Float64)\n", | ||
" ).alias(\"lemone_pro_embeddings\")\n", | ||
").collect()" | ||
], | ||
"metadata": { | ||
"id": "KkOYEOeQ1Kcn" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# Index creation\n", | ||
"\n", | ||
"This cell initializes a ChromaDB client with telemetry disabled, sets up a SentenceTransformer embedding model (using \"lemone-embed-pro\" with GPU acceleration if available), and creates or retrieves a collection named \"tax\" that will store the document embeddings using this model configuration." | ||
], | ||
"metadata": { | ||
"id": "PX2NybWKthV7" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"client = chromadb.Client(\n", | ||
" settings=Settings(anonymized_telemetry=False)\n", | ||
")\n", | ||
"\n", | ||
"sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(\n", | ||
" model_name=\"louisbrulenaudet/lemone-embed-pro\",\n", | ||
" device=\"cuda\" if is_available() else \"cpu\",\n", | ||
" trust_remote_code=True\n", | ||
")\n", | ||
"\n", | ||
"collection = client.get_or_create_collection(\n", | ||
" name=\"tax\",\n", | ||
" embedding_function=sentence_transformer_ef\n", | ||
")" | ||
], | ||
"metadata": { | ||
"id": "T9OHkgaIt9Ki" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"Populates the ChromaDB collection by adding document embeddings from the \"lemone_pro_embeddings\" column, their corresponding text content, all remaining columns as metadata, and automatically generated sequential IDs for each document.\n" | ||
], | ||
"metadata": { | ||
"id": "fGQHsmjCvuZW" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"collection.add(\n", | ||
" embeddings=dataframe[\"lemone_pro_embeddings\"].to_list(),\n", | ||
" documents=dataframe[\"text\"].to_list(),\n", | ||
" metadatas=dataframe.remove_columns(\n", | ||
" [\n", | ||
" \"lemone_pro_embeddings\",\n", | ||
" \"text\"\n", | ||
" ]\n", | ||
" ).to_list(),\n", | ||
" ids=[\n", | ||
" str(i) for i in range(0, dataframe.shape[0])\n", | ||
" ]\n", | ||
")" | ||
], | ||
"metadata": { | ||
"id": "VjC22bRauAk-" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"source": [ | ||
"# Collection querying" | ||
], | ||
"metadata": { | ||
"id": "BVJWOhhW3vjW" | ||
} | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"source": [ | ||
"collection.query(\n", | ||
" query_texts=[\"Les personnes morales de droit public ne sont pas assujetties à la taxe sur la valeur ajoutée pour l'activité de leurs services administratifs, sociaux, éducatifs, culturels et sportifs lorsque leur non-assujettissement n'entraîne pas de distorsions dans les conditions de la concurrence.\"],\n", | ||
" n_results=10,\n", | ||
")" | ||
], | ||
"metadata": { | ||
"id": "-xdrJPCRuBQ4" | ||
}, | ||
"execution_count": null, | ||
"outputs": [] | ||
} | ||
] | ||
} |
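
The final cell of the notebook leaves the raw return value of `collection.query(...)` as the cell output. As a minimal sketch, assuming the usual Chroma result layout (a dict whose `ids`, `documents`, `metadatas` and `distances` entries each hold one inner list per query text), the result could be flattened into ranked rows for easier inspection. The helper name `flatten_query_result` and the sample values are hypothetical and are not part of the notebook.

```python
from typing import Any, Dict, List


def flatten_query_result(
    result: Dict[str, Any], query_index: int = 0
) -> List[Dict[str, Any]]:
    """Turn one query's slice of a Chroma-style result dict into ranked rows.

    Assumes each of the four keys maps to a list of lists, with one inner
    list per query text, and that hits arrive already ordered by distance.
    """
    ids = result["ids"][query_index]
    documents = result["documents"][query_index]
    metadatas = result["metadatas"][query_index]
    distances = result["distances"][query_index]

    return [
        {
            "rank": rank,
            "id": doc_id,
            "distance": distance,
            "text": text,
            "metadata": metadata,
        }
        for rank, (doc_id, text, metadata, distance) in enumerate(
            zip(ids, documents, metadatas, distances), start=1
        )
    ]


# Hand-written stand-in for the dict returned by collection.query(...).
# All values are placeholders, not real retrieval output.
sample_result = {
    "ids": [["0", "1"]],
    "documents": [["first document text", "second document text"]],
    "metadatas": [[
        {"title_main": "placeholder title A"},
        {"title_main": "placeholder title B"},
    ]],
    "distances": [[0.12, 0.34]],
}

for row in flatten_query_result(sample_result):
    print(row["rank"], row["id"], row["distance"], row["metadata"]["title_main"])
```

In the notebook itself, the same helper would be applied to the value returned by the last cell, for example `flatten_query_result(collection.query(query_texts=[...], n_results=10))`.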