From 126cbff2f428c5c7852e013bfa0f563066770432 Mon Sep 17 00:00:00 2001
From: davidmezzetti <561939+davidmezzetti@users.noreply.github.com>
Date: Fri, 14 Apr 2023 16:20:22 -0400
Subject: [PATCH] Add example notebook

---
 README.md                                 |   1 +
 examples/01_Introducing_txtinstruct.ipynb | 568 ++++++++++++++++++++++
 2 files changed, 569 insertions(+)
 create mode 100644 examples/01_Introducing_txtinstruct.ipynb

diff --git a/README.md b/README.md
index 78e5085..b2f81d6 100644
--- a/README.md
+++ b/README.md
@@ -37,3 +37,4 @@ The following example notebooks show how to build models with txtinstruct.
 | Notebook | Description | |
 |:----------|:-------------|------:|
+| [Introducing txtinstruct](https://github.com/neuml/txtinstruct/blob/master/examples/01_Introducing_txtinstruct.ipynb) | Build instruction-tuned datasets and models | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/neuml/txtinstruct/blob/master/examples/01_Introducing_txtinstruct.ipynb) |
diff --git a/examples/01_Introducing_txtinstruct.ipynb b/examples/01_Introducing_txtinstruct.ipynb
new file mode 100644
index 0000000..868f483
--- /dev/null
+++ b/examples/01_Introducing_txtinstruct.ipynb
@@ -0,0 +1,568 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU",
+    "gpuClass": "standard"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Introducing txtinstruct\n",
+        "\n",
+        "[txtinstruct](https://github.com/neuml/txtinstruct) is a framework for training instruction-tuned models.\n",
+        "\n",
+        "The objective of this project is to support open data, open models and integration with your own data. One of the biggest problems today is the lack of licensing clarity with instruction-following datasets and large language models. txtinstruct makes it easy to build your own instruction-following datasets and use those datasets to train instruction-tuned models.\n",
+        "\n",
+        "This notebook gives a brief introduction to txtinstruct.\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "Qrap-A-3HcWE"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Install dependencies\n",
+        "\n",
+        "Install `txtinstruct` and all dependencies."
+      ],
+      "metadata": {
+        "id": "JMCfsNpyLvRL"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 14,
+      "metadata": {
+        "id": "_4BUMf38HV_v"
+      },
+      "outputs": [],
+      "source": [
+        "%%capture\n",
+        "!pip install git+https://github.com/neuml/txtai git+https://github.com/neuml/txtinstruct"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Architecture Overview\n",
+        "\n",
+        "txtinstruct consists of three components to help train instruction-following models.\n",
+        "\n",
+        "The first component is statement generation. Statement generation models create a statement from a context. Depending on the model, this statement can be a question or a request to describe a concept.\n",
+        "\n",
+        "The next component is a knowledge source for pulling context. An example knowledge source used in this notebook is a [txtai embeddings index of the full Wikipedia dataset](https://huggingface.co/NeuML/txtai-wikipedia).\n",
+        "\n",
+        "The last component is a large language model (LLM) for translating source statements into target statements. If the statement is a question, the LLM answers it. If it's a descriptive statement, the LLM builds a description. In both cases, a prompt is used in combination with the knowledge source context to generate the target text."
+      ],
+      "metadata": {
+        "id": "50GPIQMBH6SV"
+      }
+    },
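+    {
+      "cell_type": "markdown",
+      "source": [
+        "To make the flow concrete, here is a minimal sketch of how a statement and knowledge source context combine into a prompt for the LLM. The `build_prompt` helper is a hypothetical illustration modeled on the prompt used later in this notebook, not txtinstruct's exact internals."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Illustrative sketch only: build_prompt is a hypothetical helper,\n",
+        "# modeled on the prompt used later in this notebook.\n",
+        "def build_prompt(statement, context):\n",
+        "    return (\n",
+        "        \"Answer the following question using only the context below. \"\n",
+        "        \"Say 'I don't have data on that' when the question can't be answered.\\n\"\n",
+        "        f\"Question: {statement}\\n\"\n",
+        "        f\"Context: {context}\"\n",
+        "    )\n",
+        "\n",
+        "print(build_prompt(\n",
+        "    \"Tell me about txtai\",\n",
+        "    \"txtai is an open-source platform for semantic search and workflows powered by language models.\"\n",
+        "))"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },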
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Statement Generation Model\n",
+        "\n",
+        "Let's walk through an example of how to use txtinstruct to build a statement generation model. This example builds a question generation model using the [SQuAD dataset](https://huggingface.co/datasets/squad)."
+      ],
+      "metadata": {
+        "id": "Saq8S59yJoUd"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from datasets import load_dataset\n",
+        "\n",
+        "from txtinstruct.models import StatementGenerator\n",
+        "\n",
+        "# Load SQuAD dataset\n",
+        "dataset = load_dataset(\"squad\", split=\"train\")\n",
+        "\n",
+        "# Train model\n",
+        "generator = StatementGenerator()\n",
+        "model, tokenizer = generator(\n",
+        "    \"google/flan-t5-small\",\n",
+        "    dataset,\n",
+        "    \"sequence-sequence\",\n",
+        "    learning_rate=1e-3,\n",
+        "    per_device_train_batch_size=16,\n",
+        "    gradient_accumulation_steps=128 // 16,\n",
+        "    num_train_epochs=0.1,\n",
+        "    logging_steps=100,\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "7uQaAdtmH3kP"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Note that we only trained the model for a fraction of an epoch for expediency. Under normal circumstances, `num_train_epochs` would be at least 3.\n",
+        "\n",
+        "If you've trained models either with txtai or Hugging Face's trainer, you'll recognize many of the options. [See this page](https://neuml.github.io/txtai/pipeline/train/trainer/) to learn more about the available configuration options."
+      ],
+      "metadata": {
+        "id": "6VMKvnn1K3Y5"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from txtai.pipeline import Sequences\n",
+        "\n",
+        "# Load statement generation model\n",
+        "statements = Sequences((model, tokenizer))\n",
+        "\n",
+        "# Run example prompt\n",
+        "statements(\"\"\"Generate a question using the context below.\n",
+        "### Context:\n",
+        "txtai is an open-source platform for semantic search and workflows powered by language models.\"\"\")"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 36
+        },
+        "id": "-80I7KuZL6vx",
+        "outputId": "2c9aa8d3-dd3c-46a7-dbf0-7d69281d7abd"
+      },
+      "execution_count": 3,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "'What is the name of the open-source platform for semantic search and workflows?'"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "string"
+            }
+          },
+          "metadata": {},
+          "execution_count": 3
+        }
+      ]
+    },
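+    {
+      "cell_type": "markdown",
+      "source": [
+        "Statement generation also works over multiple contexts at once. A minimal sketch with made-up contexts; this assumes `Sequences` accepts a list of inputs, as txtai pipelines generally do."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Assumes txtai pipelines accept a list of inputs (standard pipeline batching).\n",
+        "# The contexts below are made-up examples.\n",
+        "contexts = [\n",
+        "    \"Python is a high-level, general-purpose programming language.\",\n",
+        "    \"The Linux kernel was first released in 1991.\"\n",
+        "]\n",
+        "\n",
+        "prompts = [\"Generate a question using the context below.\\n### Context:\\n\" + context for context in contexts]\n",
+        "\n",
+        "statements(prompts)"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },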
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Given the context, a relevant question is generated in each case. Next we'll discuss how this helps build an instruction-tuning dataset."
+      ],
+      "metadata": {
+        "id": "0uXzi7NmTIXm"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Build a dataset for Instruction-Tuning\n",
+        "\n",
+        "Now that we have a statement generation model, let's build an instruction-tuning dataset.\n",
+        "\n",
+        "We'll use the txtai Wikipedia embeddings index as the knowledge source and `google/flan-t5-base` as our teacher model."
+      ],
+      "metadata": {
+        "id": "F4DN3HNwMY5N"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from txtai.embeddings import Embeddings\n",
+        "from txtinstruct.data import DatasetBuilder\n",
+        "\n",
+        "# Load embeddings\n",
+        "embeddings = Embeddings()\n",
+        "embeddings.load(provider=\"huggingface-hub\", container=\"neuml/txtai-wikipedia\")\n",
+        "\n",
+        "# Query templates\n",
+        "templates = [\n",
+        "    \"Tell me about {text}\",\n",
+        "    \"Give an explanation on {text}\",\n",
+        "    \"Provide a quick summary on {text}\",\n",
+        "    \"Explain {text} in simple terms\",\n",
+        "    \"Describe {text}\"\n",
+        "]\n",
+        "\n",
+        "# Build dataset\n",
+        "builder = DatasetBuilder(Sequences(\"google/flan-t5-base\"), statements, templates)\n",
+        "builder(\n",
+        "    embeddings.search(\"SELECT id, text FROM txtai WHERE similar('machine learning') AND percentile >= 0.99 LIMIT 5\"),\n",
+        "    5,\n",
+        "    \"data.json\"\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "_bMmVywcMSKo"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
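+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before viewing the raw file, here's a quick sanity check of what was written. A minimal sketch that assumes only the `context`/`statements` structure shown in the next cell's output."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import json\n",
+        "\n",
+        "# Quick sanity check: number of contexts and statements generated\n",
+        "with open(\"data.json\", encoding=\"utf-8\") as f:\n",
+        "    dataset = json.load(f)\n",
+        "\n",
+        "print(len(dataset), \"contexts\")\n",
+        "print(sum(len(x[\"statements\"]) for x in dataset), \"statements\")"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },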
+    {
+      "cell_type": "code",
+      "source": [
+        "!cat data.json"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "MsCw47gmNVzv",
+        "outputId": "5ae6ea81-682c-439d-a6a9-6d9ff4dba195"
+      },
+      "execution_count": 7,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "[\n",
+            "  {\n",
+            "    \"context\": \"Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks. \\nA subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning. Some implementations of machine learning use data and neural networks in a way that mimics the working of a biological brain. In its application across business problems, machine learning is also referred to as predictive analytics.\",\n",
+            "    \"statements\": [\n",
+            "      {\n",
+            "        \"source\": \"What is a subset of machine learning?\",\n",
+            "        \"target\": \"computational statistics\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"Tell me about Machine learning\",\n",
+            "        \"target\": \"Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn'\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"What is the study of the attacks on machine learning algorithms?\",\n",
+            "        \"target\": \"I don't have data on that\"\n",
+            "      }\n",
+            "    ]\n",
+            "  },\n",
+            "  {\n",
+            "    \"context\": \"In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.\",\n",
+            "    \"statements\": [\n",
+            "      {\n",
+            "        \"source\": \"What is a type of linear classifier?\",\n",
+            "        \"target\": \"binary classifier\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"Tell me about Perceptron\",\n",
+            "        \"target\": \"an algorithm for supervised learning of binary classifiers\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"Tell me about Machine learning\",\n",
+            "        \"target\": \"I don't have data on that\"\n",
+            "      }\n",
+            "    ]\n",
+            "  },\n",
+            "  {\n",
+            "    \"context\": \"Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A recent survey exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.\",\n",
+            "    \"statements\": [\n",
+            "      {\n",
+            "        \"source\": \"What is the study of the attacks on machine learning algorithms?\",\n",
+            "        \"target\": \"Adversarial machine learning\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"Provide a quick summary on Adversarial machine learning\",\n",
+            "        \"target\": \"Adversarial machine learning\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"What is a subset of machine learning?\",\n",
+            "        \"target\": \"I don't have data on that\"\n",
+            "      }\n",
+            "    ]\n",
+            "  },\n",
+            "  {\n",
+            "    \"context\": \"In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided in multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation and test sets.\",\n",
+            "    \"statements\": [\n",
+            "      {\n",
+            "        \"source\": \"What is the main purpose of the data set?\",\n",
+            "        \"target\": \"build the model\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"Give an explanation on Training, validation, and test data sets\",\n",
+            "        \"target\": \"Training, validation and test data sets are used to build a mathematical model from input data.\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"What is a subset of machine learning?\",\n",
+            "        \"target\": \"I don't have data on that\"\n",
+            "      }\n",
+            "    ]\n",
+            "  },\n",
+            "  {\n",
+            "    \"context\": \"Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.\",\n",
+            "    \"statements\": [\n",
+            "      {\n",
+            "        \"source\": \"What is a rule-based machine learning method for discovering interesting relations between variables in large databases?\",\n",
+            "        \"target\": \"Association rule learning\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"Give an explanation on Association rule learning\",\n",
+            "        \"target\": \"Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.\"\n",
+            "      },\n",
+            "      {\n",
+            "        \"source\": \"What is a type of linear classifier?\",\n",
+            "        \"target\": \"I don't have data on that\"\n",
+            "      }\n",
+            "    ]\n",
+            "  }\n",
+            "]"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Let's take a look at the generated data. The dataset consists of a context and an associated list of statements. Each statement is a source-target pair.\n",
+        "\n",
+        "Note that there are also unanswerable questions. It's important for the model not to fabricate an answer when there is no answer in the context. Making up an answer in that situation is often called a \"model hallucination\"."
+      ],
+      "metadata": {
+        "id": "UXsv5laYRT8m"
+      }
+    },
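+    {
+      "cell_type": "markdown",
+      "source": [
+        "The unanswerable placeholder targets are easy to count programmatically. A minimal sketch, assuming only the `data.json` structure shown above."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import json\n",
+        "\n",
+        "with open(\"data.json\", encoding=\"utf-8\") as f:\n",
+        "    dataset = json.load(f)\n",
+        "\n",
+        "# Count statements whose target is the unanswerable placeholder\n",
+        "unanswerable = sum(\n",
+        "    statement[\"target\"] == \"I don't have data on that\"\n",
+        "    for row in dataset\n",
+        "    for statement in row[\"statements\"]\n",
+        ")\n",
+        "\n",
+        "print(unanswerable, \"unanswerable statements\")"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },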
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Train an instruction-tuned model\n",
+        "\n",
+        "Now for the part we've been waiting for: instruction-tuning a model."
+      ],
+      "metadata": {
+        "id": "MNTXskERRYAC"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import json\n",
+        "\n",
+        "from txtinstruct.models import Instructor\n",
+        "\n",
+        "# Read in generated dataset\n",
+        "with open(\"data.json\", encoding=\"utf-8\") as f:\n",
+        "    data = json.load(f)\n",
+        "\n",
+        "# Instruction-tune model\n",
+        "instructor = Instructor()\n",
+        "model, tokenizer = instructor(\n",
+        "    \"google/flan-t5-small\",\n",
+        "    data,\n",
+        "    \"sequence-sequence\",\n",
+        "    learning_rate=1e-3,\n",
+        "    per_device_train_batch_size=8,\n",
+        "    gradient_accumulation_steps=128 // 8,\n",
+        "    num_train_epochs=3,\n",
+        "    logging_steps=100,\n",
+        ")"
+      ],
+      "metadata": {
+        "id": "1-d4gkIdNA5a"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
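+    {
+      "cell_type": "markdown",
+      "source": [
+        "If you want to keep the instruction-tuned model, the returned `model` and `tokenizer` appear to be standard Hugging Face objects (they're passed to `Sequences((model, tokenizer))` below), so `save_pretrained` should work. A minimal sketch with an assumed output path."
+      ],
+      "metadata": {}
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Assumes model and tokenizer are standard Hugging Face objects,\n",
+        "# as the Sequences((model, tokenizer)) usage below suggests.\n",
+        "# \"txtinstruct-flan-t5-small\" is an arbitrary output path.\n",
+        "model.save_pretrained(\"txtinstruct-flan-t5-small\")\n",
+        "tokenizer.save_pretrained(\"txtinstruct-flan-t5-small\")"
+      ],
+      "metadata": {},
+      "execution_count": null,
+      "outputs": []
+    },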
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Before testing the instruction-tuned model, let's run a baseline test to see how `google/flan-t5-small` behaves without any fine-tuning.\n",
+        "\n",
+        "This next section runs a prompt with a question and context."
+      ],
+      "metadata": {
+        "id": "kpusCjMtSWTa"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from txtai.pipeline import Extractor\n",
+        "\n",
+        "def prompt(text):\n",
+        "    template = \"Answer the following question using only the context below. Give a detailed answer. \"\n",
+        "    template += \"Say 'I don't have data on that' when the question can't be answered.\\n\"\n",
+        "    template += f\"Question: {text}\\n\"\n",
+        "    template += \"Context: \"\n",
+        "\n",
+        "    return template\n",
+        "\n",
+        "extractor = Extractor(\n",
+        "    embeddings,\n",
+        "    Sequences(\"google/flan-t5-small\")\n",
+        ")\n",
+        "\n",
+        "extractor([{\n",
+        "    \"query\": \"Tell me about Linux\",\n",
+        "    \"question\": prompt(\"Tell me about Linux\")\n",
+        "}])"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "F65OghcBNJ2m",
+        "outputId": "ee9c2e0a-9a91-4dc6-f2f2-eed5a91b81b7"
+      },
+      "execution_count": 10,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "[{'answer': 'Linux'}]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 10
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Not a very good answer given the question. Let's try another question that's unanswerable."
+      ],
+      "metadata": {
+        "id": "sHqKgjuLShxH"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "extractor([{\n",
+        "    \"query\": \"What is the weather in Phoenix today?\",\n",
+        "    \"question\": prompt(\"What is the weather in Phoenix today?\")\n",
+        "}])"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "PfUWol27NMsH",
+        "outputId": "b524992f-9751-4569-bdc4-19bf7e5ffbf0"
+      },
+      "execution_count": 11,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "[{'answer': '0.00%'}]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 11
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "See how the model still tries to give an answer even though there is no answer.\n",
+        "\n",
+        "Now let's try the same two questions with our instruction-tuned model."
+      ],
+      "metadata": {
+        "id": "OYucMf9vSmxp"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "extractor = Extractor(\n",
+        "    embeddings,\n",
+        "    Sequences((model, tokenizer))\n",
+        ")\n",
+        "\n",
+        "extractor([{\n",
+        "    \"query\": \"Tell me about Linux\",\n",
+        "    \"question\": prompt(\"Tell me about Linux\")\n",
+        "}])"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "uNlQu7lWNOmf",
+        "outputId": "36b3f3aa-a01b-44a7-899f-be4e89b6e7f7"
+      },
+      "execution_count": 12,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "[{'answer': 'Linux (or ) is a family of open-source Unix-like operating systems based on the Linux kernel, an operating system kernel first released on September 17, 1991, by Linus Torvalds. Linux is typically packaged as a Linux distribution, which includes the kernel and supporting system software and libraries, many of which are provided by the GNU Project.'}]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 12
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "extractor([{\n",
+        "    \"query\": \"What is the weather in Phoenix today?\",\n",
+        "    \"question\": prompt(\"What is the weather in Phoenix today?\")\n",
+        "}])"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "dtc-LWn7NRxf",
+        "outputId": "90ad1160-6ab8-41bd-a501-fb79d2231721"
+      },
+      "execution_count": 13,
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "[{'answer': \"I don't have data on that\"}]"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 13
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Much better indeed. Keep in mind this was only trained with ~15 samples using a relatively small teacher model (`google/flan-t5-base`) for demonstration purposes."
+      ],
+      "metadata": {
+        "id": "O7J2F7OdSucS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Wrapping up\n",
+        "\n",
+        "This notebook introduced txtinstruct, a framework for training instruction-tuned models. This project strives to be an easy-to-use way to build your own instruction-following models with licensing clarity. Stay tuned for more!"
+      ],
+      "metadata": {
+        "id": "3JdW-J0eT2B0"
+      }
+    }
+  ]
+}
\ No newline at end of file