Create a .env file in the root of the repo and set the following environment variables:
OPENAI_API_KEY=<your openai api key>
GROQ_API_KEY=<your groq api key> # required if you are running the prompt evals, as the evals use a Mixtral model from the Groq API to evaluate the prompts against OpenAI's gpt-3.5-turbo model
The app reads the API keys from this .env file when calling the OpenAI and Groq APIs; you do not need to export them manually.
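For reference, this is roughly how the keys get picked up at startup (a minimal sketch assuming the python-dotenv package; the app's exact loading code may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the repo root into the process environment
openai_api_key = os.environ["OPENAI_API_KEY"]
groq_api_key = os.environ.get("GROQ_API_KEY")  # only needed for the prompt evals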
The easiest way to interact with the chat assistant is via the Streamlit app. To run the app without worrying about dependencies, you can use Docker.
To build and run the app in a Docker container, we use Docker Compose.
From the root of the repo run the following command:
docker compose up --build
The command above will build the container and run the Streamlit app. The app will be available at http://localhost:8501 in your browser.
Check the terminal for logs and any errors.
NOTE: It will take time to build the container and for the Streamlit app to start. Because the embedding model is large, loading the model, creating the index, and starting the app all take a while. (To save time and resources, the vector database is not hosted in the cloud; it is created on the fly when the app starts. This is not best practice and should be avoided in production.)
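As a rough illustration of why the first start is slow, the heavy objects are typically built once at app start and cached for later reruns; a minimal sketch (the embedding model name, data folder, and use of llama-index here are assumptions, not the app's exact code):

import streamlit as st
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

@st.cache_resource  # build the index once per container, not on every Streamlit rerun
def load_index():
    # hypothetical embedding model and data folder
    Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")
    documents = SimpleDirectoryReader("data").load_data()
    return VectorStoreIndex.from_documents(documents)

index = load_index()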
First, install the dependencies:
$ poetry install
$ poetry shell #activate the virtual environment
$ python chat_assistant.py
$ streamlit run app.py
Please refer to the escrow_data_retriever.ipynb notebook for the data preparation steps. The notebook explains how the data was retrieved from the Escrow 1024.17 website and how it was preprocessed to create the dataset for RAG indexing.
Final RAG evaluation results
I ran experiments on direct RAG along with advanced retrieval methods such as Sentence Window and Auto-Merging Retrieval. The experiments also varied parameters to find the best configuration for retrieving Escrow 1024.17 documents. The evaluations were run on these queries. The results are as follows:
The best configuration (good balance of answer and context relevance, and groundedness) was found to be:
- Sentence retrieval window: 1
- Chunk size: 128
- Effective retrieved context length (node): 384 characters
Note: although not the cheapest configuration, it was the most effective in terms of groundedness, answer relevance, and context relevance.
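As a rough illustration of that configuration, here is a minimal sentence-window retrieval setup (a sketch assuming llama-index, which provides the Sentence Window and Auto-Merging retrievers; the data folder and top-k value are hypothetical, and the chunk-size setting used in the experiments is not shown):

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# split documents into sentences, attaching a 1-sentence window of surrounding context
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=1,  # sentence retrieval window of 1 (the best configuration above)
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

documents = SimpleDirectoryReader("data").load_data()  # hypothetical data folder
nodes = node_parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)

# at query time, replace each retrieved sentence with its surrounding window
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)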
The prompt template is structured as follows (drawing on the papers listed below):
# Role
(Role-play prompting is an effective strategy where we assign the model a specific role to play during the interaction. This helps the model "immerse" itself in the role and provide more accurate and relevant answers) ref 1
# Task
(A direct description of what we want the model to do. One technique that works well is chain-of-thought prompting ref 2 to guide the model through the task)
# Specifics
(The most important notes regarding the task. Integrating emotional stimuli ref 3 has been shown to increase response quality and accuracy)
# Context
(The environment in which the task is to be performed. Fairness-guided Few-shot Prompting ref 4 has shown that providing context helps the model understand the task better)
# Examples
(Giving a few Q/A pairs of example questions and answers helps the model understand the task better, and is a good practice to follow. Rethinking the Role of Demonstrations ref 6 explains this in detail)
# Notes
(Additional and repeated notes that can help the model do the task better. The Lost in the Middle paper ref 7 shows that LLMs remember the start and end of the context better than the middle, so it is important to briefly repeat the task and the context in the notes section. Though newer models are better at finding a needle in a haystack, this is still a good practice to follow. A sketch of how these sections come together follows the reference list below.)
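For illustration, the sections above might be assembled into a single template string along these lines (a sketch only; the section names come from this README, but the wording and the {context}/{question} placeholders are hypothetical):

PROMPT_TEMPLATE = """
# Role
You are an expert assistant for questions about the Escrow 1024.17 regulation.

# Task
Answer the user's question step by step, using only the provided context.

# Specifics
Getting this right really matters to the user, so be precise and point to the context you relied on.

# Context
{context}

# Examples
Q: <example question>
A: <example answer>

# Notes
Remember: answer only from the context above, and say so if the context is insufficient.

Question: {question}
"""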
- Better Zero-Shot Reasoning with Role-Play Prompting
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models Understand and Can be Enhanced by Emotional Stimuli
- Fairness-guided Few-shot Prompting for Large Language Models
- Language Models are Few-Shot Learners
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
- Lost in the Middle: How Language Models Use Long Contexts
We are using promptfoo to evaluate our prompts. To run the evaluation, first install promptfoo:
$ bun add promptfoo # or npm install -g promptfoo
Then run the eval from the root folder of this repo as follows:
$ cd prompt_eval_cloud
$ promptfoo eval
Make sure GROQ_API_KEY and OPENAI_API_KEY are set in the .env file, as this eval uses models from the OpenAI API (gpt-3.5-turbo) and the Groq API (Mixtral).
Note: To save time and resources, the evaluation is not thorough and only a few prompts are evaluated.
To get a detailed view of the evaluation, run the following command:
$ promptfoo view -y
A new tab with the following view will open in your browser:
Refer to the following notebook to see how the dataset for the model was generated.
Refer to the following notebook to see how the model was fine-tuned on the Escrow 1024.17 documents.
(Note: this is a Colab notebook, which made it easy to run the experiments on Google Cloud with powerful GPUs.)
The fine-tuned Gemma model is available on Hugging Face.
Download the model and place it in the fine_tuned_model folder in the repo, then from the root of the repo run the following command to create the Ollama model:
$ ollama create escrow_gemma -f ./ModelfileGemma
To interact with the model, run the following command:
$ ollama run escrow_gemma:latest
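You can also query the created model programmatically; a minimal sketch, assuming the ollama Python client package is installed and the escrow_gemma model was created with the command above (the question text is just an example):

import ollama

response = ollama.chat(
    model="escrow_gemma:latest",
    messages=[{"role": "user", "content": "What does Escrow 1024.17 cover?"}],
)
print(response["message"]["content"])  # the assistant's reply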
NOTE: The fine-tuning dataset consisted only of positive Q/A pairs, with no "no relevant context" Q/A examples; to get better performance we need to include negative Q/A pairs as well, along with some chat data. This will help the model understand the context better and provide more accurate responses, as intended for this application.
Please check out the link to see the evaluation of the fine-tuned model. The model was evaluated on these tests and compared to the open-source models 'llama3-8b' and 'gemma-8b'.
In the evaluation, the fine-tuned model is named 'escrow_gemma:latest'.
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'pyrotank41/gemma-7b-it-escrow-merged-gguf',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.4.2"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

# send request
predictor.predict({
    "inputs": "What is the escrow 1024.17 document?",
})
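When you are done testing, you will likely want to tear the endpoint down so it does not keep accruing charges (standard SageMaker Predictor cleanup calls, not shown in the snippet above):

# clean up: delete the deployed endpoint and the model
predictor.delete_model()
predictor.delete_endpoint()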