Maximum Context Length Exceeded Due to Base64-Encoded Image in Prompt #24109
Unanswered
tusharraskar asked this question in Q&A
Replies: 1 comment
To handle the base64-encoded image in the prompt without exceeding the model's context length, you can split the image and text parts and process the image separately. Here is a relevant code snippet to help you achieve this:

```python
import base64
import io

from PIL import Image
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI


def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string.

    :param base64_string: A Base64 encoded string of the image to be resized.
    :param size: A tuple representing the new size (width, height) for the image.
    :return: A Base64 encoded string of the resized image.
    """
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))
    resized_img = img.resize(size, Image.LANCZOS)
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)
    return base64.b64encode(buffered.getvalue()).decode("utf-8")


def get_resized_images(docs):
    """
    Resize images from base64-encoded strings.

    :param docs: A list of base64-encoded images to be resized.
    :return: Dict containing a list of resized base64-encoded strings.
    """
    b64_images = []
    for doc in docs:
        if isinstance(doc, Document):
            doc = doc.page_content
        resized_image = resize_base64_image(doc, size=(1280, 720))
        b64_images.append(resized_image)
    return {"images": b64_images}


def img_prompt_func(data_dict, num_images=2):
    """
    GPT-4V prompt for image analysis.

    :param data_dict: A dict with images and a user-provided question.
    :param num_images: Number of images to include in the prompt.
    :return: A list containing message objects for each image and the text prompt.
    """
    messages = []
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"][:num_images]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)
    text_message = {
        "type": "text",
        "text": (
            "You are an analyst tasked with answering questions about visual content.\n"
            "You will be given a set of image(s) from a slide deck / presentation.\n"
            "Use this information to answer the user question. \n"
            f"User-provided question: {data_dict['question']}\n\n"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]


def multi_modal_rag_chain(retriever):
    """
    Multi-modal RAG chain.

    :param retriever: A function that retrieves the necessary context for the model.
    :return: A chain of functions representing the multi-modal RAG process.
    """
    # Initialize the multi-modal Large Language Model with specific parameters
    model = ChatOpenAI(temperature=0, model="gpt-4-vision-preview", max_tokens=1024)

    # Define the RAG pipeline
    chain = (
        {
            "context": retriever | RunnableLambda(get_resized_images),
            "question": RunnablePassthrough(),
        }
        | RunnableLambda(img_prompt_func)
        | model
        | StrOutputParser()
    )
    return chain
```

This code splits the base64-encoded images and texts, processes them separately, and then combines them in a way that keeps the context length within the model's limit [1][2][3].
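If it helps, here is a minimal usage sketch of the chain above. The retriever name and the question text are placeholders; it assumes you already have a retriever whose results are base64-encoded images (e.g. as `Document` page_content):

```python
# Hypothetical usage sketch: `image_retriever` is a placeholder for any retriever
# (e.g. a multi-vector retriever) whose documents contain base64-encoded images.
chain = multi_modal_rag_chain(image_retriever)

# The question string is passed through unchanged, while the retrieved images
# are resized before being attached to the prompt.
answer = chain.invoke("What growth figure is shown on the revenue slide?")
print(answer)
```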
Description
I hosted vLLM on an EC2 instance to extract text data from images using the Phi-3 Vision model. The model is served with the following command:

```bash
python3 -m vllm.entrypoints.openai.api_server --port 8000 --model microsoft/Phi-3-vision-128k-instruct --trust-remote-code --dtype=half --max_model_len=8192
```
When running my code, I encounter a BadRequestError because the maximum context length is exceeded. The error message indicates that the request asked for 254457 tokens, far beyond the model's limit of 8192 tokens.
The base64-encoded image is being counted as part of the prompt, which inflates the token count dramatically and causes the context-length error.
Even if the model's context length were 128k tokens, a sufficiently large base64-encoded image in the prompt could still exceed the limit.
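For context, a minimal sketch of the kind of call that produces this error, assuming a standard ChatOpenAI client pointed at the vLLM OpenAI-compatible endpoint (the URL, image path, and question below are placeholders, not my exact code):

```python
import base64

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

# Placeholder endpoint for the vLLM OpenAI-compatible server started above.
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # assumption: server reachable locally
    api_key="EMPTY",                      # vLLM does not require a real key
    model="microsoft/Phi-3-vision-128k-instruct",
    max_tokens=512,
)

with open("slide.png", "rb") as f:  # hypothetical image file
    b64_image = base64.b64encode(f.read()).decode("utf-8")

message = HumanMessage(
    content=[
        {"type": "text", "text": "Extract the text from this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
    ]
)

# If the server treats the base64 string as plain prompt text, a large image
# easily produces hundreds of thousands of tokens, triggering the error above.
response = llm.invoke([message])
print(response.content)
```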
System Info
langchain==0.2.7
langchain-community==0.2.7
langchain-core==0.2.12
langchain-text-splitters==0.2.2