Replies: 3 comments 1 reply
-
Llama was designed for contexts of at most 2048 tokens; this is a fundamental design decision by Meta and we can't do anything about it.
-
You can see it working by setting a context length of 512 (-c 512) and n_keep to the length of the prompt. It will forget half of what you talked about, but the sliding context works as expected, so it will remember the most recent parts of the conversation. You can also see this implemented in main.cpp (I don't remember the exact line number) :)
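In case it helps, here is a minimal sketch of an invocation along those lines; the model path, prompt file, and the 64-token keep value are illustrative assumptions, not values from the comment above:

```sh
# Small context (-c 512) so the sliding window kicks in quickly.
# --keep preserves the first N tokens of the initial prompt when the context
# is shifted; set it to (roughly) the token length of your prompt.
# Model and prompt paths are placeholders.
./main -m ./models/13B/ggml-model-q4_0.bin \
    -c 512 \
    --keep 64 \
    -n 256 \
    -i \
    -f prompts/chat-with-bob.txt \
    -r "User:"
```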
-
I'm getting good results with the following custom startup script for 13B models, specifically with gpt4-x-alpaca.
With this script you have the option to add some tuning commands to the initial prompt. It usually behaves well if you don't overcomplicate the prompt too much.
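The script itself isn't reproduced in this thread, but a rough sketch of what such a startup script could look like is below; the model path, sampling parameters, and the instruction-style preamble are placeholder assumptions, not the poster's exact settings:

```sh
#!/bin/bash
# Hypothetical launch script for a 13B gpt4-x-alpaca model (paths and values
# are illustrative). The -p prompt carries the "tuning commands" as a preamble.
./main -m ./models/gpt4-x-alpaca-13b/ggml-model-q4_0.bin \
    -c 2048 \
    --keep 128 \
    -n 256 \
    -i \
    --color \
    --temp 0.7 \
    --repeat_penalty 1.1 \
    -r "### Human:" \
    -p "### Instruction: You are a concise, helpful assistant. Answer factually and avoid rambling.
### Human: Hello!
### Assistant:"
```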
-
I've been playing with all the llama/alpaca models in interactive mode as a chatbot, and after around 800 words the output freezes for ~10 minutes before continuing. However, when it does continue, it's as if the starting prompt and everything before it never existed. The model will even start hallucinating input from my end of the conversation, ignoring the reverse prompt.
I've played with every parameter and setting but cannot seem to fix this behavior. I assumed llama.cpp used a rolling window with the option to keep the first N tokens. Maybe it's just a current limitation but I've had a hard time finding others with a similar issue. (or maybe I'm not looking for the right keywords)