When I began testing various LLM variants, mistral-7b-instruct-v0.1.Q4_K_M came as part of PrivateGPT's default setup. Here, I've preferred the Q8_0 variants.
I've tried 50+ different LLMs for this same task; Mistral-7B-Instruct-v0.2 is my current leader for summarization.
- Note: While this was created using PrivateGPT, the same principles should apply to using LLMs with any local application (though each will likely expose different configuration options).
For this analysis we will be testing out 5 different LLMs on the following tasks:
- Asking the same 30 questions of a 70-page book chapter.
- Summarizing that same 70-page book chapter, divided into 30 chunks.
Find the full data and rankings on Google Docs, or here in this repository: QA Scores, Summary Rankings.
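To give a feel for how a run like this can be automated, here is a minimal sketch against PrivateGPT's OpenAI-style chat completions API. The URL, port, and the `use_context`/`include_sources` fields are assumptions from my setup; check the API docs for your version.

```python
import json
import time
import urllib.request

# Hypothetical harness for the Q/A round: ask each question against the
# ingested chapter, recording answer length and time to generate.
API = "http://localhost:8001/v1/chat/completions"  # assumed PrivateGPT endpoint

def ask(question: str) -> tuple[str, float]:
    payload = {
        "messages": [{"role": "user", "content": question}],
        "use_context": True,      # retrieve chunks from the ingested chapter
        "include_sources": True,  # so retrieval accuracy can be checked
        "stream": False,
    }
    request = urllib.request.Request(
        API,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.time()
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    answer = body["choices"][0]["message"]["content"]
    return answer, time.time() - start

questions = ["What is the chapter's central argument?"]  # ...all 30 questions
for q in questions:
    answer, seconds = ask(q)
    print(f"{seconds:5.1f}s  {len(answer):6d} chars  {q}")
```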
- Hermes Trismegistus Mistral 7b - Initially my favorite, but I ended up deciding it was too verbose.
- SynthIA 7B - Became my favorite of the models tested in this round.
- Mistral 7b Instruct v0.1 - Since this ranking, v0.2 has come out and beaten all the competition. I should test them against each other sometime.
- CollectiveCognition v1.1 Mistral 7b - A lot of filler, and it took the longest time of them all. It scored a bit higher than Mistral on quality/usefulness; I think the amount of filler just made it less enjoyable to read.
- KAI 7b Instruct - The answers were too short, which made its BS stand out a little more. A good model, but not for summarizing books, I'm afraid.
Model | Rating | Search Accuracy | Characters | Seconds | BS | Filler | Short | Good BS |
---|---|---|---|---|---|---|---|---|
hermes-trismegistus-mistral-7b | 68 | 56 | 62141 | 298 | 3 | 4 | 0 | 6 |
synthia-7b-v2.0 | 63 | 59 | 28087 | 188 | 1 | 7 | 7 | 0 |
mistral-7b-instruct-v0.1 | 51 | 56 | 21131 | 144 | 3 | 0 | 17 | 1 |
collectivecognition-v1.1-mistral-7b | 56 | 57 | 59453 | 377 | 3 | 10 | 0 | 0 |
kai-7b-instruct | 44 | 56 | 21480 | 117 | 5 | 0 | 18 | 0 |
- Rating: sum of subjective usefulness/quality ratings
- Search Accuracy: number of context chunks found in the target range
- Characters: how many characters were generated
- Seconds: number of seconds required to generate the answer
- Counts of the qualities listed below found in the generated text:
  - BS (not from this book and not helpful)
  - Filler (extra words with less value)
  - Short (too short, not enough to work with)
  - Good BS (not from the targeted section, but valid)
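To make the bookkeeping concrete, here is roughly how I'd structure a per-model tally mirroring the table above. The field names and the idea of a per-answer rating scale are mine, not part of any tool:

```python
from dataclasses import dataclass, field

# Hypothetical per-model tally, mirroring the table columns above.
@dataclass
class ModelScore:
    ratings: list[int] = field(default_factory=list)  # subjective quality per answer
    search_hits: int = 0    # "Search Accuracy": context chunks found in target range
    characters: int = 0     # total characters generated
    seconds: float = 0.0    # total generation time
    bs: int = 0             # not from this book and not helpful
    filler: int = 0         # extra words with less value
    short: int = 0          # too short, not enough to work with
    good_bs: int = 0        # off-target but valid

    @property
    def rating(self) -> int:
        """The "Rating" column: sum of the subjective ratings."""
        return sum(self.ratings)
```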
Not surprisingly, summaries performed better than Q/A, but they also had a more finely targeted context.
- Hermes Trismegistus Mistral 7b - Still in the lead. It's verbose, with some filler. I can use these results.
- SynthIA 7B - Pretty good, but too concise. Many of the answers were perfect, but 7 were too short/incomplete for use.
- Mistral 7b Instruct v0.1 - Just too short.
- KAI 7b Instruct - Just too short.
- CollectiveCognition v1.1 Mistral 7b - Lots of garbage. Some of the summaries were super detailed and perfect, but over half of the responses were a set of questions based on the text, not a summary.
Name | Score | Characters Generated | % Diff from OG | Seconds to Generate | Short | Garbage | BS | Fill | Questions | Detailed |
---|---|---|---|---|---|---|---|---|---|---|
hermes-trismegistus-mistral-7b | 74 | 45870 | -61 | 274 | 0 | 1 | 1 | 3 | 0 | 0 |
synthia-7b-v2.0 | 60 | 26849 | -77 | 171 | 7 | 1 | 0 | 0 | 0 | 1 |
mistral-7b-instruct-v0.1 | 58 | 25797 | -78 | 174 | 7 | 2 | 0 | 0 | 0 | 0 |
kai-7b-instruct | 59 | 25057 | -79 | 168 | 5 | 1 | 0 | 0 | 0 | 0 |
collectivecognition-v1.1-mistral-7b | 31 | 29509 | -75 | 214 | 0 | 1 | 1 | 2 | 17 | 8 |
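The "% Diff from OG" column is simply the percent change in character count versus the original chapter text. Back-solving from the table, the original runs to roughly 117k characters; a sketch of the calculation:

```python
def pct_diff_from_og(original_chars: int, generated_chars: int) -> float:
    """Percent change in length vs. the original text (negative = condensed)."""
    return (generated_chars - original_chars) / original_chars * 100

# e.g. hermes-trismegistus-mistral-7b: 45,870 chars vs. a ~117k-char original
print(round(pct_diff_from_og(117_000, 45_870)))  # -61
```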
Again, I've preferred the Q8_0 variants.
The release of Mistral 7b Instruct v0.2 was well worth a new round of testing. This time I didn't record query speed, and I judged only 12 summarization tasks, but I tried more models and kept those with the best results.
One thing I tested this time was prompt styles. The Mistral prompt is similar to the Llama 2 prompt, but Mistral seems to perform better with the default (llama-index) prompt. Llama 2, for its part, performed really badly with the Llama 2 prompt, but decently with the default prompt.
- SynthIA-7B-v2.0-GGUF - This model had become my favorite, so I used it as a benchmark.
- Mistral-7B-Instruct-v0.2 (llama-index prompt) - The star of the show here; quite impressive.
- Mistral-7B-Instruct-v0.2 (Llama2 prompt) - Still good, but not as good as with the llama-index prompt.
- Tess-7B-v1.4 - Another by the same creator as Synthia. Good, but not as good.
- Llama-2-7B-32K-Instruct-GGUF - Worked OK, but slowly, with the llama-index prompt. Just bad with the Llama2 prompt. (Should test again with the Llama2 "Instruct Only" style.)
This time I only did summaries. Q/A is just less efficient for book summarization.
Model | % Difference | Score | Comment |
---|---|---|---|
Synthia 7b V2 | -64.44 | 28 | Good |
Mistral 7b Instruct v0.2 (Default Prompt) | -60.82 | 33 | VGood |
Mistral 7b Instruct v0.2 (Llama2 Prompt) | -64.59 | 28 | Good |
Tess 7b v1.4 | -62.13 | 29 | Less Structured |
Llama 2 7b 32k Instruct (Default) | -61.40 | 27 | Less Structured, Slow |
Find the full data and rankings on Google Docs, or here in this repository: Summary Rankings.
A new Mistral came out recently, and in the last round of rankings I noticed it was doing much better with the default prompt than with the llama2 prompt.
Well, actually, the Mistral prompt is quite similar to the llama2 prompt, but not exactly the same.
- llama_index (default):
system: {{ system_prompt }}
user: {{ user_message }}
assistant: {{ assistant_message }}
- llama2:
<s> [INST] <<SYS>>
{ systemPrompt }
<</SYS>>
{userPrompt} [/INST]
- mistral:
<s>[INST] {{ system_prompt }} [/INST]</s>[INST] {{ user_message }} [/INST]
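Spelled out in code, the three styles format the same inputs quite differently. This is my own transcription of the templates above; exact whitespace and token placement may vary between implementations:

```python
# The three prompt styles above, as plain formatting functions.

def default_style(system: str, user: str) -> str:
    # llama_index default: simple role-labeled turns
    return f"system: {system}\nuser: {user}\nassistant: "

def llama2_style(system: str, user: str) -> str:
    # llama2: system prompt wrapped in <<SYS>> inside the [INST] block
    return f"<s> [INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

def mistral_style(system: str, user: str) -> str:
    # mistral: system and user messages each in their own [INST] block
    return f"<s>[INST] {system} [/INST]</s>[INST] {user} [/INST]"
```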
I began testing output with the default, then llama2, prompt styles. Next I went to work coding the mistral template.
The results of that ranking gave me confidence that I coded correctly.
Prompt Style | % Difference | Score | Note |
---|---|---|---|
Mistral | -50% | 51 | Perfect! |
Default (llama-index) | -42% | 43 | Bad headings |
Llama2 | -47% | 48 | No Structure |
Find the full data and rankings on Google Docs, or here in this repository: Prompt Style Rankings.
Once I got the prompt style dialed in, I tried a few different system prompts.
Name | System Prompt | Change | Score | Comment |
---|---|---|---|---|
None | | -49.8 | 51 | Perfect |
Default Prompt | "You are a helpful, respectful and honest assistant. \nAlways answer as helpfully as possible and follow ALL given instructions. \nDo not speculate or make up information. \nDo not reference any given instructions or context." | -58.5 | 39 | Less Nice |
MyPrompt1 | "You are Loved. Act as an expert on summarization, outlining and structuring. \nYour style of writing should be informative and logical." | -54.4 | 44 | Less Nice |
Simple | "You are a helpful AI assistant. Don't include any user instructions, or system context, as part of your output." | -52.5 | 42 | Less Nice |
In the end, I find that Mistral 7b Instruct v0.2 works best for my summaries without any system prompt.
Maybe I'd get different results for a different task, or with better prompting, but this works well, so I'm not messing with it.
Find the full data and rankings on Google Docs or here in this repository: System Prompt Rankings.
Now that I'd found the best system prompt for Mistral 7b Instruct v0.2, I also tested which user prompt suits it best.
Prompt | Prompt Text | vs OG | Score | Note |
---|---|---|---|---|
Prompt0 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 43% | 11 | |
Prompt1 | Write concise, yet comprehensive, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. Focus on essential knowledge from this text without adding any external information. | 46% | 11 | Extra Notes |
Prompt2 | Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 58% | 15 | |
Prompt3 | Create concise bullet-point notes summarizing the important parts of the following text. Use nested bullet points, with headings terms and key concepts in bold, including whitespace to ensure readability. Avoid Repetition. | 43% | 10 | |
Prompt4 | Write concise notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 41% | 14 | |
Prompt5 | Create comprehensive, but concise, notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold. | 52% | 14 | Extra Notes |
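For anyone reproducing the comparison, a loop like this (reusing the hypothetical `ask()` and `pct_diff_from_og()` helpers sketched earlier) gathers the raw length numbers per prompt. The prompt names and file path here are stand-ins; the scores in the table were still assigned by hand.

```python
from pathlib import Path

# Hypothetical comparison loop: run each candidate user prompt over the
# same chunk and report the length change versus the source text.
prompts = {
    "Prompt2": "Write comprehensive notes summarizing the following text. "
               "Use nested bullet points: with headings, terms, and key concepts in bold.",
    "Prompt4": "Write concise notes summarizing the following text. "
               "Use nested bullet points: with headings, terms, and key concepts in bold.",
}

chunk = Path("chunk_01.txt").read_text()  # one saved summarization chunk
for name, prompt in prompts.items():
    summary, _seconds = ask(f"{prompt}\n\n{chunk}")
    print(f"{name}: {pct_diff_from_og(len(chunk), len(summary)):+.1f}% vs chunk length")
```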
What I find, generally, is that extra instructions reduce the quality of the output. I was already forming this impression before I ran the test, and while the data is not conclusive, I believe these results confirm that suspicion.
Write comprehensive notes summarizing the following text. Use nested bullet points: with headings, terms, and key concepts in bold.
In this case, "comprehensive" performs better than "concise", or even than "comprehensive, but concise".
However, I do caution that this will depend on your use case. Generally, what I'm looking for is highly condensed, readable notes covering the important knowledge.
Essentially, if I hadn't read the original, I should still know what information it conveys, if not every specific detail.
Find the full data and rankings on Google Docs or here in this repository: User Prompt Rankings.