Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of Large Language Models and for exploring the boundaries and limits of Generative AI.
- News
- Tools
- Datasets / Benchmark
- Demos
- Leaderboards
- Papers
- LLM-List
- LLMOps
- Frameworks for Training
- Courses
- Others
- Other-Awesome-Lists
- Licenses
- Citation
- [2023/09/25] We add ColossalEval from Colossal-AI.
- [2023/09/22] We add the Leaderboard chapter.
- [2023/09/20] We add DeepEval, FinEval, and SuperCLUE-Safety from CLUEbenchmark.
- [2023/09/18] We add OpenCompass from Shanghai AI Lab.
- [2023/06/28] We add AlpacaEval and multiple tools.
- [2023/04/26] We released the V0.1 Eval list with multiple benchmarks.
Name | Institute | Link | Date |
---|---|---|---|
LLM Comparator | Google | A visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. It supports interactive workflows for understanding when and why a model performs better or worse than a baseline model, and how the responses from the two models differ qualitatively. | 2024-02-16 |
Evals | OpenAI | https://github.com/openai/evals | |
lm-evaluation-harness | EleutherAI | lm-evaluation-harness | |
Large language model evaluation and workflow framework from Phase AI | wgryc | phasellm | |
Evaluation benchmark for large language models | FreedomIntelligence | LLMZoo | |
Holistic Evaluation of Language Models (HELM) | Stanford | HELM | |
A lightweight evaluation tool for question-answering | Langchain | auto-evaluator | |
PandaLM: Reproducible and Automated Language Model Assessment | WeOpenML | PandaLM | |
FlagEval | Tsinghua University | FlagEval | |
AlpacaEval | tatsu-lab | AlpacaEval |
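Most of the harnesses in the table above (Evals, lm-evaluation-harness, HELM, OpenCompass, PandaLM) automate variations of the same core loop: score every answer option under the model and check whether the highest-scoring option is the gold one. The sketch below is a minimal, illustrative version of that loop with a HuggingFace causal LM; the model name and toy items are placeholders rather than any listed benchmark, and real harnesses add batching, prompt templating, and per-task metrics.

```python
# Minimal sketch of likelihood-based multiple-choice evaluation, the core loop
# that harnesses such as lm-evaluation-harness automate. Assumes
# `pip install torch transformers`; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the question.
    Simplification: assumes the question tokenization is a prefix of the
    full tokenization, which holds for typical BPE tokenizers."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()  # only the option tokens

# Toy items; real harnesses pull these from the benchmark datasets listed below.
items = [
    {"question": "The capital of France is", "options": ["Paris", "Berlin", "Rome"], "answer": 0},
    {"question": "2 + 2 equals", "options": ["3", "4", "5"], "answer": 1},
]

correct = 0
for item in items:
    scores = [option_logprob(item["question"], opt) for opt in item["options"]]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == item["answer"])
print(f"accuracy = {correct / len(items):.2f}")
```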
Data Name | Institution | Website | Description |
---|---|---|---|
TrustLLM Benchmark | TrustLLM | TrustLLM | TrustLLM is a benchmark for assessing the trustworthiness of large language models. This benchmark encompasses six dimensions of trustworthiness and includes over 30 datasets to comprehensively evaluate the capabilities of LLMs, ranging from simple classification to complex generation tasks. Each dataset presents unique challenges and has been used to benchmark 16 mainstream large language models, including both commercial and open-source models, across multiple dimensions of trustworthiness. |
M3Exam | DAMO | M3Exam | A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models. |
KoLA | THU-KEG | KoLA | Knowledge-oriented LLM Assessment benchmark (KoLA), hosted by Knowledge Engineering Group, Tsinghua University (THU-KEG), aims to benchmark LLMs' world knowledge by meticulously designing data, ability taxonomy, and evaluation metrics. |
promptbench | Microsoft | promptbench | PromptBench is a powerful tool to scrutinize and analyze large language models' interaction with prompts. It simulates black-box adversarial prompt attacks and evaluates model performance. The repository provides code, datasets, and instructions for experiments. |
OpenCompass | Shanghai AI Lab | OpenCompass | OpenCompass is an LLM evaluation platform supporting 20+ models over 50+ datasets for comprehensive benchmarking using efficient distributed evaluation techniques. |
JioNLP-LLM Evaluation Dataset | jionlp | JioNLP-LLM Evaluation Dataset | The JioNLP-LLM Evaluation Dataset is used to evaluate general LLM performance, focusing on their assistance to users and whether they reach the level of a "smart assistant." It includes multiple-choice questions from various professional exams and subjective questions to assess common LLM functions. |
BIG-bench | Google | BIG-bench | BIG-bench consists of 204 tasks spanning linguistic, childhood development, mathematical, commonsense reasoning, biological, physical, societal bias, and software development domains. |
BIG-Bench-Hard | Stanford NLP | BIG-Bench-Hard | BIG-Bench-Hard (BBH) contains 23 challenging tasks, where prior model evaluations didn't surpass human-rater performance. |
SuperCLUE | CLUEbenchmark | SuperCLUE | A Chinese benchmark covering basic, professional, and Chinese-specific abilities with a variety of tasks in semantic understanding, dialogue, logic reasoning, role simulation, coding, and more. |
Safety Eval | Tsinghua University | Safety Eval - Safety Large Model Evaluation | An evaluation set by Tsinghua University covering hate speech, prejudice, crime, privacy, ethics, and more, categorized into 40+ safety categories. |
GAOKAO-Bench | OpenLMLab | GAOKAO-Bench | GAOKAO-bench evaluates the language understanding and logical reasoning abilities of large models using Chinese college entrance examination questions. |
Gaokao | ExpressAI | Gaokao | "GaoKao Benchmark" aims to assess and track our progress in achieving human-level intelligence. It provides a comprehensive evaluation of various tasks and domains for comparison with human performance. |
MMLU | paperswithcode.com | MMLU | The MMLU evaluation dataset covers 57 subjects in STEM, humanities, and social sciences, ranging from elementary to professional levels. |
CMMLU | MBZUAI & ShangHai JiaoTong & Microsoft | CMMLU | Measuring massive multitask language understanding in Chinese |
MMCU | Oracle AI Research | MMCU | MMCU evaluates Chinese large models' performance in medical, legal, psychological, and educational domains. |
AGIEval | Microsoft Research | AGIEval | AGIEval comprehensively evaluates base models' cognitive and problem-solving abilities using various official entrance and professional qualification exams. |
C_Eval | SJTU, Tsinghua, University of Edinburgh | C_Eval | C_Eval evaluates models' higher-level knowledge and reasoning abilities across 52 disciplines. |
XieZhi | Fudan University | XieZhi | XieZhi is a comprehensive evaluation suite for Language Models, spanning various disciplines and difficulty levels. |
MT-bench | Multiple Universities | MT-bench | MT-bench is a benchmark with 80 high-quality multi-turn questions designed to test multi-turn conversation and instruction-following ability. |
GLUE Benchmark | Multiple Institutions | GLUE Benchmark | GLUE Benchmark evaluates models' performance in various tasks like grammar, paraphrasing, text similarity, inference, textual entailment, and pronoun resolution. |
OpenAI Moderation API | OpenAI | OpenAI Moderation API | Filters harmful or unsafe content. |
GSM8K | OpenAI | GSM8K | GSM8K is a dataset of linguistically diverse grade school math word problems, testing mathematical problem-solving abilities. |
EleutherAI LM Eval | EleutherAI | EleutherAI LM Eval | Evaluates model performance with few-shot tasks and fine-tuning across multiple tasks. |
OpenAI Evals | OpenAI | OpenAI Evals | Evaluates generated text for accuracy, diversity, consistency, robustness, transferability, efficiency, and fairness. |
AlpacaEval | tatsu-lab | AlpacaEval | An automatic evaluation based on AlpacaFarm evaluation set, comparing responses with reference answers. |
Adversarial NLI (ANLI) | Facebook AI Research, others | Adversarial NLI (ANLI) | Evaluates model robustness, generalization, inference explanations, and efficiency under adversarial samples. |
LIT (Language Interpretability Tool) | Google | LIT | Provides a platform to evaluate and analyze model strengths, weaknesses, and potential biases based on user-defined metrics. |
ParlAI | Facebook AI Research | ParlAI | Evaluates model performance in terms of accuracy, F1 score, perplexity, human ratings, speed, robustness, and generalization. |
CoQA | Stanford NLP Group | CoQA | Evaluates models' comprehension of paragraphs and answering related questions in a conversational context. |
LAMBADA | University of Trento, Fondazione Bruno Kessler | LAMBADA | Measures models' long-term understanding by predicting the last word of paragraphs. |
HellaSwag | University of Washington, Allen Institute for AI | HellaSwag | Evaluates models' reasoning abilities using counterfactual statements. |
LogiQA | Tsinghua University, Microsoft Research Asia | LogiQA | Evaluates models' logical reasoning abilities. |
MultiNLI | Multiple Institutions | MultiNLI | Evaluates models' ability to understand relationships between sentences from different genres. |
SQUAD | Stanford NLP Group | SQUAD | Evaluates models' reading comprehension abilities. |
Open LLM Leaderboard | HuggingFace | Leaderboard | HuggingFace's LLM evaluation leaderboard covering AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA datasets. |
chinese-llm-benchmark | jeinlee1991 | llm-benchmark | Chinese LLM benchmark covering various open-source models and multidimensional evaluations. |
AlpacaEval | tatsu-lab | AlpacaEval | LLM-based automatic evaluation for open-source models' performance. |
Huggingface Open LLM Leaderboard | huggingface | HF Open LLM Leaderboard | Evaluates open-source models on four evaluation sets: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA. |
lmsys-arena | Berkeley | lmsys Ranking | Rankings based on the Elo rating mechanism: GPT4 > Claude > GPT3.5 > Vicuna > others. |
CMU Open-Source Chatbot Evaluation | CMU | zeno-build | Evaluates models in dialogue scenarios, ranking ChatGPT > Vicuna > others. |
Z-Bench Chinese ZhenFund Evaluation | ZhenFund | Z-Bench | Evaluates Chinese models, with minor differences; improvements in ChatGLM 6B versions. |
Chain-of-thought Evaluation | Yao Fu | COT Evaluation | Rankings include GSM8k, MATH, and complex problems. |
InfoQ Large Model Comprehensive Evaluation | InfoQ | InfoQ Evaluation | Chinese-oriented ranking including ChatGPT, 文心一言 (ERNIE Bot), Claude, and 星火 (Spark). |
ToolBench Tool Invocation Evaluation | BAAI / Tsinghua | ToolBench | Compares models' tool-use performance against tool-tuned models and ChatGPT. |
AgentBench Inference Decision Evaluation | THUDM | AgentBench | Evaluates models' inference and decision-making abilities in various scenarios like shopping, home, and operating systems. |
FlagEval | BAAI / Tsinghua | FlagEval | Provides an LLM ranking using both subjective and objective scores. |
ChatEval | THU-NLP | ChatEval | Simplifies human evaluation of generated text by involving human raters in discussions. |
Zhujiu | Institute of Automation, CAS | Zhujiu | Multidimensional evaluation covering 7 ability dimensions and 51 tasks in both Chinese and English. |
LucyEval | Oracle | LucyEval | Evaluates Chinese large models' maturity using objective tests across various abilities. |
Do-Not-Answer | Libr-AI | Do-Not-Answer | Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models should not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 annotation. |
ColossalEval | Colossal-AI | ColossalEval | ColossalEval provides a unified pipeline for evaluating language models on public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. |
SmartPlay | microsoft | SmartPlay | SmartPlay is a benchmark for Large Language Models (LLMs). It is designed to be easy to use, and to provide a wide variety of games to test agents on. |
LVLM-eHub | OpenGVLab | LVLM-eHub | Multi-Modality Arena is an evaluation platform for large multi-modality models. Following Fastchat, two anonymous models side-by-side are compared on a visual question-answering task. The Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more. |
BLURB | Mindrank AI | BLURB | BLURB is a comprehensive benchmark for PubMed-based biomedical NLP applications, together with a leaderboard for tracking community progress. BLURB includes thirteen publicly available datasets across six diverse tasks. To avoid placing undue emphasis on tasks with many available datasets, such as named entity recognition (NER), BLURB reports the macro average across all tasks as the main score. The BLURB leaderboard is model-agnostic: any system capable of producing the test predictions using the same training and development data can participate. |
SWE-bench | princeton-nlp | SWE-bench | SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. |
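Many of the benchmarks above reduce to a simple extract-and-compare scoring loop. As one concrete example, GSM8K reference answers end with a final line of the form `#### <number>`, so exact-match scoring only needs to pull that number out of the reference and out of the model's output. A minimal sketch, assuming the HuggingFace `datasets` package; the `model_answer()` stub is a placeholder for whatever system is being evaluated.

```python
# Minimal sketch of GSM8K-style exact-match scoring.
# Assumes `pip install datasets`; model_answer() is a placeholder for the
# system under evaluation (an API call, a local model, ...).
import re
from datasets import load_dataset

def extract_final_number(text):
    """GSM8K reference answers end with '#### <number>'; for model output we
    fall back to the last number that appears in the text."""
    if "####" in text:
        text = text.split("####")[-1]
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def model_answer(question):
    # Placeholder: replace with the model you actually want to score.
    return "Thinking step by step ... so the answer is 42."

dataset = load_dataset("gsm8k", "main", split="test")
subset = dataset.select(range(20))  # small slice just for illustration
correct = sum(
    extract_final_number(model_answer(ex["question"])) == extract_final_number(ex["answer"])
    for ex in subset
)
print(f"exact match on slice: {correct}/{len(subset)}")
```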
- Chat Arena: an open-source AI "anonymous" arena where two models are shown side-by-side and you vote for the better response. You act as a referee, rating the responses of two models whose identities are hidden in advance; after you score them, their real identities are revealed. Current "participants" include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.
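Arena-style rankings such as the Chat Arena above and the lmsys leaderboard listed in the tables below are typically computed from pairwise human votes with an Elo-style rating. A minimal sketch of the update rule follows; the K-factor, initial rating, and toy votes are illustrative only, not the exact scheme any particular leaderboard uses.

```python
# Minimal Elo-style rating update from pairwise votes, as used (in spirit)
# by arena-style leaderboards. Constants are illustrative only.
K = 32
INITIAL = 1000.0

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model_a": INITIAL, "model_b": INITIAL}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]  # toy votes
for a, b, s in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], s)
print(ratings)
```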
Models | MMLU 0-shot | MMLU 1-shot | MMLU 3-shot | CEval 0-shot | CEval 1-shot | CEval 3-shot | M3KE 0-shot | Xiezhi-Spec.-Chinese 0-shot | Xiezhi-Spec.-Chinese 1-shot | Xiezhi-Spec.-Chinese 3-shot | Xiezhi-Inter.-Chinese 0-shot | Xiezhi-Inter.-Chinese 1-shot | Xiezhi-Inter.-Chinese 3-shot | Xiezhi-Spec.-English 0-shot | Xiezhi-Spec.-English 1-shot | Xiezhi-Spec.-English 3-shot | Xiezhi-Inter.-English 0-shot | Xiezhi-Inter.-English 1-shot | Xiezhi-Inter.-English 3-shot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Random-Guess | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 |
Generation Probability For Ranking | |||||||||||||||||||
Bloomz-560m | 0.111 | 0.109 | 0.119 | 0.124 | 0.117 | 0.103 | 0.126 | 0.123 | 0.127 | 0.124 | 0.130 | 0.138 | 0.140 | 0.113 | 0.116 | 0.123 | 0.124 | 0.117 | 0.160 |
Bloomz-1b1 | 0.131 | 0.116 | 0.128 | 0.107 | 0.115 | 0.110 | 0.082 | 0.138 | 0.108 | 0.107 | 0.117 | 0.125 | 0.123 | 0.130 | 0.119 | 0.114 | 0.144 | 0.129 | 0.145 |
Bloomz-1b7 | 0.107 | 0.117 | 0.164 | 0.054 | 0.058 | 0.103 | 0.102 | 0.165 | 0.151 | 0.159 | 0.152 | 0.214 | 0.170 | 0.133 | 0.140 | 0.144 | 0.150 | 0.149 | 0.209 |
Bloomz-3b | 0.139 | 0.084 | 0.146 | 0.168 | 0.182 | 0.194 | 0.063 | 0.186 | 0.154 | 0.168 | 0.151 | 0.180 | 0.182 | 0.201 | 0.155 | 0.156 | 0.175 | 0.164 | 0.158 |
Bloomz-7b1 | 0.167 | 0.160 | 0.205 | 0.074 | 0.072 | 0.073 | 0.073 | 0.154 | 0.178 | 0.162 | 0.148 | 0.160 | 0.156 | 0.176 | 0.153 | 0.207 | 0.217 | 0.204 | 0.229 |
Bloomz-7b1-mt | 0.189 | 0.196 | 0.210 | 0.077 | 0.078 | 0.158 | 0.072 | 0.163 | 0.175 | 0.154 | 0.155 | 0.195 | 0.164 | 0.180 | 0.146 | 0.219 | 0.228 | 0.171 | 0.232 |
Bloomz-7b1-p3 | 0.066 | 0.059 | 0.075 | 0.071 | 0.070 | 0.072 | 0.081 | 0.177 | 0.198 | 0.158 | 0.183 | 0.173 | 0.170 | 0.130 | 0.130 | 0.162 | 0.157 | 0.132 | 0.134 |
Bloomz | 0.051 | 0.066 | 0.053 | 0.142 | 0.166 | 0.240 | 0.098 | 0.185 | 0.133 | 0.277 | 0.161 | 0.099 | 0.224 | 0.069 | 0.082 | 0.056 | 0.058 | 0.055 | 0.049 |
Bloomz-mt | 0.266 | 0.264 | 0.248 | 0.204 | 0.164 | 0.151 | 0.161 | 0.253 | 0.198 | 0.212 | 0.213 | 0.189 | 0.184 | 0.379 | 0.396 | 0.394 | 0.383 | 0.405 | 0.398 |
Bloomz-p3 | 0.115 | 0.093 | 0.057 | 0.118 | 0.137 | 0.140 | 0.115 | 0.136 | 0.095 | 0.105 | 0.086 | 0.065 | 0.098 | 0.139 | 0.097 | 0.069 | 0.176 | 0.141 | 0.070 |
llama-7b | 0.125 | 0.132 | 0.093 | 0.133 | 0.106 | 0.110 | 0.158 | 0.152 | 0.141 | 0.117 | 0.142 | 0.135 | 0.128 | 0.159 | 0.165 | 0.161 | 0.194 | 0.183 | 0.176 |
llama-13b | 0.166 | 0.079 | 0.135 | 0.152 | 0.181 | 0.169 | 0.131 | 0.133 | 0.241 | 0.243 | 0.211 | 0.202 | 0.303 | 0.154 | 0.183 | 0.215 | 0.174 | 0.216 | 0.231 |
llama-30b | 0.076 | 0.107 | 0.073 | 0.079 | 0.119 | 0.082 | 0.079 | 0.140 | 0.206 | 0.162 | 0.186 | 0.202 | 0.183 | 0.110 | 0.195 | 0.161 | 0.088 | 0.158 | 0.219 |
llama-65b | 0.143 | 0.121 | 0.100 | 0.154 | 0.141 | 0.168 | 0.125 | 0.142 | 0.129 | 0.084 | 0.108 | 0.077 | 0.077 | 0.183 | 0.204 | 0.172 | 0.133 | 0.191 | 0.157 |
baize-7b (lora) | 0.129 | 0.091 | 0.079 | 0.194 | 0.180 | 0.206 | 0.231 | 0.216 | 0.148 | 0.123 | 0.173 | 0.158 | 0.198 | 0.182 | 0.190 | 0.194 | 0.218 | 0.188 | 0.209 |
baize-7b-healthcare (lora) | 0.130 | 0.121 | 0.106 | 0.178 | 0.174 | 0.178 | 0.203 | 0.178 | 0.146 | 0.123 | 0.266 | 0.107 | 0.118 | 0.175 | 0.164 | 0.173 | 0.197 | 0.231 | 0.198 |
baize-13b (lora) | 0.131 | 0.111 | 0.171 | 0.184 | 0.178 | 0.195 | 0.155 | 0.158 | **0.221** | 0.256 | 0.208 | 0.200 | 0.219 | 0.176 | 0.189 | 0.239 | 0.187 | 0.185 | 0.274 |
baize-30b (lora) | 0.193 | 0.216 | 0.207 | 0.191 | 0.196 | 0.121 | 0.071 | 0.109 | 0.212 | 0.190 | 0.203 | 0.256 | 0.200 | 0.167 | 0.235 | 0.168 | 0.072 | 0.180 | 0.193 |
Belle-0.2M | 0.127 | 0.148 | 0.243 | 0.053 | 0.063 | 0.136 | 0.076 | 0.172 | 0.126 | 0.153 | 0.171 | 0.165 | 0.147 | 0.206 | 0.146 | 0.148 | 0.217 | 0.150 | 0.173 |
Belle-0.6M | 0.091 | 0.114 | 0.180 | 0.082 | 0.080 | 0.090 | 0.075 | 0.188 | 0.149 | 0.198 | 0.188 | 0.188 | 0.175 | 0.173 | 0.172 | 0.183 | 0.193 | 0.184 | 0.196 |
Belle-1M | 0.137 | 0.126 | 0.162 | 0.066 | 0.065 | 0.072 | 0.066 | 0.170 | 0.152 | 0.147 | 0.173 | 0.176 | 0.197 | 0.211 | 0.137 | 0.149 | 0.207 | 0.151 | 0.185 |
Belle-2M | 0.127 | 0.148 | 0.132 | 0.058 | 0.063 | 0.136 | 0.057 | 0.163 | 0.166 | 0.130 | 0.159 | 0.177 | 0.163 | 0.155 | 0.106 | 0.166 | 0.151 | 0.150 | 0.138 |
chatglm-6B | 0.099 | 0.109 | 0.112 | 0.084 | 0.074 | 0.114 | 0.115 | 0.082 | 0.097 | 0.147 | 0.104 | 0.111 | 0.144 | 0.106 | 0.120 | 0.124 | 0.099 | 0.079 | 0.097 |
doctorglm-6b | 0.093 | 0.076 | 0.065 | 0.037 | 0.085 | 0.051 | 0.038 | 0.062 | 0.068 | 0.044 | 0.047 | 0.056 | 0.043 | 0.069 | 0.053 | 0.043 | 0.106 | 0.059 | 0.059 |
moss-base-16B | 0.072 | 0.050 | 0.062 | 0.115 | 0.048 | 0.052 | 0.099 | 0.105 | 0.051 | 0.059 | 0.123 | 0.054 | 0.058 | 0.124 | 0.077 | 0.080 | 0.121 | 0.058 | 0.063 |
moss-sft-16B | 0.064 | 0.065 | 0.051 | 0.063 | 0.062 | 0.072 | 0.075 | 0.072 | 0.067 | 0.068 | 0.073 | 0.081 | 0.066 | 0.071 | 0.070 | 0.059 | 0.074 | 0.084 | 0.075 |
vicuna-7b | 0.051 | 0.051 | 0.029 | 0.063 | 0.071 | 0.064 | 0.059 | 0.169 | 0.171 | 0.165 | 0.134 | 0.201 | 0.213 | 0.182 | 0.209 | 0.195 | 0.200 | 0.214 | 0.182 |
vicuna-13b | 0.109 | 0.104 | 0.066 | 0.060 | 0.131 | 0.131 | 0.067 | 0.171 | 0.167 | 0.166 | 0.143 | 0.147 | 0.178 | 0.121 | 0.139 | 0.128 | 0.158 | 0.174 | 0.191 |
alpaca-7b | 0.135 | 0.170 | 0.202 | 0.137 | 0.119 | 0.113 | 0.142 | 0.129 | 0.139 | 0.123 | 0.178 | 0.104 | 0.097 | 0.189 | 0.179 | 0.128 | 0.200 | 0.185 | 0.149 |
pythia-1.4b | 0.124 | 0.127 | 0.121 | 0.108 | 0.132 | 0.138 | 0.083 | 0.125 | 0.128 | 0.135 | 0.111 | 0.146 | 0.135 | 0.158 | 0.124 | 0.124 | 0.166 | 0.126 | 0.118 |
pythia-2.8b | 0.103 | 0.110 | 0.066 | 0.064 | 0.089 | 0.122 | 0.086 | 0.114 | 0.120 | 0.131 | 0.091 | 0.113 | 0.112 | 0.126 | 0.118 | 0.112 | 0.110 | 0.145 | 0.107 |
pythia-6.9b | 0.115 | 0.070 | 0.084 | 0.078 | 0.073 | 0.094 | 0.073 | 0.086 | 0.094 | 0.092 | 0.097 | 0.098 | 0.085 | 0.091 | 0.088 | 0.083 | 0.099 | 0.099 | 0.096 |
pythia-12b | 0.075 | 0.059 | 0.066 | 0.077 | 0.097 | 0.078 | 0.098 | 0.102 | 0.126 | 0.132 | 0.125 | 0.147 | 0.159 | 0.079 | 0.098 | 0.110 | 0.094 | 0.120 | 0.120 |
gpt-neox-20b | 0.081 | 0.132 | 0.086 | 0.086 | 0.096 | 0.069 | 0.094 | 0.140 | 0.103 | 0.109 | 0.120 | 0.098 | 0.085 | 0.088 | 0.101 | 0.116 | 0.099 | 0.113 | 0.156 |
h2ogpt-12b | 0.075 | 0.087 | 0.078 | 0.080 | 0.078 | 0.094 | 0.070 | 0.065 | 0.047 | 0.073 | 0.076 | 0.061 | 0.091 | 0.088 | 0.050 | 0.065 | 0.105 | 0.063 | 0.067 |
h2ogpt-20b | 0.114 | 0.098 | 0.110 | 0.094 | 0.084 | 0.061 | 0.096 | 0.108 | 0.080 | 0.073 | 0.086 | 0.081 | 0.072 | 0.108 | 0.068 | 0.086 | 0.109 | 0.071 | 0.079 |
dolly-3b | 0.066 | 0.060 | 0.055 | 0.079 | 0.083 | 0.077 | 0.066 | 0.100 | 0.090 | 0.083 | 0.091 | 0.093 | 0.085 | 0.079 | 0.063 | 0.077 | 0.076 | 0.074 | 0.084 |
dolly-7b | 0.095 | 0.068 | 0.052 | 0.091 | 0.079 | 0.070 | 0.108 | 0.108 | 0.089 | 0.092 | 0.111 | 0.095 | 0.100 | 0.096 | 0.059 | 0.086 | 0.123 | 0.085 | 0.090 |
dolly-12b | 0.095 | 0.068 | 0.093 | 0.085 | 0.071 | 0.073 | 0.114 | 0.098 | 0.106 | 0.103 | 0.094 | 0.114 | 0.106 | 0.086 | 0.088 | 0.098 | 0.088 | 0.102 | 0.116 |
stablelm-3b | 0.070 | 0.085 | 0.071 | 0.086 | 0.082 | 0.099 | 0.096 | 0.101 | 0.087 | 0.091 | 0.083 | 0.092 | 0.067 | 0.069 | 0.089 | 0.081 | 0.066 | 0.085 | 0.088 |
stablelm-7b | 0.158 | 0.118 | 0.093 | 0.133 | 0.102 | 0.093 | 0.140 | 0.085 | 0.118 | 0.122 | 0.123 | 0.130 | 0.095 | 0.123 | 0.103 | 0.100 | 0.134 | 0.121 | 0.105 |
falcon-7b | 0.048 | 0.046 | 0.051 | 0.046 | 0.051 | 0.052 | 0.050 | 0.077 | 0.096 | 0.112 | 0.129 | 0.141 | 0.142 | 0.124 | 0.103 | 0.107 | 0.198 | 0.200 | 0.205 |
falcon-7b-instruct | 0.078 | 0.095 | 0.106 | 0.114 | 0.095 | 0.079 | 0.104 | 0.075 | 0.083 | 0.087 | 0.060 | 0.133 | 0.123 | 0.160 | 0.203 | 0.156 | 0.141 | 0.167 | 0.152 |
falcon-40b | 0.038 | 0.043 | 0.077 | 0.085 | 0.090 | 0.129 | 0.087 | 0.069 | 0.056 | 0.053 | 0.065 | 0.063 | 0.058 | 0.059 | 0.077 | 0.066 | 0.085 | 0.063 | 0.076 |
falcon-40b-instruct | 0.126 | 0.123 | 0.121 | 0.070 | 0.080 | 0.068 | 0.141 | 0.103 | 0.085 | 0.079 | 0.115 | 0.082 | 0.081 | 0.118 | 0.143 | 0.124 | 0.083 | 0.108 | 0.104 |
Instruction For Ranking | |||||||||||||||||||
ChatGPT | 0.240 | 0.298 | 0.371 | 0.286 | 0.289 | 0.360 | 0.290 | 0.218 | 0.352 | 0.414 | 0.266 | 0.418 | 0.487 | 0.217 | 0.361 | 0.428 | 0.305 | 0.452 | 0.517 |
GPT-4 | 0.402 | 0.415 | 0.517 | 0.413 | 0.410 | 0.486 | 0.404 | 0.392 | 0.429 | 0.490 | 0.453 | 0.496 | 0.565 | 0.396 | 0.434 | 0.495 | 0.463 | 0.506 | 0.576 |
Statistic | |||||||||||||||||||
Performance-Average | 0.120 | 0.117 | 0.125 | 0.113 | 0.114 | 0.124 | 0.111 | 0.140 | 0.140 | 0.145 | 0.144 | 0.148 | 0.152 | 0.145 | 0.145 | 0.150 | 0.156 | 0.157 | 0.166 |
Performance-Variance | 0.062 | 0.068 | 0.087 | 0.067 | 0.065 | 0.078 | 0.064 | 0.058 | 0.070 | 0.082 | 0.067 | 0.082 | 0.095 | 0.067 | 0.080 | 0.090 | 0.078 | 0.092 | 0.104 |
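The 0-shot / 1-shot / 3-shot columns in the table above refer to how many solved exemplars are prepended to the test question before asking the model to answer. A minimal sketch of k-shot prompt construction follows; the template is illustrative, since each benchmark (MMLU, C-Eval, Xiezhi, ...) defines its own official format.

```python
# Minimal sketch of k-shot prompt construction for multiple-choice evaluation.
# The template is illustrative; each benchmark defines its own official format.
def format_item(question, options, answer=None):
    """Render one multiple-choice item, with or without the gold answer."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(exemplars, test_item, k):
    """Prepend k solved exemplars (the 'shots') before the unanswered test item."""
    shots = [format_item(e["question"], e["options"], e["answer"]) for e in exemplars[:k]]
    return "\n\n".join(shots + [format_item(test_item["question"], test_item["options"])])

exemplars = [
    {"question": "1 + 1 = ?", "options": ["1", "2", "3"], "answer": "B"},
    {"question": "The capital of France is ...", "options": ["Berlin", "Paris"], "answer": "B"},
]
test_item = {"question": "2 + 3 = ?", "options": ["4", "5", "6"]}
print(build_prompt(exemplars, test_item, k=1))  # k=0 gives the 0-shot prompt
```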
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu and Chenguang Zhu
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, by Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji et al.
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?, by Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga and Diyi Yang
- ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots, by Reham Omar, Omij Mangukiya, Panos Kalnis and Essam Mansour
- Mathematical Capabilities of ChatGPT, by Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier and Julius Berner
- Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization, by Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen and Wei Cheng
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, by Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang et al.
- ChatGPT is not all you need. A State of the Art Review of large Generative AI models, by Roberto Gozalo-Brizuela and Eduardo C. Garrido-Merchán
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT, by Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du and Dacheng Tao
- Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions, by Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen and Guilin Qi
- ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models, by Ning Bian, Xianpei Han, Le Sun, Hongyu Lin, Yaojie Lu and Ben He
- Holistic Evaluation of Language Models, by Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan et al.
- Evaluating the Text-to-SQL Capabilities of Large Language Models, by Nitarshan Rajkumar, Raymond Li and Dzmitry Bahdanau
- Are Visual-Linguistic Models Commonsense Knowledge Bases?, by Hsiu-Yu Yang and Carina Silberer
- Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective, by Xingxuan Li, Yutong Li, Linlin Liu, Lidong Bing and Shafiq R. Joty
- GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models, by Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li and Kai-Wei Chang
- RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners, by Soumya Sanyal, Zeyi Liao and Xiang Ren
- A Systematic Evaluation of Large Language Models of Code, by Frank F. Xu, Uri Alon, Graham Neubig and Vincent J. Hellendoorn
- Evaluating Large Language Models Trained on Code, by Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda et al.
- GLGE: A New General Language Generation Evaluation Benchmark, by Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu et al.
- Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews, by Mohammad Abdul Hadi and Fatemeh H. Fard
- Do Language Models Perform Generalizable Commonsense Inference?, by Peifeng Wang, Filip Ilievski, Muhao Chen and Xiang Ren
- RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms, by Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara and Xiang Ren
- Evaluation of Text Generation: A Survey, by Asli Celikyilmaz, Elizabeth Clark and Jianfeng Gao
- Neural Language Generation: Formulation, Methods, and Evaluation, by Cristina Garbacea and Qiaozhu Mei
- BERTScore: Evaluating Text Generation with BERT, by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi
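Several of the papers above introduce automatic metrics; BERTScore, for instance, is available as a pip package. A minimal usage sketch (the candidate and reference strings are placeholders):

```python
# Minimal BERTScore usage sketch; install with `pip install bert-score`.
# The candidate and reference sentences are placeholders.
from bert_score import score

candidates = ["The model answers the question correctly."]
references = ["The model gives a correct answer to the question."]

# Returns precision, recall, and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.4f}")
```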
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Switch Transformer | 1.6T | Decoder(MOE) | - | 2021-01 | Paper |
GLaM | 1.2T | Decoder(MOE) | - | 2021-12 | Paper |
PaLM | 540B | Decoder | - | 2022-04 | Paper |
MT-NLG | 530B | Decoder | - | 2022-01 | Paper |
J1-Jumbo | 178B | Decoder | api | 2021-08 | Paper |
OPT | 175B | Decoder | api / ckpt | 2022-05 | Paper |
BLOOM | 176B | Decoder | api / ckpt | 2022-11 | Paper |
GPT 3.0 | 175B | Decoder | api | 2020-05 | Paper |
LaMDA | 137B | Decoder | - | 2022-01 | Paper |
GLM | 130B | Decoder | ckpt | 2022-10 | Paper |
YaLM | 100B | Decoder | ckpt | 2022-06 | Blog |
LLaMA | 65B | Decoder | ckpt | 2023-02 | Paper |
GPT-NeoX | 20B | Decoder | ckpt | 2022-04 | Paper |
UL2 | 20B | agnostic | ckpt | 2022-05 | Paper |
鹏程.盘古α | 13B | Decoder | ckpt | 2021-04 | Paper |
T5 | 11B | Encoder-Decoder | ckpt | 2019-10 | Paper |
CPM-Bee | 10B | Decoder | api | 2022-10 | Paper |
rwkv-4 | 7B | RWKV | ckpt | 2022-09 | Github |
GPT-J | 6B | Decoder | ckpt | 2022-09 | Github |
GPT-Neo | 2.7B | Decoder | ckpt | 2021-03 | Github |
GPT-Neo | 1.3B | Decoder | ckpt | 2021-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Flan-PaLM | 540B | Decoder | - | 2022-10 | Paper |
BLOOMZ | 176B | Decoder | ckpt | 2022-11 | Paper |
InstructGPT | 175B | Decoder | api | 2022-03 | Paper |
Galactica | 120B | Decoder | ckpt | 2022-11 | Paper |
OpenChatKit | 20B | - | ckpt | 2023-3 | - |
Flan-UL2 | 20B | Decoder | ckpt | 2023-03 | Blog |
Gopher | - | - | - | - | - |
Chinchilla | - | - | - | - | - |
Flan-T5 | 11B | Encoder-Decoder | ckpt | 2022-10 | Paper |
T0 | 11B | Encoder-Decoder | ckpt | 2021-10 | Paper |
Alpaca | 7B | Decoder | demo | 2023-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
GPT 4 | - | - | - | 2023-03 | Blog |
ChatGPT | - | Decoder | demo / api | 2022-11 | Blog |
Sparrow | 70B | - | - | 2022-09 | Paper |
Claude | - | - | demo / api | 2023-03 | Blog |
- LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA
- Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
- Flan-Alpaca - Instruction Tuning from Humans and Machines.
- Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
- Cabrita - A Portuguese instruction-finetuned LLaMA.
- Vicuna - An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.
- Llama-X - Open academic research on improving LLaMA to a SOTA LLM.
- Chinese-Vicuna - A Chinese instruction-following LLaMA-based model.
- GPTQ-for-LLaMA - 4-bit quantization of LLaMA using GPTQ.
- GPT4All - Demo, data, and code to train open-source assistant-style large language models based on GPT-J and LLaMA.
- Koala - A dialogue model for academic research.
- BELLE - Be Everyone's Large Language model Engine.
- StackLLaMA - A hands-on guide to training LLaMA with RLHF.
- RedPajama - An open-source recipe to reproduce the LLaMA training dataset.
- Chimera - Latin Phoenix.
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model. BLOOM-LoRA
- BLOOMZ&mT0 - A family of models capable of following human instructions in dozens of languages zero-shot.
- Phoenix
- T5 - Text-to-Text Transfer Transformer.
- T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization.
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - A unified framework for pretraining models that are universally effective across datasets and setups.
- GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
- ChatGLM-6B - ChatGLM-6B is an open-source bilingual (Chinese-English) conversational language model built upon the General Language Model (GLM) architecture, with 6.2 billion parameters. (A minimal loading sketch follows this list.)
- ChatGLM2-6B - The second-generation version of the open-source bilingual dialogue model ChatGLM-6B. ChatGLM2-6B retains the strengths of the first-generation model, such as smooth conversation and a low deployment threshold, while introducing longer context, better performance, and more efficient inference. The project is licensed under the MIT License.
- RWKV - Parallelizable RNN with Transformer-level LLM performance.
- ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
- StableLM - Stability AI language models.
- YaLM - A GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
- GPT-Neo - An implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library.
- GPT-J - A 6-billion-parameter autoregressive text generation model trained on The Pile.
- Dolly - A cheap-to-build LLM that exhibits a surprising degree of the instruction-following capabilities exhibited by ChatGPT.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale.
- Dolly 2.0 - The first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- OpenFlamingo - An open-source reproduction of DeepMind's Flamingo model.
- Cerebras-GPT - A family of open, compute-efficient large language models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
- GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - A state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
- PanGu-α - PanGu-α is a 200B-parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, the MindSpore team, and Peng Cheng Laboratory.
- MOSS - MOSS is an open-source conversational language model that supports Chinese-English bilingual dialogue and a variety of plugins.
- Open-Assistant - A project meant to give everyone access to a great chat-based large language model.
- HuggingChat - Powered by Open Assistant's latest model, currently among the best open-source chat models, and the @huggingface Inference API.
- Baichuan - An open-source, commercially usable large language model developed by Baichuan Intelligent Technology as the follow-up to Baichuan-7B, containing 13 billion parameters. (20230715)
- Qwen - Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. (20230803)
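As noted in the ChatGLM-6B entry above, the sketch below shows how one of the open chat models in this list can be loaded, following the usage documented on the ChatGLM-6B model card. It assumes a CUDA GPU and `pip install transformers sentencepiece`; exact dependency versions may have changed since the model's release.

```python
# Minimal sketch of loading an open chat model (ChatGLM-6B) with transformers,
# following the model card's documented usage. Assumes a CUDA GPU.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# The model card exposes a chat() helper that keeps track of the dialogue history.
response, history = model.chat(tokenizer, "Hello, please introduce yourself.", history=[])
print(response)
```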
Model | #Author | #Link | #Parameter | Base Model | #Layer | #Encoder | #Decoder | #Pretrain Tokens | #IFT Sample | RLHF |
---|---|---|---|---|---|---|---|---|---|---|
GPT3-Ada | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 0.35B | - | 24 | - | 24 | - | - | - |
Pythia-1B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-1b | 1B | - | 16 | - | 16 | 300B tokens | - | - |
GPT3-Babbage | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 1.3B | - | 24 | - | 24 | - | - | - |
GPT2-XL | radford2019language | https://huggingface.co/gpt2-xl | 1.5B | - | 48 | - | 48 | 40B tokens | - | - |
BLOOM-1b7 | scao2022bloom | https://huggingface.co/bigscience/bloom-1b7 | 1.7B | - | 24 | - | 24 | 350B tokens | - | - |
BLOOMZ-1b7 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-1b7 | 1.7B | BLOOM-1b7 | 24 | - | 24 | - | 8.39B tokens | - |
Dolly-v2-3b | 2023dolly | https://huggingface.co/databricks/dolly-v2-3b | 2.8B | Pythia-2.8B | 32 | - | 32 | - | 15K | - |
Pythia-2.8B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-2.8b | 2.8B | - | 32 | - | 32 | 300B tokens | - | - |
BLOOM-3b | scao2022bloom | https://huggingface.co/bigscience/bloom-3b | 3B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-3b | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-3b | 3B | BLOOM-3b | 30 | - | 30 | - | 8.39B tokens | - |
StableLM-Base-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-3b | 3B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b | 3B | StableLM-Base-Alpha-3B | 16 | - | 16 | - | 632K | - |
ChatGLM-6B | zeng2023glm-130b,du2022glm | https://huggingface.co/THUDM/chatglm-6b | 6B | - | 28 | 28 | 28 | 1T tokens | ✓ | ✓ |
DoctorGLM | xiong2023doctorglm | https://github.com/xionghonglin/DoctorGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 6.38M | - |
ChatGLM-Med | ChatGLM-Med | https://github.com/SCIR-HI/Med-ChatGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 8K | - |
GPT3-Curie | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 6.7B | - | 32 | - | 32 | - | - | - |
MPT-7B-Chat | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-chat | 6.7B | MPT-7B | 32 | - | 32 | - | 360K | - |
MPT-7B-Instruct | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-instruct | 6.7B | MPT-7B | 32 | - | 32 | - | 59.3K | - |
MPT-7B-StoryWriter-65k+ | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-storywriter | 6.7B | MPT-7B | 32 | - | 32 | - | ✓ | - |
Dolly-v2-7b | 2023dolly | https://huggingface.co/databricks/dolly-v2-7b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 15K | - |
h2ogpt-oig-oasst1-512-6.9b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6.9b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 398K | - |
Pythia-6.9B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-6.9b | 6.9B | - | 32 | - | 32 | 300B tokens | - | - |
Alpaca-7B | alpaca | https://huggingface.co/tatsu-lab/alpaca-7b-wdiff | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Alpaca-LoRA-7B | 2023alpacalora | https://huggingface.co/tloen/alpaca-lora-7b | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Baize-7B | xu2023baize | https://huggingface.co/project-baize/baize-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 263K | - |
Baize Healthcare-7B | xu2023baize | https://huggingface.co/project-baize/baize-healthcare-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 201K | - |
ChatDoctor | yunxiang2023chatdoctor | https://github.com/Kent0n-Li/ChatDoctor | 7B | LLaMA-7B | 32 | - | 32 | - | 167K | - |
HuaTuo | wang2023huatuo | https://github.com/scir-hi/huatuo-llama-med-chinese | 7B | LLaMA-7B | 32 | - | 32 | - | 8K | - |
Koala-7B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 7B | LLaMA-7B | 32 | - | 32 | - | 472K | - |
LLaMA-7B | touvron2023llama | https://huggingface.co/decapoda-research/llama-7b-hf | 7B | - | 32 | - | 32 | 1T tokens | - | - |
Luotuo-lora-7b-0.3 | luotuo | https://huggingface.co/silk-road/luotuo-lora-7b-0.3 | 7B | LLaMA-7B | 32 | - | 32 | - | 152K | - |
StableLM-Base-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-7b | 7B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b | 7B | StableLM-Base-Alpha-7B | 16 | - | 16 | - | 632K | - |
Vicuna-7b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 7B | LLaMA-7B | 32 | - | 32 | - | 70K | - |
BELLE-7B-0.2M /0.6M /1M /2M | belle2023exploring | https://huggingface.co/BelleGroup/BELLE-7B-2M | 7.1B | Bloomz-7b1-mt | 30 | - | 30 | - | 0.2M/0.6M/1M/2M | - |
BLOOM-7b1 | scao2022bloom | https://huggingface.co/bigscience/bloom-7b1 | 7.1B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-7b1 /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-7b1-p3 | 7.1B | BLOOM-7b1 | 30 | - | 30 | - | 4.19B tokens | - |
Dolly-v2-12b | 2023dolly | https://huggingface.co/databricks/dolly-v2-12b | 12B | Pythia-12B | 36 | - | 36 | - | 15K | - |
h2ogpt-oasst1-512-12b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b | 12B | Pythia-12B | 36 | - | 36 | - | 94.6K | - |
Open-Assistant-SFT-4-12B | 2023openassistant | https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | 12B | Pythia-12B-deduped | 36 | - | 36 | - | 161K | - |
Pythia-12B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-12b | 12B | - | 36 | - | 36 | 300B tokens | - | - |
Baize-13B | xu2023baize | https://huggingface.co/project-baize/baize-lora-13B | 13B | LLaMA-13B | 40 | - | 40 | - | 263K | - |
Koala-13B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 13B | LLaMA-13B | 40 | - | 40 | - | 472K | - |
LLaMA-13B | touvron2023llama | https://huggingface.co/decapoda-research/llama-13b-hf | 13B | - | 40 | - | 40 | 1T tokens | - | - |
StableVicuna-13B | 2023StableLM | https://huggingface.co/CarperAI/stable-vicuna-13b-delta | 13B | Vicuna-13B v0 | 40 | - | 40 | - | 613K | ✓ |
Vicuna-13b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 13B | LLaMA-13B | 40 | - | 40 | - | 70K | - |
moss-moon-003-sft | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.1M | - |
moss-moon-003-sft-plugin | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft-plugin | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.4M | - |
GPT-NeoX-20B | gptneox | https://huggingface.co/EleutherAI/gpt-neox-20b | 20B | - | 44 | - | 44 | 825GB | - | - |
h2ogpt-oasst1-512-20b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b | 20B | GPT-NeoX-20B | 44 | - | 44 | - | 94.6K | - |
Baize-30B | xu2023baize | https://huggingface.co/project-baize/baize-lora-30B | 33B | LLaMA-30B | 60 | - | 60 | - | 263K | - |
LLaMA-30B | touvron2023llama | https://huggingface.co/decapoda-research/llama-30b-hf | 33B | - | 60 | - | 60 | 1.4T tokens | - | - |
LLaMA-65B | touvron2023llama | https://huggingface.co/decapoda-research/llama-65b-hf | 65B | - | 80 | - | 80 | 1.4T tokens | - | - |
GPT3-Davinci | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 175B | - | 96 | - | 96 | 300B tokens | - | - |
BLOOM | scao2022bloom | https://huggingface.co/bigscience/bloom | 176B | - | 70 | - | 70 | 366B tokens | - | - |
BLOOMZ /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-p3 | 176B | BLOOM | 70 | - | 70 | - | 2.09B tokens | - |
ChatGPT (2023.05.01) | openaichatgpt | https://platform.openai.com/docs/models/gpt-3-5 | - | GPT-3.5 | - | - | - | - | ✓ | ✓ |
GPT-4 (2023.05.01) | openai2023gpt4 | https://platform.openai.com/docs/models/gpt-4 | - | - | - | - | - | - | ✓ | ✓ |
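Many of the instruction-tuned checkpoints in the table above (Alpaca-LoRA, the Baize models, Luotuo) are LoRA adapters trained on top of a frozen base model. A minimal sketch of attaching a LoRA adapter with the `peft` library; the rank, alpha, and target modules below are illustrative defaults, not the exact settings those projects used.

```python
# Minimal sketch of wrapping a causal LM with a LoRA adapter via peft.
# Hyperparameters are illustrative, not the settings used by Alpaca-LoRA or Baize.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_config = LoraConfig(
    r=8,                        # adapter rank
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection(s) to adapt; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```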
- Evaluating Language Models by OpenAI, DeepMind, Google, Microsoft.
- Awesome LLM - A curated list of papers about large language models.
- Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model.
- awesome-chatgpt-prompts-zh - A Chinese collection of prompt examples to be used with the ChatGPT model.
- Awesome ChatGPT - Curated list of resources for ChatGPT and GPT-3 from OpenAI.
- Chain-of-Thoughts Papers - A trend starting from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
- Instruction-Tuning-Papers - A trend starting from Natural-Instruction (ACL 2022), FLAN (ICLR 2022) and T0 (ICLR 2022).
- LLM Reading List - A paper & resource list of large language models.
- Reasoning using Language Models - Collection of papers and resources on Reasoning using Language Models.
- Chain-of-Thought Hub - Measuring LLMs' Reasoning Performance
- Awesome GPT - A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
- Awesome GPT-3 - a collection of demos and articles about the OpenAI GPT-3 API.
This project follows the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you find this project helpful, please cite it.
@misc{junwang2023,
author = {Jun Wang and Changyu Hou and Xiaorui Wang and Pengyong Li and Jingjing Gong and Chen Song and Peng Gao and Qi Shen and Guotong Xie},
title = {Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers for Large Language Models Evaluation},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/onejune2018/Awesome-LLM-Eval}},
}
Author's bio:
- Intro: Responsible for AI Platform Algorithm R&D at PA; previously at IBM, PKU, CAS, and ETH
- Research: Graph/CV, DL, LLM, Remote Sensing, etc. Co-first author of the Large Graph Model (MPG) for Drug Discovery
- Honors: First place in several international competitions such as SemEval2022, MIT AI-Cure, VQA2021, TREC2021, and EAD2019
- Homepage: https://onejune2018.github.io/homepage/
- Google Scholar: https://scholar.google.com/citations?user=0Be01PgAAAAJ&hl=en