Awesome-LLM-Eval

English | 中文

Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluating Large Language Models and exploring the boundaries and limits of Generative AI.

Table of Contents

  • News
  • Tools
  • Datasets / Benchmark
  • Demos
  • Leaderboards
  • Papers
  • LLM List
  • Others
  • Licenses
  • Citation

Tools

| Name | Institute | Link | Description |
| --- | --- | --- | --- |
| LLM Comparator | Google | LLM Comparator | A visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. It supports interactive workflows for understanding when and why a model performs better or worse than a baseline, and how the two models' responses differ qualitatively. (2024-02-16) |
| Evals | OpenAI | https://github.com/openai/evals | - |
| lm-evaluation-harness | EleutherAI | lm-evaluation-harness | - |
| phasellm | wgryc | phasellm | Large language model evaluation and workflow framework from Phase AI. |
| LLMZoo | FreedomIntelligence | LLMZoo | Evaluation benchmark for large language models. |
| HELM | Stanford | HELM | Holistic Evaluation of Language Models. |
| auto-evaluator | Langchain | auto-evaluator | A lightweight evaluation tool for question answering. |
| PandaLM | WeOpenML | PandaLM | Reproducible and automated language model assessment. |
| FlagEval | Tsinghua University | FlagEval | - |
| AlpacaEval | tatsu-lab | AlpacaEval | - |
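Most of the harness-style tools above are driven from a CLI or a small Python entry point. As a minimal, hedged sketch (assuming EleutherAI's lm-evaluation-harness v0.4+ and its `simple_evaluate` helper; check the repository for the exact interface of the version you install), evaluating a Hugging Face model on a single task might look like this:

```python
# Minimal sketch: scoring a Hugging Face causal LM on one benchmark task with
# EleutherAI's lm-evaluation-harness. Assumes lm-eval >= 0.4; the exact API
# may differ across versions -- consult the repository's README.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1b",   # any HF causal LM id
    tasks=["hellaswag"],                            # task names registered in the harness
    num_fewshot=0,
    batch_size=8,
)

# `results["results"]` maps task name -> metric dict (e.g. acc, acc_norm).
print(results["results"]["hellaswag"])
```

Most of the other frameworks in the table follow a broadly similar pattern of a task registry plus a model backend, though their concrete APIs differ.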

Datasets / Benchmark

| Name | Institution | Website | Description |
| --- | --- | --- | --- |
| TrustLLM Benchmark | TrustLLM | TrustLLM | A benchmark for assessing the trustworthiness of large language models. It covers six dimensions of trustworthiness and includes over 30 datasets, ranging from simple classification to complex generation tasks. Each dataset presents unique challenges, and the benchmark has been used to evaluate 16 mainstream LLMs, both commercial and open-source. |
| M3Exam | DAMO | M3Exam | A multilingual, multimodal, multilevel benchmark for examining large language models. |
| KoLA | THU-KEG | KoLA | The Knowledge-oriented LLM Assessment benchmark (KoLA), hosted by the Knowledge Engineering Group of Tsinghua University (THU-KEG), benchmarks LLMs' world knowledge through carefully designed data, an ability taxonomy, and evaluation metrics. |
| PromptBench | Microsoft | promptbench | A tool for scrutinizing and analyzing how large language models interact with prompts. It simulates black-box adversarial prompt attacks and evaluates model performance; the repository provides code, datasets, and instructions for the experiments. |
| OpenCompass | Shanghai AI Lab | OpenCompass | An LLM evaluation platform supporting 20+ models on 50+ datasets for comprehensive benchmarking, using efficient distributed evaluation techniques. |
| JioNLP-LLM Evaluation Dataset | jionlp | JioNLP-LLM Evaluation Dataset | Evaluates general LLM performance, focusing on how well models assist users and whether they reach the level of a "smart assistant." It includes multiple-choice questions from various professional exams as well as subjective questions covering common LLM functions. |
| BIG-bench | Google | BIG-bench | Consists of 204 tasks spanning linguistics, childhood development, mathematics, commonsense reasoning, biology, physics, societal bias, software development, and more. |
| BIG-Bench-Hard | Stanford NLP | BIG-Bench-Hard | Contains 23 challenging tasks on which prior model evaluations did not surpass average human-rater performance. |
| SuperCLUE | CLUEbenchmark | SuperCLUE | A Chinese benchmark covering basic, professional, and Chinese-specific abilities, with tasks in semantic understanding, dialogue, logical reasoning, role simulation, coding, and more. |
| Safety Eval | Tsinghua University | Safety Eval - Safety Large Model Evaluation | A safety evaluation set by Tsinghua University covering hate speech, prejudice, crime, privacy, ethics, and more, organized into 40+ safety categories. |
| GAOKAO-Bench | OpenLMLab | GAOKAO-Bench | Evaluates the language understanding and logical reasoning abilities of large models using Chinese college entrance examination (Gaokao) questions. |
| Gaokao | ExpressAI | Gaokao | The "GaoKao Benchmark" aims to assess and track progress toward human-level intelligence, providing a comprehensive evaluation across tasks and domains for comparison with human performance. |
| MMLU | paperswithcode.com | MMLU | Covers 57 subjects across STEM, the humanities, and the social sciences, ranging from elementary to professional level. |
| CMMLU | MBZUAI & Shanghai Jiao Tong University & Microsoft | CMMLU | Measures massive multitask language understanding in Chinese. |
| MMCU | Oracle AI Research | MMCU | Evaluates Chinese large models in the medical, legal, psychological, and educational domains. |
| AGIEval | Microsoft Research | AGIEval | Comprehensively evaluates foundation models' cognitive and problem-solving abilities using official entrance and professional qualification exams. |
| C_Eval | SJTU, Tsinghua, University of Edinburgh | C_Eval | Evaluates models' higher-level knowledge and reasoning abilities across 52 disciplines. |
| XieZhi | Fudan University | XieZhi | A comprehensive evaluation suite for language models, spanning multiple disciplines and difficulty levels. |
| MT-bench | Multiple universities | MT-bench | A benchmark of 80 high-quality multi-turn questions designed to test multi-turn conversation and instruction-following ability. |
| GLUE Benchmark | Multiple institutions | GLUE Benchmark | Evaluates performance on tasks such as grammaticality, paraphrasing, text similarity, inference, textual entailment, and pronoun resolution. |
| OpenAI Moderation API | OpenAI | OpenAI Moderation API | Filters harmful or unsafe content. |
| GSM8K | OpenAI | GSM8K | A dataset of linguistically diverse grade-school math word problems that tests mathematical problem-solving ability. |
| EleutherAI LM Eval | EleutherAI | EleutherAI LM Eval | Evaluates model performance on few-shot tasks and after fine-tuning across many tasks. |
| OpenAI Evals | OpenAI | OpenAI Evals | Evaluates generated text for accuracy, diversity, consistency, robustness, transferability, efficiency, and fairness. |
| AlpacaEval | tatsu-lab | AlpacaEval | An automatic evaluation based on the AlpacaFarm evaluation set that compares model responses with reference answers. |
| Adversarial NLI (ANLI) | Facebook AI Research et al. | Adversarial NLI (ANLI) | Evaluates model robustness, generalization, inference explanations, and efficiency under adversarial examples. |
| LIT (Language Interpretability Tool) | Google | LIT | A platform to evaluate and analyze model strengths, weaknesses, and potential biases based on user-defined metrics. |
| ParlAI | Facebook AI Research | ParlAI | Evaluates model performance in terms of accuracy, F1 score, perplexity, human ratings, speed, robustness, and generalization. |
| CoQA | Stanford NLP Group | CoQA | Evaluates models' ability to comprehend passages and answer related questions in a conversational context. |
| LAMBADA | University of Trento, Fondazione Bruno Kessler | LAMBADA | Measures long-range understanding by asking models to predict the last word of a paragraph. |
| HellaSwag | University of Washington, Allen Institute for AI | HellaSwag | Evaluates models' commonsense reasoning through adversarial sentence-completion tasks. |
| LogiQA | Tsinghua University, Microsoft Research Asia | LogiQA | Evaluates models' logical reasoning abilities. |
| MultiNLI | Multiple institutions | MultiNLI | Evaluates models' ability to understand relationships between sentences across different genres. |
| SQuAD | Stanford NLP Group | SQuAD | Evaluates models' reading-comprehension abilities. |
| Open LLM Leaderboard | HuggingFace | Leaderboard | HuggingFace's LLM evaluation leaderboard, covering the AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA datasets. |
| chinese-llm-benchmark | jeinlee1991 | llm-benchmark | A Chinese LLM benchmark covering a range of open-source models with multidimensional evaluations. |
| AlpacaEval | tatsu-lab | AlpacaEval | LLM-based automatic evaluation of open-source models' performance. |
| Huggingface Open LLM Leaderboard | huggingface | HF Open LLM Leaderboard | Evaluates open-source models on four evaluation sets: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA. |
| lmsys-arena | Berkeley | lmsys Ranking | Rankings based on the Elo rating mechanism: GPT-4 > Claude > GPT-3.5 > Vicuna > others. |
| CMU Open-Source Chatbot Evaluation | CMU | zeno-build | Evaluates models in dialogue scenarios, ranking ChatGPT > Vicuna > others. |
| Z-Bench Chinese Evaluation | ZhenFund | Z-Bench | Evaluates Chinese models; differences among them are minor, with clear improvements in newer ChatGLM-6B versions. |
| Chain-of-thought Evaluation | Yao Fu | COT Evaluation | Rankings covering GSM8K, MATH, and other complex reasoning problems. |
| InfoQ Large Model Comprehensive Evaluation | InfoQ | InfoQ Evaluation | A Chinese-oriented ranking that includes ChatGPT, 文心一言, Claude, and 星火. |
| ToolBench Tool Invocation Evaluation | BAAI / Tsinghua | ToolBench | Compares the performance of models fine-tuned for tool use against ChatGPT. |
| AgentBench Inference & Decision Evaluation | THUDM | AgentBench | Evaluates models' reasoning and decision-making abilities in scenarios such as shopping, household tasks, and operating systems. |
| FlagEval | BAAI / Tsinghua | FlagEval | Provides an LLM ranking that combines subjective and objective scores. |
| ChatEval | THU-NLP | ChatEval | Eases the evaluation of generated text by having a referee team of LLM agents discuss and judge responses. |
| Zhujiu | Institute of Automation, CAS | Zhujiu | A multidimensional evaluation covering 7 ability dimensions and 51 tasks in both Chinese and English. |
| LucyEval | Oracle | LucyEval | Evaluates the maturity of Chinese large models using objective tests across a range of abilities. |
| Do-Not-Answer | Libr-AI | Do-Not-Answer | An open-source dataset for evaluating LLMs' safety mechanisms at low cost. It is curated and filtered to contain only prompts that responsible language models should not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 judgments. |
| ColossalEval | Colossal-AI | ColossalEval | Provides a unified pipeline for evaluating language models on public datasets or your own data, using both classic metrics and GPT-assisted scoring. |
| SmartPlay | Microsoft | SmartPlay | A benchmark for LLMs designed to be easy to use and to provide a wide variety of games for testing agents. |
| LVLM-eHub | OpenGVLab | LVLM-eHub | The Multi-Modality Arena is an evaluation platform for large multi-modality models. Following FastChat, two anonymous models are compared side by side on visual question answering, with images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more. |
| BLURB | Mindrank AI | BLURB | BLURB comprises a comprehensive benchmark for PubMed-based biomedical NLP together with a leaderboard for tracking community progress. It includes thirteen publicly available datasets across six diverse tasks. To avoid placing undue emphasis on tasks with many available datasets, such as named entity recognition (NER), BLURB reports the macro average across all tasks as the main score. The leaderboard is model-agnostic: any system that can produce test predictions using the same training and development data can participate. |
| SWE-bench | princeton-nlp | SWE-bench | A benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, the model must generate a patch that resolves the described problem. |
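Many of the benchmarks above are distributed as Hugging Face datasets, so a bare-bones scoring loop usually amounts to loading a split, prompting a model, and comparing predictions with gold labels. Below is a minimal sketch for one MMLU subject; it assumes the `cais/mmlu` dataset layout (`question`, `choices`, `answer`) and uses a placeholder `ask_model` function that you would replace with a real LLM call.

```python
# Sketch: exact-match accuracy on one MMLU subject.
# Assumes the `cais/mmlu` dataset layout (question / choices / answer);
# `ask_model` is a placeholder for whatever LLM call you actually use.
from datasets import load_dataset

LETTERS = "ABCD"

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return 'A'..'D'."""
    raise NotImplementedError

ds = load_dataset("cais/mmlu", "abstract_algebra", split="test")

correct = 0
for row in ds:
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"]))
    prompt = f"{row['question']}\n{options}\nAnswer with a single letter:"
    pred = ask_model(prompt).strip().upper()[:1]
    correct += int(pred == LETTERS[row["answer"]])   # `answer` is an index 0-3

print(f"accuracy = {correct / len(ds):.3f}")
```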

Demos

  • Chat Arena: anonymous models side by side, with votes for the better response - an open-source AI "anonymous" arena. You act as the referee and rate the responses of two models whose names you don't know in advance; after scoring, their real identities are revealed. Current "participants" include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more; a sketch of the Elo-style vote aggregation used by such arenas follows.
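Arena-style demos like this one (and the lmsys ranking in the table above) aggregate pairwise human votes into a rating. The sketch below shows the textbook Elo update for such votes; it is illustrative only, since Chatbot Arena's published leaderboard uses a more elaborate statistical fit rather than plain Elo.

```python
# Illustrative Elo update from pairwise "which answer is better" votes.
# This is the textbook Elo rule, not the exact method used by any leaderboard.
def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a: 1.0 if A wins the vote, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.0), ("model_a", "model_b", 1.0)]
for a, b, s in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], s)
print(ratings)
```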

Leaderboards

Performance from XieZhi-202306

Models | MMLU (0/1/3-shot) | CEval (0/1/3-shot) | M3KE (0-shot) | Xiezhi-Spec.-Chinese (0/1/3-shot) | Xiezhi-Inter.-Chinese (0/1/3-shot) | Xiezhi-Spec.-English (0/1/3-shot) | Xiezhi-Inter.-English (0/1/3-shot)
(Each row lists 19 scores in this column order: three shot settings per benchmark, except M3KE, which is 0-shot only.)
Random-Guess 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089 0.089
Ranking by generation probability
Bloomz-560m 0.111 0.109 0.119 0.124 0.117 0.103 0.126 0.123 0.127 0.124 0.130 0.138 0.140 0.113 0.116 0.123 0.124 0.117 0.160
Bloomz-1b1 0.131 0.116 0.128 0.107 0.115 0.110 0.082 0.138 0.108 0.107 0.117 0.125 0.123 0.130 0.119 0.114 0.144 0.129 0.145
Bloomz-1b7 0.107 0.117 0.164 0.054 0.058 0.103 0.102 0.165 0.151 0.159 0.152 0.214 0.170 0.133 0.140 0.144 0.150 0.149 0.209
Bloomz-3b 0.139 0.084 0.146 0.168 0.182 0.194 0.063 0.186 0.154 0.168 0.151 0.180 0.182 0.201 0.155 0.156 0.175 0.164 0.158
Bloomz-7b1 0.167 0.160 0.205 0.074 0.072 0.073 0.073 0.154 0.178 0.162 0.148 0.160 0.156 0.176 0.153 0.207 0.217 0.204 0.229
Bloomz-7b1-mt 0.189 0.196 0.210 0.077 0.078 0.158 0.072 0.163 0.175 0.154 0.155 0.195 0.164 0.180 0.146 0.219 0.228 0.171 0.232
Bloomz-7b1-p3 0.066 0.059 0.075 0.071 0.070 0.072 0.081 0.177 0.198 0.158 0.183 0.173 0.170 0.130 0.130 0.162 0.157 0.132 0.134
Bloomz 0.051 0.066 0.053 0.142 0.166 0.240 0.098 0.185 0.133 0.277 0.161 0.099 0.224 0.069 0.082 0.056 0.058 0.055 0.049
Bloomz-mt 0.266 0.264 0.248 0.204 0.164 0.151 0.161 0.253 0.198 0.212 0.213 0.189 0.184 0.379 0.396 0.394 0.383 0.405 0.398
Bloomz-p3 0.115 0.093 0.057 0.118 0.137 0.140 0.115 0.136 0.095 0.105 0.086 0.065 0.098 0.139 0.097 0.069 0.176 0.141 0.070
llama-7b 0.125 0.132 0.093 0.133 0.106 0.110 0.158 0.152 0.141 0.117 0.142 0.135 0.128 0.159 0.165 0.161 0.194 0.183 0.176
llama-13b 0.166 0.079 0.135 0.152 0.181 0.169 0.131 0.133 0.241 0.243 0.211 0.202 0.303 0.154 0.183 0.215 0.174 0.216 0.231
llama-30b 0.076 0.107 0.073 0.079 0.119 0.082 0.079 0.140 0.206 0.162 0.186 0.202 0.183 0.110 0.195 0.161 0.088 0.158 0.219
llama-65b 0.143 0.121 0.100 0.154 0.141 0.168 0.125 0.142 0.129 0.084 0.108 0.077 0.077 0.183 0.204 0.172 0.133 0.191 0.157
baize-7b (LoRA) 0.129 0.091 0.079 0.194 0.180 0.206 0.231 0.216 0.148 0.123 0.173 0.158 0.198 0.182 0.190 0.194 0.218 0.188 0.209
baize-7b-healthcare (LoRA) 0.130 0.121 0.106 0.178 0.174 0.178 0.203 0.178 0.146 0.123 0.266 0.107 0.118 0.175 0.164 0.173 0.197 0.231 0.198
baize-13b (LoRA) 0.131 0.111 0.171 0.184 0.178 0.195 0.155 0.158 **0.221** 0.256 0.208 0.200 0.219 0.176 0.189 0.239 0.187 0.185 0.274
baize-30b (LoRA) 0.193 0.216 0.207 0.191 0.196 0.121 0.071 0.109 0.212 0.190 0.203 0.256 0.200 0.167 0.235 0.168 0.072 0.180 0.193
Belle-0.2M 0.127 0.148 0.243 0.053 0.063 0.136 0.076 0.172 0.126 0.153 0.171 0.165 0.147 0.206 0.146 0.148 0.217 0.150 0.173
Belle-0.6M 0.091 0.114 0.180 0.082 0.080 0.090 0.075 0.188 0.149 0.198 0.188 0.188 0.175 0.173 0.172 0.183 0.193 0.184 0.196
Belle-1M 0.137 0.126 0.162 0.066 0.065 0.072 0.066 0.170 0.152 0.147 0.173 0.176 0.197 0.211 0.137 0.149 0.207 0.151 0.185
Belle-2M 0.127 0.148 0.132 0.058 0.063 0.136 0.057 0.163 0.166 0.130 0.159 0.177 0.163 0.155 0.106 0.166 0.151 0.150 0.138
chatglm-6B 0.099 0.109 0.112 0.084 0.074 0.114 0.115 0.082 0.097 0.147 0.104 0.111 0.144 0.106 0.120 0.124 0.099 0.079 0.097
doctorglm-6b 0.093 0.076 0.065 0.037 0.085 0.051 0.038 0.062 0.068 0.044 0.047 0.056 0.043 0.069 0.053 0.043 0.106 0.059 0.059
moss-base-16B 0.072 0.050 0.062 0.115 0.048 0.052 0.099 0.105 0.051 0.059 0.123 0.054 0.058 0.124 0.077 0.080 0.121 0.058 0.063
moss-sft-16B 0.064 0.065 0.051 0.063 0.062 0.072 0.075 0.072 0.067 0.068 0.073 0.081 0.066 0.071 0.070 0.059 0.074 0.084 0.075
vicuna-7b 0.051 0.051 0.029 0.063 0.071 0.064 0.059 0.169 0.171 0.165 0.134 0.201 0.213 0.182 0.209 0.195 0.200 0.214 0.182
vicuna-13b 0.109 0.104 0.066 0.060 0.131 0.131 0.067 0.171 0.167 0.166 0.143 0.147 0.178 0.121 0.139 0.128 0.158 0.174 0.191
alpaca-7b 0.135 0.170 0.202 0.137 0.119 0.113 0.142 0.129 0.139 0.123 0.178 0.104 0.097 0.189 0.179 0.128 0.200 0.185 0.149
pythia-1.4b 0.124 0.127 0.121 0.108 0.132 0.138 0.083 0.125 0.128 0.135 0.111 0.146 0.135 0.158 0.124 0.124 0.166 0.126 0.118
pythia-2.8b 0.103 0.110 0.066 0.064 0.089 0.122 0.086 0.114 0.120 0.131 0.091 0.113 0.112 0.126 0.118 0.112 0.110 0.145 0.107
pythia-6.9b 0.115 0.070 0.084 0.078 0.073 0.094 0.073 0.086 0.094 0.092 0.097 0.098 0.085 0.091 0.088 0.083 0.099 0.099 0.096
pythia-12b 0.075 0.059 0.066 0.077 0.097 0.078 0.098 0.102 0.126 0.132 0.125 0.147 0.159 0.079 0.098 0.110 0.094 0.120 0.120
gpt-neox-20b 0.081 0.132 0.086 0.086 0.096 0.069 0.094 0.140 0.103 0.109 0.120 0.098 0.085 0.088 0.101 0.116 0.099 0.113 0.156
h2ogpt-12b 0.075 0.087 0.078 0.080 0.078 0.094 0.070 0.065 0.047 0.073 0.076 0.061 0.091 0.088 0.050 0.065 0.105 0.063 0.067
h2ogpt-20b 0.114 0.098 0.110 0.094 0.084 0.061 0.096 0.108 0.080 0.073 0.086 0.081 0.072 0.108 0.068 0.086 0.109 0.071 0.079
dolly-3b 0.066 0.060 0.055 0.079 0.083 0.077 0.066 0.100 0.090 0.083 0.091 0.093 0.085 0.079 0.063 0.077 0.076 0.074 0.084
dolly-7b 0.095 0.068 0.052 0.091 0.079 0.070 0.108 0.108 0.089 0.092 0.111 0.095 0.100 0.096 0.059 0.086 0.123 0.085 0.090
dolly-12b 0.095 0.068 0.093 0.085 0.071 0.073 0.114 0.098 0.106 0.103 0.094 0.114 0.106 0.086 0.088 0.098 0.088 0.102 0.116
stablelm-3b 0.070 0.085 0.071 0.086 0.082 0.099 0.096 0.101 0.087 0.091 0.083 0.092 0.067 0.069 0.089 0.081 0.066 0.085 0.088
stablelm-7b 0.158 0.118 0.093 0.133 0.102 0.093 0.140 0.085 0.118 0.122 0.123 0.130 0.095 0.123 0.103 0.100 0.134 0.121 0.105
falcon-7b 0.048 0.046 0.051 0.046 0.051 0.052 0.050 0.077 0.096 0.112 0.129 0.141 0.142 0.124 0.103 0.107 0.198 0.200 0.205
falcon-7b-instruct 0.078 0.095 0.106 0.114 0.095 0.079 0.104 0.075 0.083 0.087 0.060 0.133 0.123 0.160 0.203 0.156 0.141 0.167 0.152
falcon-40b 0.038 0.043 0.077 0.085 0.090 0.129 0.087 0.069 0.056 0.053 0.065 0.063 0.058 0.059 0.077 0.066 0.085 0.063 0.076
falcon-40b-instruct 0.126 0.123 0.121 0.070 0.080 0.068 0.141 0.103 0.085 0.079 0.115 0.082 0.081 0.118 0.143 0.124 0.083 0.108 0.104
Ranking by instruction
ChatGPT 0.240 0.298 0.371 0.286 0.289 0.360 0.290 0.218 0.352 0.414 0.266 0.418 0.487 0.217 0.361 0.428 0.305 0.452 0.517
GPT-4 0.402 0.415 0.517 0.413 0.410 0.486 0.404 0.392 0.429 0.490 0.453 0.496 0.565 0.396 0.434 0.495 0.463 0.506 0.576
Statistics
Performance-Average 0.120 0.117 0.125 0.113 0.114 0.124 0.111 0.140 0.140 0.145 0.144 0.148 0.152 0.145 0.145 0.150 0.156 0.157 0.166
Performance-Variance 0.062 0.068 0.087 0.067 0.065 0.078 0.064 0.058 0.070 0.082 0.067 0.082 0.095 0.067 0.080 0.090 0.078 0.092 0.104
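In the table above, the rows under "Ranking by generation probability" score each multiple-choice option by the log-likelihood the model assigns to it and pick the highest-scoring option, while the rows under "Ranking by instruction" prompt the model to answer directly. The sketch below illustrates the log-likelihood scoring idea with `transformers`; it is not the XieZhi authors' exact protocol, and the model id, prompt format, and lack of length normalization are all assumptions.

```python
# Sketch: pick the answer option a causal LM assigns the highest
# log-likelihood to, given the question as context. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1b"   # any HF causal LM; illustrative choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

@torch.no_grad()
def option_logprob(question: str, option: str) -> float:
    """Log-probability of generating `option` given `question` as context."""
    ctx_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full = tok(question + " " + option, return_tensors="pt").input_ids
    logits = model(full).logits[0, :-1]          # logits predicting tokens 1..L-1
    targets = full[0, 1:]
    token_logp = torch.log_softmax(logits, dim=-1)[
        torch.arange(targets.shape[0]), targets
    ]
    return token_logp[ctx_len - 1:].sum().item() # keep only the option's tokens

question = "Question: 2 + 2 = ?  Answer:"
options = ["3", "4", "5"]
scores = [option_logprob(question, o) for o in options]
print(options[max(range(len(options)), key=lambda i: scores[i])])
```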

Papers

LLM List

Pre-trained LLM

| Model | Size | Architecture | Access | Date | Origin |
| --- | --- | --- | --- | --- | --- |
| Switch Transformer | 1.6T | Decoder (MoE) | - | 2021-01 | Paper |
| GLaM | 1.2T | Decoder (MoE) | - | 2021-12 | Paper |
| PaLM | 540B | Decoder | - | 2022-04 | Paper |
| MT-NLG | 530B | Decoder | - | 2022-01 | Paper |
| J1-Jumbo | 178B | Decoder | api | 2021-08 | Paper |
| OPT | 175B | Decoder | api / ckpt | 2022-05 | Paper |
| BLOOM | 176B | Decoder | api / ckpt | 2022-11 | Paper |
| GPT 3.0 | 175B | Decoder | api | 2020-05 | Paper |
| LaMDA | 137B | Decoder | - | 2022-01 | Paper |
| GLM | 130B | Decoder | ckpt | 2022-10 | Paper |
| YaLM | 100B | Decoder | ckpt | 2022-06 | Blog |
| LLaMA | 65B | Decoder | ckpt | 2022-09 | Paper |
| GPT-NeoX | 20B | Decoder | ckpt | 2022-04 | Paper |
| UL2 | 20B | agnostic | ckpt | 2022-05 | Paper |
| 鹏程.盘古α | 13B | Decoder | ckpt | 2021-04 | Paper |
| T5 | 11B | Encoder-Decoder | ckpt | 2019-10 | Paper |
| CPM-Bee | 10B | Decoder | api | 2022-10 | Paper |
| RWKV-4 | 7B | RWKV | ckpt | 2022-09 | Github |
| GPT-J | 6B | Decoder | ckpt | 2022-09 | Github |
| GPT-Neo | 2.7B | Decoder | ckpt | 2021-03 | Github |
| GPT-Neo | 1.3B | Decoder | ckpt | 2021-03 | Github |

Instruction finetuned LLM

| Model | Size | Architecture | Access | Date | Origin |
| --- | --- | --- | --- | --- | --- |
| Flan-PaLM | 540B | Decoder | - | 2022-10 | Paper |
| BLOOMZ | 176B | Decoder | ckpt | 2022-11 | Paper |
| InstructGPT | 175B | Decoder | api | 2022-03 | Paper |
| Galactica | 120B | Decoder | ckpt | 2022-11 | Paper |
| OpenChatKit | 20B | - | ckpt | 2023-03 | - |
| Flan-UL2 | 20B | Decoder | ckpt | 2023-03 | Blog |
| Gopher | - | - | - | - | - |
| Chinchilla | - | - | - | - | - |
| Flan-T5 | 11B | Encoder-Decoder | ckpt | 2022-10 | Paper |
| T0 | 11B | Encoder-Decoder | ckpt | 2021-10 | Paper |
| Alpaca | 7B | Decoder | demo | 2023-03 | Github |

Aligned LLM

| Model | Size | Architecture | Access | Date | Origin |
| --- | --- | --- | --- | --- | --- |
| GPT-4 | - | - | - | 2023-03 | Blog |
| ChatGPT | - | Decoder | demo / api | 2022-11 | Blog |
| Sparrow | 70B | - | - | 2022-09 | Paper |
| Claude | - | - | demo / api | 2023-03 | Blog |
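For the rows whose Access column says api, evaluation scripts talk to a hosted endpoint instead of loading weights. A small hedged sketch with the OpenAI Python SDK (v1.x) is shown below; the model name and prompt are illustrative, and Claude would require Anthropic's SDK and credentials instead.

```python
# Sketch: querying an API-access model from the "Aligned LLM" table.
# Uses the OpenAI Python SDK (>= 1.0); model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name three common LLM evaluation benchmarks."}],
    temperature=0,
)
print(resp.choices[0].message.content)
```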

Open LLM

  • LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA

    • Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
    • Flan-Alpaca - Instruction Tuning from Humans and Machines.
    • Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
    • Cabrita - A Portuguese instruction-finetuned LLaMA.
    • Vicuna - An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality.
    • Llama-X - Open Academic Research on Improving LLaMA to SOTA LLM.
    • Chinese-Vicuna - A Chinese Instruction-following LLaMA-based Model.
    • GPTQ-for-LLaMA - 4 bits quantization of LLaMA using GPTQ.
    • GPT4All - Demo, data, and code to train open-source, assistant-style large language models based on GPT-J and LLaMA.
    • Koala - A Dialogue Model for Academic Research
    • BELLE - Be Everyone's Large Language model Engine
    • StackLLaMA - A hands-on guide to train LLaMA with RLHF.
    • RedPajama - An Open Source Recipe to Reproduce LLaMA training dataset.
    • Chimera - Latin Phoenix.
  • BLOOM - BigScience Large Open-science Open-access Multilingual Language Model BLOOM-LoRA

    • BLOOMZ&mT0 - a family of models capable of following human instructions in dozens of languages zero-shot.
    • Phoenix
  • T5 - Text-to-Text Transfer Transformer

    • T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization
  • OPT - Open Pre-trained Transformer Language Models.

  • UL2 - a unified framework for pretraining models that are universally effective across datasets and setups.

  • GLM - GLM is a general language model pretrained with an autoregressive blank-filling objective, which can be fine-tuned on various natural language understanding and generation tasks.

    • ChatGLM-6B: ChatGLM-6B is an open-source bilingual conversation language model that supports both Chinese and English. It's built upon the General Language Model (GLM) architecture and has 6.2 billion parameters.
    • ChatGLM2-6B: The second-generation version of the open-source bilingual dialogue model ChatGLM-6B. ChatGLM2-6B retains the excellent features of the first-generation model, such as smooth conversations and low deployment thresholds, while introducing longer context, better performance, and more efficient inference. This project is licensed under the MIT License.
  • RWKV - Parallelizable RNN with Transformer-level LLM Performance.

    • ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
  • StableLM - Stability AI Language Models.

  • YaLM - a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.

  • GPT-Neo - An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.

  • GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile.

    • Dolly - a cheap-to-build LLM that exhibits a surprising degree of the instruction following capabilities exhibited by ChatGPT.
  • Pythia - Interpreting Autoregressive Transformers Across Time and Scale

    • Dolly 2.0 - the first open source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
  • OpenFlamingo - an open-source reproduction of DeepMind's Flamingo model.

  • Cerebras-GPT - A Family of Open, Compute-efficient, Large Language Models.

  • GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.

    • GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
  • Palmyra - Palmyra Base was primarily pre-trained with English text.

  • Camel - a state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.

  • h2oGPT

  • PanGu-α - PanGu-α is a 200B-parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, the MindSpore team, and Peng Cheng Laboratory.

  • MOSS - MOSS is an open-source conversational language model that supports Chinese-English bilingual dialogue and a variety of plugins.

  • Open-Assistant - a project meant to give everyone access to a great chat based large language model.

    • HuggingChat - Powered by Open Assistant's latest model – the best open source chat model right now and @huggingface Inference API.
  • Baichuan - An open-source, commercially available large-scale language model developed by Baichuan Intelligent Technology following Baichuan-7B, containing 13 billion parameters. (2023-07-15)
  • Qwen - Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. (2023-08-03)
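Several of the models in the list above (Alpaca-LoRA, Baize, Luotuo, ...) are LoRA adapters trained on top of a frozen base model rather than full fine-tunes. The sketch below shows how such an adapter is attached with the `peft` library; the base model and hyperparameters are illustrative defaults, not the settings those projects actually used.

```python
# Sketch: wrapping a causal LM with a LoRA adapter via peft.
# Hyperparameters are illustrative, not the settings used by Alpaca-LoRA/Baize.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["query_key_value"],   # module names depend on the base architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trainable
```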

Popular LLM

Model | Author | Link | #Parameters | Base Model | #Layers | #Encoder | #Decoder | #Pretrain Tokens | #IFT Samples | RLHF
GPT3-Ada brown2020language https://platform.openai.com/docs/models/gpt-3 0.35B - 24 - 24 - - -
Pythia-1B biderman2023pythia https://huggingface.co/EleutherAI/pythia-1b 1B - 16 - 16 300B tokens - -
GPT3-Babbage brown2020language https://platform.openai.com/docs/models/gpt-3 1.3B - 24 - 24 - - -
GPT2-XL radford2019language https://huggingface.co/gpt2-xl 1.5B - 48 - 48 40B tokens - -
BLOOM-1b7 scao2022bloom https://huggingface.co/bigscience/bloom-1b7 1.7B - 24 - 24 350B tokens - -
BLOOMZ-1b7 muennighoff2022crosslingual https://huggingface.co/bigscience/bloomz-1b7 1.7B BLOOM-1b7 24 - 24 - 8.39B tokens -
Dolly-v2-3b 2023dolly https://huggingface.co/databricks/dolly-v2-3b 2.8B Pythia-2.8B 32 - 32 - 15K -
Pythia-2.8B biderman2023pythia https://huggingface.co/EleutherAI/pythia-2.8b 2.8B - 32 - 32 300B tokens - -
BLOOM-3b scao2022bloom https://huggingface.co/bigscience/bloom-3b 3B - 30 - 30 350B tokens - -
BLOOMZ-3b muennighoff2022crosslingual https://huggingface.co/bigscience/bloomz-3b 3B BLOOM-3b 30 - 30 - 8.39B tokens -
StableLM-Base-Alpha-3B 2023StableLM https://huggingface.co/stabilityai/stablelm-base-alpha-3b 3B - 16 - 16 800B tokens - -
StableLM-Tuned-Alpha-3B 2023StableLM https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b 3B StableLM-Base-Alpha-3B 16 - 16 - 632K -
ChatGLM-6B zeng2023glm-130b,du2022glm https://huggingface.co/THUDM/chatglm-6b 6B - 28 28 28 1T tokens ✓ ✓
DoctorGLM xiong2023doctorglm https://github.com/xionghonglin/DoctorGLM 6B ChatGLM-6B 28 28 28 - 6.38M -
ChatGLM-Med ChatGLM-Med https://github.com/SCIR-HI/Med-ChatGLM 6B ChatGLM-6B 28 28 28 - 8K -
GPT3-Curie brown2020language https://platform.openai.com/docs/models/gpt-3 6.7B - 32 - 32 - - -
MPT-7B-Chat MosaicML2023Introducing https://huggingface.co/mosaicml/mpt-7b-chat 6.7B MPT-7B 32 - 32 - 360K -
MPT-7B-Instruct MosaicML2023Introducing https://huggingface.co/mosaicml/mpt-7b-instruct 6.7B MPT-7B 32 - 32 - 59.3K -
MPT-7B-StoryWriter-65k+ MosaicML2023Introducing https://huggingface.co/mosaicml/mpt-7b-storywriter 6.7B MPT-7B 32 - 32 - ✓ -
Dolly-v2-7b 2023dolly https://huggingface.co/databricks/dolly-v2-7b 6.9B Pythia-6.9B 32 - 32 - 15K -
h2ogpt-oig-oasst1-512-6.9b 2023h2ogpt https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6.9b 6.9B Pythia-6.9B 32 - 32 - 398K -
Pythia-6.9B biderman2023pythia https://huggingface.co/EleutherAI/pythia-6.9b 6.9B - 32 - 32 300B tokens - -
Alpaca-7B alpaca https://huggingface.co/tatsu-lab/alpaca-7b-wdiff 7B LLaMA-7B 32 - 32 - 52K -
Alpaca-LoRA-7B 2023alpacalora https://huggingface.co/tloen/alpaca-lora-7b 7B LLaMA-7B 32 - 32 - 52K -
Baize-7B xu2023baize https://huggingface.co/project-baize/baize-lora-7B 7B LLaMA-7B 32 - 32 - 263K -
Baize Healthcare-7B xu2023baize https://huggingface.co/project-baize/baize-healthcare-lora-7B 7B LLaMA-7B 32 - 32 - 201K -
ChatDoctor yunxiang2023chatdoctor https://github.com/Kent0n-Li/ChatDoctor 7B LLaMA-7B 32 - 32 - 167K -
HuaTuo wang2023huatuo https://github.com/scir-hi/huatuo-llama-med-chinese 7B LLaMA-7B 32 - 32 - 8K -
Koala-7B koala_blogpost_2023 https://huggingface.co/young-geng/koala 7B LLaMA-7B 32 - 32 - 472K -
LLaMA-7B touvron2023llama https://huggingface.co/decapoda-research/llama-7b-hf 7B - 32 - 32 1T tokens - -
Luotuo-lora-7b-0.3 luotuo https://huggingface.co/silk-road/luotuo-lora-7b-0.3 7B LLaMA-7B 32 - 32 - 152K -
StableLM-Base-Alpha-7B 2023StableLM https://huggingface.co/stabilityai/stablelm-base-alpha-7b 7B - 16 - 16 800B tokens - -
StableLM-Tuned-Alpha-7B 2023StableLM https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b 7B StableLM-Base-Alpha-7B 16 - 16 - 632K -
Vicuna-7b-delta-v1.1 vicuna2023 https://github.com/lm-sys/FastChat#vicuna-weights 7B LLaMA-7B 32 - 32 - 70K -
BELLE-7B-0.2M /0.6M /1M /2M belle2023exploring https://huggingface.co/BelleGroup/BELLE-7B-2M 7.1B Bloomz-7b1-mt 30 - 30 - 0.2M/0.6M/1M/2M -
BLOOM-7b1 scao2022bloom https://huggingface.co/bigscience/bloom-7b1 7.1B - 30 - 30 350B tokens - -
BLOOMZ-7b1 /mt /p3 muennighoff2022crosslingual https://huggingface.co/bigscience/bloomz-7b1-p3 7.1B BLOOM-7b1 30 - 30 - 4.19B tokens -
Dolly-v2-12b 2023dolly https://huggingface.co/databricks/dolly-v2-12b 12B Pythia-12B 36 - 36 - 15K -
h2ogpt-oasst1-512-12b 2023h2ogpt https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b 12B Pythia-12B 36 - 36 - 94.6K -
Open-Assistant-SFT-4-12B 2023openassistant https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 12B Pythia-12B-deduped 36 - 36 - 161K -
Pythia-12B biderman2023pythia https://huggingface.co/EleutherAI/pythia-12b 12B - 36 - 36 300B tokens - -
Baize-13B xu2023baize https://huggingface.co/project-baize/baize-lora-13B 13B LLaMA-13B 40 - 40 - 263K -
Koala-13B koala_blogpost_2023 https://huggingface.co/young-geng/koala 13B LLaMA-13B 40 - 40 - 472K -
LLaMA-13B touvron2023llama https://huggingface.co/decapoda-research/llama-13b-hf 13B - 40 - 40 1T tokens - -
StableVicuna-13B 2023StableLM https://huggingface.co/CarperAI/stable-vicuna-13b-delta 13B Vicuna-13B v0 40 - 40 - 613K ✓
Vicuna-13b-delta-v1.1 vicuna2023 https://github.com/lm-sys/FastChat#vicuna-weights 13B LLaMA-13B 40 - 40 - 70K -
moss-moon-003-sft 2023moss https://huggingface.co/fnlp/moss-moon-003-sft 16B moss-moon-003-base 34 - 34 - 1.1M -
moss-moon-003-sft-plugin 2023moss https://huggingface.co/fnlp/moss-moon-003-sft-plugin 16B moss-moon-003-base 34 - 34 - 1.4M -
GPT-NeoX-20B gptneox https://huggingface.co/EleutherAI/gpt-neox-20b 20B - 44 - 44 825GB - -
h2ogpt-oasst1-512-20b 2023h2ogpt https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b 20B GPT-NeoX-20B 44 - 44 - 94.6K -
Baize-30B xu2023baize https://huggingface.co/project-baize/baize-lora-30B 33B LLaMA-30B 60 - 60 - 263K -
LLaMA-30B touvron2023llama https://huggingface.co/decapoda-research/llama-30b-hf 33B - 60 - 60 1.4T tokens - -
LLaMA-65B touvron2023llama https://huggingface.co/decapoda-research/llama-65b-hf 65B - 80 - 80 1.4T tokens - -
GPT3-Davinci brown2020language https://platform.openai.com/docs/models/gpt-3 175B - 96 - 96 300B tokens - -
BLOOM scao2022bloom https://huggingface.co/bigscience/bloom 176B - 70 - 70 366B tokens - -
BLOOMZ /mt /p3 muennighoff2022crosslingual https://huggingface.co/bigscience/bloomz-p3 176B BLOOM 70 - 70 - 2.09B tokens -
ChatGPT (2023.05.01) openaichatgpt https://platform.openai.com/docs/models/gpt-3-5 - GPT-3.5 - - - - ✓ ✓
GPT-4 (2023.05.01) openai2023gpt4 https://platform.openai.com/docs/models/gpt-4 - - - - - - ✓ ✓
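For rows whose Link column points to a Hugging Face checkpoint, loading and querying the model follows the standard `transformers` pattern. The sketch below uses one of the smaller entries from the table; the model id, prompt, and decoding settings are illustrative, and instruction-tuned models such as Dolly generally expect their own prompt template.

```python
# Sketch: loading one of the open checkpoints linked in the table above
# and generating a reply. Model id and decoding settings are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "databricks/dolly-v2-3b"          # any "ckpt"-style row works similarly
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain in one sentence what an LLM benchmark measures."
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```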

Others

Other Awesome Lists

Licenses

MIT license

CC BY-NC-SA 4.0

This project follows the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

If this project is helpful to you, please cite it as follows.

@misc{junwang2023,
  author = {Jun Wang and Changyu Hou and Xiaorui Wang and Pengyong Li and Jingjing Gong and Chen Song and Peng Gao and Qi Shen and Guotong Xie},
  title = {Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers for Large Language Models Evaluation},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/onejune2018/Awesome-LLM-Eval}},
}

Author's bio: