Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs and models, mainly for evaluation of Large Language Models and for exploring the boundaries and limits of Generative AI.
- News
- Tools
- Datasets / Benchmark
- Demos
- Leaderboards
- Papers
- LLM-List
- LLMOps
- Frameworks for Training
- Courses
- Others
- Other-Awesome-Lists
- Licenses
- Citation
- [2023/09/25] We add ColossalEval from Colossal-AI.
- [2023/09/22] We add the Leaderboard chapter.
- [2023/09/20] We add DeepEval, FinEval, and SuperCLUE-Safety from CLUEbenchmark.
- [2023/09/18] We add OpenCompass from Shanghai AI Lab.
- [2023/06/28] We add AlpacaEval and multiple tools.
- [2023/04/26] We released the V0.1 Eval list with multiple benchmarks.
Name | Institute | Link | Date |
---|---|---|---|
LLM Comparator | Google | A visual analytics tool for interactively analyzing results from automatic side-by-side evaluation. It supports interactive workflows for understanding when and why a model performs better or worse than a baseline model, and how the responses from the two models differ qualitatively. | 2024-02-16 |
Evals | OpenAI | https://github.com/openai/evals | |
lm-evaluation-harness | EleutherAI | lm-evaluation-harness | |
Large language model evaluation and workflow framework from Phase AI | wgryc | phasellm | |
Evaluation benchmark for large language models | FreedomIntelligence | LLMZoo | |
Holistic Evaluation of Language Models (HELM) | Stanford | HELM | |
A lightweight evaluation tool for question-answering | Langchain | auto-evaluator | |
PandaLM: Reproducible and Automated Language Model Assessment | WeOpenML | PandaLM | |
FlagEval | Tsinghua University | FlagEval | |
AlpacaEval | tatsu-lab | AlpacaEval |
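Most of the harnesses in the table above (Evals, lm-evaluation-harness, HELM, OpenCompass, PandaLM) automate variations of the same core loop: score every answer option under the model and check whether the highest-scoring option is the gold one. The sketch below is a minimal, illustrative version of that loop with a HuggingFace causal LM; the model name and toy items are placeholders rather than any listed benchmark, and real harnesses add batching, prompt templating, and per-task metrics.

```python
# Minimal sketch of likelihood-based multiple-choice evaluation, the core loop
# that harnesses such as lm-evaluation-harness automate. Assumes
# `pip install torch transformers`; the model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Sum of log-probabilities of the option tokens given the question.
    Simplification: assumes the question tokenization is a prefix of the
    full tokenization, which holds for typical BPE tokenizers."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_prompt = prompt_ids.shape[1]
    return token_lp[0, n_prompt - 1:].sum().item()  # only the option tokens

# Toy items; real harnesses pull these from the benchmark datasets listed below.
items = [
    {"question": "The capital of France is", "options": ["Paris", "Berlin", "Rome"], "answer": 0},
    {"question": "2 + 2 equals", "options": ["3", "4", "5"], "answer": 1},
]

correct = 0
for item in items:
    scores = [option_logprob(item["question"], opt) for opt in item["options"]]
    correct += int(max(range(len(scores)), key=scores.__getitem__) == item["answer"])
print(f"accuracy = {correct / len(items):.2f}")
```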
Data Name | Institution | Website | Description |
---|---|---|---|
TrustLLM Benchmark | TrustLLM | TrustLLM | TrustLLM is a benchmark for assessing the trustworthiness of large language models. This benchmark encompasses six dimensions of trustworthiness and includes over 30 datasets to comprehensively evaluate the capabilities of LLMs, ranging from simple classification to complex generation tasks. Each dataset presents unique challenges and has been used to benchmark 16 mainstream large language models, including both commercial and open-source models, across multiple dimensions of trustworthiness. |
M3Exam | DAMO | M3Exam | A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models. |
KoLA | THU-KEG | KoLA | Knowledge-oriented LLM Assessment benchmark (KoLA), hosted by Knowledge Engineering Group, Tsinghua University (THU-KEG), aims to benchmark LLMs' world knowledge by meticulously designing data, ability taxonomy, and evaluation metrics. |
promptbench | Microsoft | promptbench | PromptBench is a powerful tool to scrutinize and analyze large language models' interaction with prompts. It simulates black-box adversarial prompt attacks and evaluates model performance. The repository provides code, datasets, and instructions for experiments. |
OpenCompass | Shanghai AI Lab | OpenCompass | OpenCompass is an LLM evaluation platform supporting 20+ models over 50+ datasets for comprehensive benchmarking using efficient distributed evaluation techniques. |
JioNLP-LLM Evaluation Dataset | jionlp | JioNLP-LLM Evaluation Dataset | The JioNLP-LLM Evaluation Dataset is used to evaluate general LLM performance, focusing on their assistance to users and whether they reach the level of a "smart assistant." It includes multiple-choice questions from various professional exams and subjective questions to assess common LLM functions. |
BIG-bench | Google | BIG-bench | BIG-bench consists of 204 tasks spanning linguistic, childhood development, mathematical, commonsense reasoning, biological, physical, societal bias, and software development domains. |
BIG-Bench-Hard | Stanford NLP | BIG-Bench-Hard | BIG-Bench-Hard (BBH) contains 23 challenging tasks, where prior model evaluations didn't surpass human-rater performance. |
SuperCLUE | CLUEbenchmark | SuperCLUE | A Chinese benchmark covering basic, professional, and Chinese-specific abilities with a variety of tasks in semantic understanding, dialogue, logic reasoning, role simulation, coding, and more. |
Safety Eval | Tsinghua University | Safety Eval - Safety Large Model Evaluation | An evaluation set by Tsinghua University covering hate speech, prejudice, crime, privacy, ethics, and more, categorized into 40+ safety categories. |
GAOKAO-Bench | OpenLMLab | GAOKAO-Bench | GAOKAO-bench evaluates the language understanding and logical reasoning abilities of large models using Chinese college entrance examination questions. |
Gaokao | ExpressAI | Gaokao | "GaoKao Benchmark" aims to assess and track our progress in achieving human-level intelligence. It provides a comprehensive evaluation of various tasks and domains for comparison with human performance. |
MMLU | paperswithcode.com | MMLU | The MMLU evaluation dataset covers 57 subjects in STEM, humanities, and social sciences, ranging from elementary to professional levels. |
CMMLU | MBZUAI & ShangHai JiaoTong & Microsoft | CMMLU | Measuring massive multitask language understanding in Chinese |
MMCU | Oracle AI Research | MMCU | MMCU evaluates Chinese large models' performance in medical, legal, psychological, and educational domains. |
AGIEval | Microsoft Research | AGIEval | AGIEval comprehensively evaluates base models' cognitive and problem-solving abilities using various official entrance and professional qualification exams. |
C_Eval | SJTU, Tsinghua, University of Edinburgh | C_Eval | C_Eval evaluates models' higher-level knowledge and reasoning abilities across 52 disciplines. |
XieZhi | Fudan University | XieZhi | XieZhi is a comprehensive evaluation suite for Language Models, spanning various disciplines and difficulty levels. |
MT-bench | Multiple Universities | MT-bench | MT-bench is a benchmark with 80 high-quality multi-turn questions designed to test multi-turn conversation and instruction-following ability. |
GLUE Benchmark | Multiple Institutions | GLUE Benchmark | GLUE Benchmark evaluates models' performance in various tasks like grammar, paraphrasing, text similarity, inference, textual entailment, and pronoun resolution. |
OpenAI Moderation API | OpenAI | OpenAI Moderation API | Filters harmful or unsafe content. |
GSM8K | OpenAI | GSM8K | GSM8K is a dataset of linguistically diverse grade school math word problems, testing mathematical problem-solving abilities. |
EleutherAI LM Eval | EleutherAI | EleutherAI LM Eval | Evaluates model performance with few-shot tasks and fine-tuning across multiple tasks. |
OpenAI Evals | OpenAI | OpenAI Evals | Evaluates generated text for accuracy, diversity, consistency, robustness, transferability, efficiency, and fairness. |
AlpacaEval | tatsu-lab | AlpacaEval | An automatic evaluation based on AlpacaFarm evaluation set, comparing responses with reference answers. |
Adversarial NLI (ANLI) | Facebook AI Research, others | Adversarial NLI (ANLI) | Evaluates model robustness, generalization, inference explanations, and efficiency under adversarial samples. |
LIT (Language Interpretability Tool) | Google | LIT | Provides a platform to evaluate and analyze model strengths, weaknesses, and potential biases based on user-defined metrics. |
ParlAI | Facebook AI Research | ParlAI | Evaluates model performance in terms of accuracy, F1 score, perplexity, human ratings, speed, robustness, and generalization. |
CoQA | Stanford NLP Group | CoQA | Evaluates models' comprehension of paragraphs and answering related questions in a conversational context. |
LAMBADA | University of Trento, Fondazione Bruno Kessler | LAMBADA | Measures models' long-term understanding by predicting the last word of paragraphs. |
HellaSwag | University of Washington, Allen Institute for AI | HellaSwag | Evaluates models' reasoning abilities using counterfactual statements. |
LogiQA | Tsinghua University, Microsoft Research Asia | LogiQA | Evaluates models' logical reasoning abilities. |
MultiNLI | Multiple Institutions | MultiNLI | Evaluates models' ability to understand relationships between sentences from different genres. |
SQUAD | Stanford NLP Group | SQUAD | Evaluates models' reading comprehension abilities. |
Open LLM Leaderboard | HuggingFace | Leaderboard | HuggingFace's LLM evaluation leaderboard covering AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA datasets. |
chinese-llm-benchmark | jeinlee1991 | llm-benchmark | Chinese LLM benchmark covering various open-source models and multidimensional evaluations. |
AlpacaEval | tatsu-lab | AlpacaEval | LLM-based automatic evaluation for open-source models' performance. |
Huggingface Open LLM Leaderboard | huggingface | HF Open LLM Leaderboard | Evaluates open-source models on four evaluation sets: AI2 Reasoning Challenge, HellaSwag, MMLU, and TruthfulQA. |
lmsys-arena | Berkeley | lmsys Ranking | Rankings based on the Elo rating mechanism: GPT4 > Claude > GPT3.5 > Vicuna > others. |
CMU Open-Source Chatbot Evaluation | CMU | zeno-build | Evaluates models in dialogue scenarios, ranking ChatGPT > Vicuna > others. |
Z-Bench Chinese ZhenFund Evaluation | ZhenFund | Z-Bench | Evaluates Chinese models, with minor differences; improvements in ChatGLM 6B versions. |
Chain-of-thought Evaluation | Yao Fu | COT Evaluation | Rankings include GSM8k, MATH, and complex problems. |
InfoQ Large Model Comprehensive Evaluation | InfoQ | InfoQ Evaluation | Chinese-oriented ranking including ChatGPT, 文心一言 (ERNIE Bot), Claude, and 星火 (Spark). |
ToolBench Tool Invocation Evaluation | BAAI / Tsinghua | ToolBench | Compares models' tool-use performance against tool-tuned models and ChatGPT. |
AgentBench Inference Decision Evaluation | THUDM | AgentBench | Evaluates models' inference and decision-making abilities in various scenarios like shopping, home, and operating systems. |
FlagEval | BAAI / Tsinghua | FlagEval | Provides an LLM ranking using both subjective and objective scores. |
ChatEval | THU-NLP | ChatEval | Simplifies human evaluation of generated text by involving human raters in discussions. |
Zhujiu | Institute of Automation, CAS | Zhujiu | Multidimensional evaluation covering 7 ability dimensions and 51 tasks in both Chinese and English. |
LucyEval | Oracle | LucyEval | Evaluates Chinese large models' maturity using objective tests across various abilities. |
Do-Not-Answer | Libr-AI | Do-Not-Answer | Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at low cost. The dataset is curated and filtered to consist only of prompts that responsible language models should not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to human and GPT-4 annotation. |
ColossalEval | Colossal-AI | ColossalEval | ColossalEval provides a unified pipeline for evaluating language models on public datasets or your own dataset, using both classic metrics and GPT-assisted evaluation. |
SmartPlay | microsoft | SmartPlay | SmartPlay is a benchmark for Large Language Models (LLMs). It is designed to be easy to use, and to provide a wide variety of games to test agents on. |
LVLM-eHub | OpenGVLab | LVLM-eHub | Multi-Modality Arena is an evaluation platform for large multi-modality models. Following Fastchat, two anonymous models side-by-side are compared on a visual question-answering task. The Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more. |
BLURB | Mindrank AI | BLURB | BLURB is a comprehensive benchmark for PubMed-based biomedical NLP applications, together with a leaderboard for tracking community progress. BLURB includes thirteen publicly available datasets across six diverse tasks. To avoid placing undue emphasis on tasks with many available datasets, such as named entity recognition (NER), BLURB reports the macro average across all tasks as the main score. The BLURB leaderboard is model-agnostic: any system capable of producing the test predictions using the same training and development data can participate. |
SWE-bench | princeton-nlp | SWE-bench | SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. |
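Many of the benchmarks above reduce to a simple extract-and-compare scoring loop. As one concrete example, GSM8K reference answers end with a final line of the form `#### <number>`, so exact-match scoring only needs to pull that number out of the reference and out of the model's output. A minimal sketch, assuming the HuggingFace `datasets` package; the `model_answer()` stub is a placeholder for whatever system is being evaluated.

```python
# Minimal sketch of GSM8K-style exact-match scoring.
# Assumes `pip install datasets`; model_answer() is a placeholder for the
# system under evaluation (an API call, a local model, ...).
import re
from datasets import load_dataset

def extract_final_number(text):
    """GSM8K reference answers end with '#### <number>'; for model output we
    fall back to the last number that appears in the text."""
    if "####" in text:
        text = text.split("####")[-1]
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return numbers[-1].replace(",", "") if numbers else None

def model_answer(question):
    # Placeholder: replace with the model you actually want to score.
    return "Thinking step by step ... so the answer is 42."

dataset = load_dataset("gsm8k", "main", split="test")
subset = dataset.select(range(20))  # small slice just for illustration
correct = sum(
    extract_final_number(model_answer(ex["question"])) == extract_final_number(ex["answer"])
    for ex in subset
)
print(f"exact match on slice: {correct}/{len(subset)}")
```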
- Chat Arena: an open-source AI "anonymous" arena where two models are shown side-by-side and you vote for the better response. You act as a referee, rating the responses of two models whose identities are hidden in advance; after you score them, their real identities are revealed. Current "participants" include Vicuna, Koala, OpenAssistant (oasst), Dolly, ChatGLM, StableLM, Alpaca, LLaMA, and more.
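Arena-style rankings such as the Chat Arena above and the lmsys leaderboard listed in the tables below are typically computed from pairwise human votes with an Elo-style rating. A minimal sketch of the update rule follows; the K-factor, initial rating, and toy votes are illustrative only, not the exact scheme any particular leaderboard uses.

```python
# Minimal Elo-style rating update from pairwise votes, as used (in spirit)
# by arena-style leaderboards. Constants are illustrative only.
K = 32
INITIAL = 1000.0

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(r_a, r_b)
    return r_a + K * (score_a - e_a), r_b + K * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"model_a": INITIAL, "model_b": INITIAL}
votes = [("model_a", "model_b", 1.0), ("model_a", "model_b", 0.5)]  # toy votes
for a, b, s in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], s)
print(ratings)
```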
Models | MMLU 0-shot | MMLU 1-shot | MMLU 3-shot | CEval 0-shot | CEval 1-shot | CEval 3-shot | M3KE 0-shot | Xiezhi-Spec.-Chinese 0-shot | Xiezhi-Spec.-Chinese 1-shot | Xiezhi-Spec.-Chinese 3-shot | Xiezhi-Inter.-Chinese 0-shot | Xiezhi-Inter.-Chinese 1-shot | Xiezhi-Inter.-Chinese 3-shot | Xiezhi-Spec.-English 0-shot | Xiezhi-Spec.-English 1-shot | Xiezhi-Spec.-English 3-shot | Xiezhi-Inter.-English 0-shot | Xiezhi-Inter.-English 1-shot | Xiezhi-Inter.-English 3-shot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Random-Guess | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 | 0.089 |
Generation Probability For Ranking | |||||||||||||||||||
Bloomz-560m | 0.111 | 0.109 | 0.119 | 0.124 | 0.117 | 0.103 | 0.126 | 0.123 | 0.127 | 0.124 | 0.130 | 0.138 | 0.140 | 0.113 | 0.116 | 0.123 | 0.124 | 0.117 | 0.160 |
Bloomz-1b1 | 0.131 | 0.116 | 0.128 | 0.107 | 0.115 | 0.110 | 0.082 | 0.138 | 0.108 | 0.107 | 0.117 | 0.125 | 0.123 | 0.130 | 0.119 | 0.114 | 0.144 | 0.129 | 0.145 |
Bloomz-1b7 | 0.107 | 0.117 | 0.164 | 0.054 | 0.058 | 0.103 | 0.102 | 0.165 | 0.151 | 0.159 | 0.152 | 0.214 | 0.170 | 0.133 | 0.140 | 0.144 | 0.150 | 0.149 | 0.209 |
Bloomz-3b | 0.139 | 0.084 | 0.146 | 0.168 | 0.182 | 0.194 | 0.063 | 0.186 | 0.154 | 0.168 | 0.151 | 0.180 | 0.182 | 0.201 | 0.155 | 0.156 | 0.175 | 0.164 | 0.158 |
Bloomz-7b1 | 0.167 | 0.160 | 0.205 | 0.074 | 0.072 | 0.073 | 0.073 | 0.154 | 0.178 | 0.162 | 0.148 | 0.160 | 0.156 | 0.176 | 0.153 | 0.207 | 0.217 | 0.204 | 0.229 |
Bloomz-7b1-mt | 0.189 | 0.196 | 0.210 | 0.077 | 0.078 | 0.158 | 0.072 | 0.163 | 0.175 | 0.154 | 0.155 | 0.195 | 0.164 | 0.180 | 0.146 | 0.219 | 0.228 | 0.171 | 0.232 |
Bloomz-7b1-p3 | 0.066 | 0.059 | 0.075 | 0.071 | 0.070 | 0.072 | 0.081 | 0.177 | 0.198 | 0.158 | 0.183 | 0.173 | 0.170 | 0.130 | 0.130 | 0.162 | 0.157 | 0.132 | 0.134 |
Bloomz | 0.051 | 0.066 | 0.053 | 0.142 | 0.166 | 0.240 | 0.098 | 0.185 | 0.133 | 0.277 | 0.161 | 0.099 | 0.224 | 0.069 | 0.082 | 0.056 | 0.058 | 0.055 | 0.049 |
Bloomz-mt | 0.266 | 0.264 | 0.248 | 0.204 | 0.164 | 0.151 | 0.161 | 0.253 | 0.198 | 0.212 | 0.213 | 0.189 | 0.184 | 0.379 | 0.396 | 0.394 | 0.383 | 0.405 | 0.398 |
Bloomz-p3 | 0.115 | 0.093 | 0.057 | 0.118 | 0.137 | 0.140 | 0.115 | 0.136 | 0.095 | 0.105 | 0.086 | 0.065 | 0.098 | 0.139 | 0.097 | 0.069 | 0.176 | 0.141 | 0.070 |
llama-7b | 0.125 | 0.132 | 0.093 | 0.133 | 0.106 | 0.110 | 0.158 | 0.152 | 0.141 | 0.117 | 0.142 | 0.135 | 0.128 | 0.159 | 0.165 | 0.161 | 0.194 | 0.183 | 0.176 |
llama-13b | 0.166 | 0.079 | 0.135 | 0.152 | 0.181 | 0.169 | 0.131 | 0.133 | 0.241 | 0.243 | 0.211 | 0.202 | 0.303 | 0.154 | 0.183 | 0.215 | 0.174 | 0.216 | 0.231 |
llama-30b | 0.076 | 0.107 | 0.073 | 0.079 | 0.119 | 0.082 | 0.079 | 0.140 | 0.206 | 0.162 | 0.186 | 0.202 | 0.183 | 0.110 | 0.195 | 0.161 | 0.088 | 0.158 | 0.219 |
llama-65b | 0.143 | 0.121 | 0.100 | 0.154 | 0.141 | 0.168 | 0.125 | 0.142 | 0.129 | 0.084 | 0.108 | 0.077 | 0.077 | 0.183 | 0.204 | 0.172 | 0.133 | 0.191 | 0.157 |
baize-7b (lora) | 0.129 | 0.091 | 0.079 | 0.194 | 0.180 | 0.206 | 0.231 | 0.216 | 0.148 | 0.123 | 0.173 | 0.158 | 0.198 | 0.182 | 0.190 | 0.194 | 0.218 | 0.188 | 0.209 |
baize-7b-healthcare (lora) | 0.130 | 0.121 | 0.106 | 0.178 | 0.174 | 0.178 | 0.203 | 0.178 | 0.146 | 0.123 | 0.266 | 0.107 | 0.118 | 0.175 | 0.164 | 0.173 | 0.197 | 0.231 | 0.198 |
baize-13b (lora) | 0.131 | 0.111 | 0.171 | 0.184 | 0.178 | 0.195 | 0.155 | 0.158 | **0.221** | 0.256 | 0.208 | 0.200 | 0.219 | 0.176 | 0.189 | 0.239 | 0.187 | 0.185 | 0.274 |
baize-30b (lora) | 0.193 | 0.216 | 0.207 | 0.191 | 0.196 | 0.121 | 0.071 | 0.109 | 0.212 | 0.190 | 0.203 | 0.256 | 0.200 | 0.167 | 0.235 | 0.168 | 0.072 | 0.180 | 0.193 |
Belle-0.2M | 0.127 | 0.148 | 0.243 | 0.053 | 0.063 | 0.136 | 0.076 | 0.172 | 0.126 | 0.153 | 0.171 | 0.165 | 0.147 | 0.206 | 0.146 | 0.148 | 0.217 | 0.150 | 0.173 |
Belle-0.6M | 0.091 | 0.114 | 0.180 | 0.082 | 0.080 | 0.090 | 0.075 | 0.188 | 0.149 | 0.198 | 0.188 | 0.188 | 0.175 | 0.173 | 0.172 | 0.183 | 0.193 | 0.184 | 0.196 |
Belle-1M | 0.137 | 0.126 | 0.162 | 0.066 | 0.065 | 0.072 | 0.066 | 0.170 | 0.152 | 0.147 | 0.173 | 0.176 | 0.197 | 0.211 | 0.137 | 0.149 | 0.207 | 0.151 | 0.185 |
Belle-2M | 0.127 | 0.148 | 0.132 | 0.058 | 0.063 | 0.136 | 0.057 | 0.163 | 0.166 | 0.130 | 0.159 | 0.177 | 0.163 | 0.155 | 0.106 | 0.166 | 0.151 | 0.150 | 0.138 |
chatglm-6B | 0.099 | 0.109 | 0.112 | 0.084 | 0.074 | 0.114 | 0.115 | 0.082 | 0.097 | 0.147 | 0.104 | 0.111 | 0.144 | 0.106 | 0.120 | 0.124 | 0.099 | 0.079 | 0.097 |
doctorglm-6b | 0.093 | 0.076 | 0.065 | 0.037 | 0.085 | 0.051 | 0.038 | 0.062 | 0.068 | 0.044 | 0.047 | 0.056 | 0.043 | 0.069 | 0.053 | 0.043 | 0.106 | 0.059 | 0.059 |
moss-base-16B | 0.072 | 0.050 | 0.062 | 0.115 | 0.048 | 0.052 | 0.099 | 0.105 | 0.051 | 0.059 | 0.123 | 0.054 | 0.058 | 0.124 | 0.077 | 0.080 | 0.121 | 0.058 | 0.063 |
moss-sft-16B | 0.064 | 0.065 | 0.051 | 0.063 | 0.062 | 0.072 | 0.075 | 0.072 | 0.067 | 0.068 | 0.073 | 0.081 | 0.066 | 0.071 | 0.070 | 0.059 | 0.074 | 0.084 | 0.075 |
vicuna-7b | 0.051 | 0.051 | 0.029 | 0.063 | 0.071 | 0.064 | 0.059 | 0.169 | 0.171 | 0.165 | 0.134 | 0.201 | 0.213 | 0.182 | 0.209 | 0.195 | 0.200 | 0.214 | 0.182 |
vicuna-13b | 0.109 | 0.104 | 0.066 | 0.060 | 0.131 | 0.131 | 0.067 | 0.171 | 0.167 | 0.166 | 0.143 | 0.147 | 0.178 | 0.121 | 0.139 | 0.128 | 0.158 | 0.174 | 0.191 |
alpaca-7b | 0.135 | 0.170 | 0.202 | 0.137 | 0.119 | 0.113 | 0.142 | 0.129 | 0.139 | 0.123 | 0.178 | 0.104 | 0.097 | 0.189 | 0.179 | 0.128 | 0.200 | 0.185 | 0.149 |
pythia-1.4b | 0.124 | 0.127 | 0.121 | 0.108 | 0.132 | 0.138 | 0.083 | 0.125 | 0.128 | 0.135 | 0.111 | 0.146 | 0.135 | 0.158 | 0.124 | 0.124 | 0.166 | 0.126 | 0.118 |
pythia-2.8b | 0.103 | 0.110 | 0.066 | 0.064 | 0.089 | 0.122 | 0.086 | 0.114 | 0.120 | 0.131 | 0.091 | 0.113 | 0.112 | 0.126 | 0.118 | 0.112 | 0.110 | 0.145 | 0.107 |
pythia-6.9b | 0.115 | 0.070 | 0.084 | 0.078 | 0.073 | 0.094 | 0.073 | 0.086 | 0.094 | 0.092 | 0.097 | 0.098 | 0.085 | 0.091 | 0.088 | 0.083 | 0.099 | 0.099 | 0.096 |
pythia-12b | 0.075 | 0.059 | 0.066 | 0.077 | 0.097 | 0.078 | 0.098 | 0.102 | 0.126 | 0.132 | 0.125 | 0.147 | 0.159 | 0.079 | 0.098 | 0.110 | 0.094 | 0.120 | 0.120 |
gpt-neox-20b | 0.081 | 0.132 | 0.086 | 0.086 | 0.096 | 0.069 | 0.094 | 0.140 | 0.103 | 0.109 | 0.120 | 0.098 | 0.085 | 0.088 | 0.101 | 0.116 | 0.099 | 0.113 | 0.156 |
h2ogpt-12b | 0.075 | 0.087 | 0.078 | 0.080 | 0.078 | 0.094 | 0.070 | 0.065 | 0.047 | 0.073 | 0.076 | 0.061 | 0.091 | 0.088 | 0.050 | 0.065 | 0.105 | 0.063 | 0.067 |
h2ogpt-20b | 0.114 | 0.098 | 0.110 | 0.094 | 0.084 | 0.061 | 0.096 | 0.108 | 0.080 | 0.073 | 0.086 | 0.081 | 0.072 | 0.108 | 0.068 | 0.086 | 0.109 | 0.071 | 0.079 |
dolly-3b | 0.066 | 0.060 | 0.055 | 0.079 | 0.083 | 0.077 | 0.066 | 0.100 | 0.090 | 0.083 | 0.091 | 0.093 | 0.085 | 0.079 | 0.063 | 0.077 | 0.076 | 0.074 | 0.084 |
dolly-7b | 0.095 | 0.068 | 0.052 | 0.091 | 0.079 | 0.070 | 0.108 | 0.108 | 0.089 | 0.092 | 0.111 | 0.095 | 0.100 | 0.096 | 0.059 | 0.086 | 0.123 | 0.085 | 0.090 |
dolly-12b | 0.095 | 0.068 | 0.093 | 0.085 | 0.071 | 0.073 | 0.114 | 0.098 | 0.106 | 0.103 | 0.094 | 0.114 | 0.106 | 0.086 | 0.088 | 0.098 | 0.088 | 0.102 | 0.116 |
stablelm-3b | 0.070 | 0.085 | 0.071 | 0.086 | 0.082 | 0.099 | 0.096 | 0.101 | 0.087 | 0.091 | 0.083 | 0.092 | 0.067 | 0.069 | 0.089 | 0.081 | 0.066 | 0.085 | 0.088 |
stablelm-7b | 0.158 | 0.118 | 0.093 | 0.133 | 0.102 | 0.093 | 0.140 | 0.085 | 0.118 | 0.122 | 0.123 | 0.130 | 0.095 | 0.123 | 0.103 | 0.100 | 0.134 | 0.121 | 0.105 |
falcon-7b | 0.048 | 0.046 | 0.051 | 0.046 | 0.051 | 0.052 | 0.050 | 0.077 | 0.096 | 0.112 | 0.129 | 0.141 | 0.142 | 0.124 | 0.103 | 0.107 | 0.198 | 0.200 | 0.205 |
falcon-7b-instruct | 0.078 | 0.095 | 0.106 | 0.114 | 0.095 | 0.079 | 0.104 | 0.075 | 0.083 | 0.087 | 0.060 | 0.133 | 0.123 | 0.160 | 0.203 | 0.156 | 0.141 | 0.167 | 0.152 |
falcon-40b | 0.038 | 0.043 | 0.077 | 0.085 | 0.090 | 0.129 | 0.087 | 0.069 | 0.056 | 0.053 | 0.065 | 0.063 | 0.058 | 0.059 | 0.077 | 0.066 | 0.085 | 0.063 | 0.076 |
falcon-40b-instruct | 0.126 | 0.123 | 0.121 | 0.070 | 0.080 | 0.068 | 0.141 | 0.103 | 0.085 | 0.079 | 0.115 | 0.082 | 0.081 | 0.118 | 0.143 | 0.124 | 0.083 | 0.108 | 0.104 |
Instruction For Ranking | |||||||||||||||||||
ChatGPT | 0.240 | 0.298 | 0.371 | 0.286 | 0.289 | 0.360 | 0.290 | 0.218 | 0.352 | 0.414 | 0.266 | 0.418 | 0.487 | 0.217 | 0.361 | 0.428 | 0.305 | 0.452 | 0.517 |
GPT-4 | 0.402 | 0.415 | 0.517 | 0.413 | 0.410 | 0.486 | 0.404 | 0.392 | 0.429 | 0.490 | 0.453 | 0.496 | 0.565 | 0.396 | 0.434 | 0.495 | 0.463 | 0.506 | 0.576 |
Statistic | |||||||||||||||||||
Performance-Average | 0.120 | 0.117 | 0.125 | 0.113 | 0.114 | 0.124 | 0.111 | 0.140 | 0.140 | 0.145 | 0.144 | 0.148 | 0.152 | 0.145 | 0.145 | 0.150 | 0.156 | 0.157 | 0.166 |
Performance-Variance | 0.062 | 0.068 | 0.087 | 0.067 | 0.065 | 0.078 | 0.064 | 0.058 | 0.070 | 0.082 | 0.067 | 0.082 | 0.095 | 0.067 | 0.080 | 0.090 | 0.078 | 0.092 | 0.104 |
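The 0-shot / 1-shot / 3-shot columns in the table above refer to how many solved exemplars are prepended to the test question before asking the model to answer. A minimal sketch of k-shot prompt construction follows; the template is illustrative, since each benchmark (MMLU, C-Eval, Xiezhi, ...) defines its own official format.

```python
# Minimal sketch of k-shot prompt construction for multiple-choice evaluation.
# The template is illustrative; each benchmark defines its own official format.
def format_item(question, options, answer=None):
    """Render one multiple-choice item, with or without the gold answer."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(exemplars, test_item, k):
    """Prepend k solved exemplars (the 'shots') before the unanswered test item."""
    shots = [format_item(e["question"], e["options"], e["answer"]) for e in exemplars[:k]]
    return "\n\n".join(shots + [format_item(test_item["question"], test_item["options"])])

exemplars = [
    {"question": "1 + 1 = ?", "options": ["1", "2", "3"], "answer": "B"},
    {"question": "The capital of France is ...", "options": ["Berlin", "Paris"], "answer": "B"},
]
test_item = {"question": "2 + 3 = ?", "options": ["4", "5", "6"]}
print(build_prompt(exemplars, test_item, k=1))  # k=0 gives the 0-shot prompt
```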
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, by Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu and Chenguang Zhu
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, by Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji et al.
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?, by Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga and Diyi Yang
- ChatGPT versus Traditional Question Answering for Knowledge Graphs: Current Status and Future Directions Towards Knowledge Graph Chatbots, by Reham Omar, Omij Mangukiya, Panos Kalnis and Essam Mansour
- Mathematical Capabilities of ChatGPT, by Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier and Julius Berner
- Exploring the Limits of ChatGPT for Query or Aspect-based Text Summarization, by Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen and Wei Cheng
- On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective, by Jindong Wang, Xixu Hu, Wenxin Hou, Hao Chen, Runkai Zheng, Yidong Wang, Linyi Yang, Haojun Huang et al.
- ChatGPT is not all you need. A State of the Art Review of large Generative AI models, by Roberto Gozalo-Brizuela and Eduardo C. Garrido-Merchán
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT, by Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du and Dacheng Tao
- Evaluation of ChatGPT as a Question Answering System for Answering Complex Questions, by Yiming Tan, Dehai Min, Yu Li, Wenbo Li, Nan Hu, Yongrui Chen and Guilin Qi
- ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models, by Ning Bian, Xianpei Han, Le Sun, Hongyu Lin, Yaojie Lu and Ben He
- Holistic Evaluation of Language Models, by Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan et al.
- Evaluating the Text-to-SQL Capabilities of Large Language Models, by Nitarshan Rajkumar, Raymond Li and Dzmitry Bahdanau
- Are Visual-Linguistic Models Commonsense Knowledge Bases?, by Hsiu-Yu Yang and Carina Silberer
- Is GPT-3 a Psychopath? Evaluating Large Language Models from a Psychological Perspective, by Xingxuan Li, Yutong Li, Linlin Liu, Lidong Bing and Shafiq R. Joty
- GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models, by Da Yin, Hritik Bansal, Masoud Monajatipoor, Liunian Harold Li and Kai-Wei Chang
- RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners, by Soumya Sanyal, Zeyi Liao and Xiang Ren
- A Systematic Evaluation of Large Language Models of Code, by Frank F. Xu, Uri Alon, Graham Neubig and Vincent J. Hellendoorn
- Evaluating Large Language Models Trained on Code, by Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda et al.
- GLGE: A New General Language Generation Evaluation Benchmark, by Dayiheng Liu, Yu Yan, Yeyun Gong, Weizhen Qi, Hang Zhang, Jian Jiao, Weizhu Chen, Jie Fu et al.
- Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews, by Mohammad Abdul Hadi and Fatemeh H. Fard
- Do Language Models Perform Generalizable Commonsense Inference?, by Peifeng Wang, Filip Ilievski, Muhao Chen and Xiang Ren
- RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms, by Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara and Xiang Ren
- Evaluation of Text Generation: A Survey, by Asli Celikyilmaz, Elizabeth Clark and Jianfeng Gao
- Neural Language Generation: Formulation, Methods, and Evaluation, by Cristina Garbacea and Qiaozhu Mei
- BERTScore: Evaluating Text Generation with BERT, by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger and Yoav Artzi
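Several of the papers above introduce automatic metrics; BERTScore, for instance, is available as a pip package. A minimal usage sketch (the candidate and reference strings are placeholders):

```python
# Minimal BERTScore usage sketch; install with `pip install bert-score`.
# The candidate and reference sentences are placeholders.
from bert_score import score

candidates = ["The model answers the question correctly."]
references = ["The model gives a correct answer to the question."]

# Returns precision, recall, and F1 tensors, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1 = {F1.mean().item():.4f}")
```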
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Switch Transformer | 1.6T | Decoder(MOE) | - | 2021-01 | Paper |
GLaM | 1.2T | Decoder(MOE) | - | 2021-12 | Paper |
PaLM | 540B | Decoder | - | 2022-04 | Paper |
MT-NLG | 530B | Decoder | - | 2022-01 | Paper |
J1-Jumbo | 178B | Decoder | api | 2021-08 | Paper |
OPT | 175B | Decoder | api / ckpt | 2022-05 | Paper |
BLOOM | 176B | Decoder | api / ckpt | 2022-11 | Paper |
GPT 3.0 | 175B | Decoder | api | 2020-05 | Paper |
LaMDA | 137B | Decoder | - | 2022-01 | Paper |
GLM | 130B | Decoder | ckpt | 2022-10 | Paper |
YaLM | 100B | Decoder | ckpt | 2022-06 | Blog |
LLaMA | 65B | Decoder | ckpt | 2023-02 | Paper |
GPT-NeoX | 20B | Decoder | ckpt | 2022-04 | Paper |
UL2 | 20B | agnostic | ckpt | 2022-05 | Paper |
鹏程.盘古α | 13B | Decoder | ckpt | 2021-04 | Paper |
T5 | 11B | Encoder-Decoder | ckpt | 2019-10 | Paper |
CPM-Bee | 10B | Decoder | api | 2022-10 | Paper |
rwkv-4 | 7B | RWKV | ckpt | 2022-09 | Github |
GPT-J | 6B | Decoder | ckpt | 2022-09 | Github |
GPT-Neo | 2.7B | Decoder | ckpt | 2021-03 | Github |
GPT-Neo | 1.3B | Decoder | ckpt | 2021-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
Flan-PaLM | 540B | Decoder | - | 2022-10 | Paper |
BLOOMZ | 176B | Decoder | ckpt | 2022-11 | Paper |
InstructGPT | 175B | Decoder | api | 2022-03 | Paper |
Galactica | 120B | Decoder | ckpt | 2022-11 | Paper |
OpenChatKit | 20B | - | ckpt | 2023-3 | - |
Flan-UL2 | 20B | Decoder | ckpt | 2023-03 | Blog |
Gopher | - | - | - | - | - |
Chinchilla | - | - | - | - | - |
Flan-T5 | 11B | Encoder-Decoder | ckpt | 2022-10 | Paper |
T0 | 11B | Encoder-Decoder | ckpt | 2021-10 | Paper |
Alpaca | 7B | Decoder | demo | 2023-03 | Github |
Model | Size | Architecture | Access | Date | Origin |
---|---|---|---|---|---|
GPT 4 | - | - | - | 2023-03 | Blog |
ChatGPT | - | Decoder | demo / api | 2022-11 | Blog |
Sparrow | 70B | - | - | 2022-09 | Paper |
Claude | - | - | demo / api | 2023-03 | Blog |
- LLaMA - A foundational, 65-billion-parameter large language model. LLaMA.cpp Lit-LLaMA
- Alpaca - A model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca.cpp Alpaca-LoRA
- Flan-Alpaca - Instruction Tuning from Humans and Machines.
- Baize - Baize is an open-source chat model trained with LoRA. It uses 100k dialogs generated by letting ChatGPT chat with itself.
- Cabrita - A Portuguese instruction-finetuned LLaMA.
- Vicuna - An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.
- Llama-X - Open academic research on improving LLaMA to a SOTA LLM.
- Chinese-Vicuna - A Chinese instruction-following LLaMA-based model.
- GPTQ-for-LLaMA - 4-bit quantization of LLaMA using GPTQ.
- GPT4All - Demo, data, and code to train open-source assistant-style large language models based on GPT-J and LLaMA.
- Koala - A dialogue model for academic research.
- BELLE - Be Everyone's Large Language model Engine.
- StackLLaMA - A hands-on guide to training LLaMA with RLHF.
- RedPajama - An open-source recipe to reproduce the LLaMA training dataset.
- Chimera - Latin Phoenix.
- BLOOM - BigScience Large Open-science Open-access Multilingual Language Model. BLOOM-LoRA
- BLOOMZ&mT0 - A family of models capable of following human instructions in dozens of languages zero-shot.
- Phoenix
- T5 - Text-to-Text Transfer Transformer.
- T0 - Multitask Prompted Training Enables Zero-Shot Task Generalization.
- OPT - Open Pre-trained Transformer Language Models.
- UL2 - A unified framework for pretraining models that are universally effective across datasets and setups.
- GLM - GLM is a General Language Model pretrained with an autoregressive blank-filling objective and can be finetuned on various natural language understanding and generation tasks.
- ChatGLM-6B - ChatGLM-6B is an open-source bilingual (Chinese-English) conversational language model built upon the General Language Model (GLM) architecture, with 6.2 billion parameters. (A minimal loading sketch follows this list.)
- ChatGLM2-6B - The second-generation version of the open-source bilingual dialogue model ChatGLM-6B. ChatGLM2-6B retains the strengths of the first-generation model, such as smooth conversation and a low deployment threshold, while introducing longer context, better performance, and more efficient inference. The project is licensed under the MIT License.
- RWKV - Parallelizable RNN with Transformer-level LLM performance.
- ChatRWKV - ChatRWKV is like ChatGPT but powered by the RWKV (100% RNN) language model.
- StableLM - Stability AI language models.
- YaLM - A GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.
- GPT-Neo - An implementation of model- and data-parallel GPT-3-like models using the mesh-tensorflow library.
- GPT-J - A 6-billion-parameter autoregressive text generation model trained on The Pile.
- Dolly - A cheap-to-build LLM that exhibits a surprising degree of the instruction-following capabilities exhibited by ChatGPT.
- Pythia - Interpreting Autoregressive Transformers Across Time and Scale.
- Dolly 2.0 - The first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset licensed for research and commercial use.
- OpenFlamingo - An open-source reproduction of DeepMind's Flamingo model.
- Cerebras-GPT - A family of open, compute-efficient large language models.
- GALACTICA - The GALACTICA models are trained on a large-scale scientific corpus.
- GALPACA - GALACTICA 30B fine-tuned on the Alpaca dataset.
- Palmyra - Palmyra Base was primarily pre-trained with English text.
- Camel - A state-of-the-art instruction-following large language model designed to deliver exceptional performance and versatility.
- PanGu-α - PanGu-α is a 200B-parameter autoregressive pretrained Chinese language model developed by Huawei Noah's Ark Lab, the MindSpore team, and Peng Cheng Laboratory.
- MOSS - MOSS is an open-source conversational language model that supports Chinese-English bilingual dialogue and a variety of plugins.
- Open-Assistant - A project meant to give everyone access to a great chat-based large language model.
- HuggingChat - Powered by Open Assistant's latest model, currently among the best open-source chat models, and the @huggingface Inference API.
- Baichuan - An open-source, commercially usable large language model developed by Baichuan Intelligent Technology as the follow-up to Baichuan-7B, containing 13 billion parameters. (20230715)
- Qwen - Qwen-7B is the 7B-parameter version of the large language model series Qwen (abbr. Tongyi Qianwen), proposed by Alibaba Cloud. Qwen-7B is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, etc. (20230803)
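As noted in the ChatGLM-6B entry above, the sketch below shows how one of the open chat models in this list can be loaded, following the usage documented on the ChatGLM-6B model card. It assumes a CUDA GPU and `pip install transformers sentencepiece`; exact dependency versions may have changed since the model's release.

```python
# Minimal sketch of loading an open chat model (ChatGLM-6B) with transformers,
# following the model card's documented usage. Assumes a CUDA GPU.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# The model card exposes a chat() helper that keeps track of the dialogue history.
response, history = model.chat(tokenizer, "Hello, please introduce yourself.", history=[])
print(response)
```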
Model | #Author | #Link | #Parameter | Base Model | #Layer | #Encoder | #Decoder | #Pretrain Tokens | #IFT Sample | RLHF |
---|---|---|---|---|---|---|---|---|---|---|
GPT3-Ada | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 0.35B | - | 24 | - | 24 | - | - | - |
Pythia-1B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-1b | 1B | - | 16 | - | 16 | 300B tokens | - | - |
GPT3-Babbage | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 1.3B | - | 24 | - | 24 | - | - | - |
GPT2-XL | radford2019language | https://huggingface.co/gpt2-xl | 1.5B | - | 48 | - | 48 | 40B tokens | - | - |
BLOOM-1b7 | scao2022bloom | https://huggingface.co/bigscience/bloom-1b7 | 1.7B | - | 24 | - | 24 | 350B tokens | - | - |
BLOOMZ-1b7 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-1b7 | 1.7B | BLOOM-1b7 | 24 | - | 24 | - | 8.39B tokens | - |
Dolly-v2-3b | 2023dolly | https://huggingface.co/databricks/dolly-v2-3b | 2.8B | Pythia-2.8B | 32 | - | 32 | - | 15K | - |
Pythia-2.8B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-2.8b | 2.8B | - | 32 | - | 32 | 300B tokens | - | - |
BLOOM-3b | scao2022bloom | https://huggingface.co/bigscience/bloom-3b | 3B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-3b | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-3b | 3B | BLOOM-3b | 30 | - | 30 | - | 8.39B tokens | - |
StableLM-Base-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-3b | 3B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-3B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b | 3B | StableLM-Base-Alpha-3B | 16 | - | 16 | - | 632K | - |
ChatGLM-6B | zeng2023glm-130b,du2022glm | https://huggingface.co/THUDM/chatglm-6b | 6B | - | 28 | 28 | 28 | 1T tokens | ✓ | ✓ |
DoctorGLM | xiong2023doctorglm | https://github.com/xionghonglin/DoctorGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 6.38M | - |
ChatGLM-Med | ChatGLM-Med | https://github.com/SCIR-HI/Med-ChatGLM | 6B | ChatGLM-6B | 28 | 28 | 28 | - | 8K | - |
GPT3-Curie | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 6.7B | - | 32 | - | 32 | - | - | - |
MPT-7B-Chat | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-chat | 6.7B | MPT-7B | 32 | - | 32 | - | 360K | - |
MPT-7B-Instruct | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-instruct | 6.7B | MPT-7B | 32 | - | 32 | - | 59.3K | - |
MPT-7B-StoryWriter-65k+ | MosaicML2023Introducing | https://huggingface.co/mosaicml/mpt-7b-storywriter | 6.7B | MPT-7B | 32 | - | 32 | - | ✓ | - |
Dolly-v2-7b | 2023dolly | https://huggingface.co/databricks/dolly-v2-7b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 15K | - |
h2ogpt-oig-oasst1-512-6.9b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oig-oasst1-512-6.9b | 6.9B | Pythia-6.9B | 32 | - | 32 | - | 398K | - |
Pythia-6.9B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-6.9b | 6.9B | - | 32 | - | 32 | 300B tokens | - | - |
Alpaca-7B | alpaca | https://huggingface.co/tatsu-lab/alpaca-7b-wdiff | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Alpaca-LoRA-7B | 2023alpacalora | https://huggingface.co/tloen/alpaca-lora-7b | 7B | LLaMA-7B | 32 | - | 32 | - | 52K | - |
Baize-7B | xu2023baize | https://huggingface.co/project-baize/baize-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 263K | - |
Baize Healthcare-7B | xu2023baize | https://huggingface.co/project-baize/baize-healthcare-lora-7B | 7B | LLaMA-7B | 32 | - | 32 | - | 201K | - |
ChatDoctor | yunxiang2023chatdoctor | https://github.com/Kent0n-Li/ChatDoctor | 7B | LLaMA-7B | 32 | - | 32 | - | 167K | - |
HuaTuo | wang2023huatuo | https://github.com/scir-hi/huatuo-llama-med-chinese | 7B | LLaMA-7B | 32 | - | 32 | - | 8K | - |
Koala-7B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 7B | LLaMA-7B | 32 | - | 32 | - | 472K | - |
LLaMA-7B | touvron2023llama | https://huggingface.co/decapoda-research/llama-7b-hf | 7B | - | 32 | - | 32 | 1T tokens | - | - |
Luotuo-lora-7b-0.3 | luotuo | https://huggingface.co/silk-road/luotuo-lora-7b-0.3 | 7B | LLaMA-7B | 32 | - | 32 | - | 152K | - |
StableLM-Base-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-base-alpha-7b | 7B | - | 16 | - | 16 | 800B tokens | - | - |
StableLM-Tuned-Alpha-7B | 2023StableLM | https://huggingface.co/stabilityai/stablelm-tuned-alpha-7b | 7B | StableLM-Base-Alpha-7B | 16 | - | 16 | - | 632K | - |
Vicuna-7b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 7B | LLaMA-7B | 32 | - | 32 | - | 70K | - |
BELLE-7B-0.2M /0.6M /1M /2M | belle2023exploring | https://huggingface.co/BelleGroup/BELLE-7B-2M | 7.1B | Bloomz-7b1-mt | 30 | - | 30 | - | 0.2M/0.6M/1M/2M | - |
BLOOM-7b1 | scao2022bloom | https://huggingface.co/bigscience/bloom-7b1 | 7.1B | - | 30 | - | 30 | 350B tokens | - | - |
BLOOMZ-7b1 /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-7b1-p3 | 7.1B | BLOOM-7b1 | 30 | - | 30 | - | 4.19B tokens | - |
Dolly-v2-12b | 2023dolly | https://huggingface.co/databricks/dolly-v2-12b | 12B | Pythia-12B | 36 | - | 36 | - | 15K | - |
h2ogpt-oasst1-512-12b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-12b | 12B | Pythia-12B | 36 | - | 36 | - | 94.6K | - |
Open-Assistant-SFT-4-12B | 2023openassistant | https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 | 12B | Pythia-12B-deduped | 36 | - | 36 | - | 161K | - |
Pythia-12B | biderman2023pythia | https://huggingface.co/EleutherAI/pythia-12b | 12B | - | 36 | - | 36 | 300B tokens | - | - |
Baize-13B | xu2023baize | https://huggingface.co/project-baize/baize-lora-13B | 13B | LLaMA-13B | 40 | - | 40 | - | 263K | - |
Koala-13B | koala_blogpost_2023 | https://huggingface.co/young-geng/koala | 13B | LLaMA-13B | 40 | - | 40 | - | 472K | - |
LLaMA-13B | touvron2023llama | https://huggingface.co/decapoda-research/llama-13b-hf | 13B | - | 40 | - | 40 | 1T tokens | - | - |
StableVicuna-13B | 2023StableLM | https://huggingface.co/CarperAI/stable-vicuna-13b-delta | 13B | Vicuna-13B v0 | 40 | - | 40 | - | 613K | ✓ |
Vicuna-13b-delta-v1.1 | vicuna2023 | https://github.com/lm-sys/FastChat#vicuna-weights | 13B | LLaMA-13B | 40 | - | 40 | - | 70K | - |
moss-moon-003-sft | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.1M | - |
moss-moon-003-sft-plugin | 2023moss | https://huggingface.co/fnlp/moss-moon-003-sft-plugin | 16B | moss-moon-003-base | 34 | - | 34 | - | 1.4M | - |
GPT-NeoX-20B | gptneox | https://huggingface.co/EleutherAI/gpt-neox-20b | 20B | - | 44 | - | 44 | 825GB | - | - |
h2ogpt-oasst1-512-20b | 2023h2ogpt | https://huggingface.co/h2oai/h2ogpt-oasst1-512-20b | 20B | GPT-NeoX-20B | 44 | - | 44 | - | 94.6K | - |
Baize-30B | xu2023baize | https://huggingface.co/project-baize/baize-lora-30B | 33B | LLaMA-30B | 60 | - | 60 | - | 263K | - |
LLaMA-30B | touvron2023llama | https://huggingface.co/decapoda-research/llama-30b-hf | 33B | - | 60 | - | 60 | 1.4T tokens | - | - |
LLaMA-65B | touvron2023llama | https://huggingface.co/decapoda-research/llama-65b-hf | 65B | - | 80 | - | 80 | 1.4T tokens | - | - |
GPT3-Davinci | brown2020language | https://platform.openai.com/docs/models/gpt-3 | 175B | - | 96 | - | 96 | 300B tokens | - | - |
BLOOM | scao2022bloom | https://huggingface.co/bigscience/bloom | 176B | - | 70 | - | 70 | 366B tokens | - | - |
BLOOMZ /mt /p3 | muennighoff2022crosslingual | https://huggingface.co/bigscience/bloomz-p3 | 176B | BLOOM | 70 | - | 70 | - | 2.09B tokens | - |
ChatGPT (2023.05.01) | openaichatgpt | https://platform.openai.com/docs/models/gpt-3-5 | - | GPT-3.5 | - | - | - | - | ✓ | ✓ |
GPT-4 (2023.05.01) | openai2023gpt4 | https://platform.openai.com/docs/models/gpt-4 | - | - | - | - | - | - | ✓ | ✓ |
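Many of the instruction-tuned checkpoints in the table above (Alpaca-LoRA, the Baize models, Luotuo) are LoRA adapters trained on top of a frozen base model. A minimal sketch of attaching a LoRA adapter with the `peft` library; the rank, alpha, and target modules below are illustrative defaults, not the exact settings those projects used.

```python
# Minimal sketch of wrapping a causal LM with a LoRA adapter via peft.
# Hyperparameters are illustrative, not the settings used by Alpaca-LoRA or Baize.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
lora_config = LoraConfig(
    r=8,                        # adapter rank
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection(s) to adapt; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```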
- Evaluating Language Models by OpenAI, DeepMind, Google, Microsoft.
- Awesome LLM - A curated list of papers about large language models.
- Awesome ChatGPT Prompts - A collection of prompt examples to be used with the ChatGPT model.
- awesome-chatgpt-prompts-zh - A Chinese collection of prompt examples to be used with the ChatGPT model.
- Awesome ChatGPT - Curated list of resources for ChatGPT and GPT-3 from OpenAI.
- Chain-of-Thoughts Papers - A trend starting from "Chain of Thought Prompting Elicits Reasoning in Large Language Models".
- Instruction-Tuning-Papers - A trend starting from Natural-Instruction (ACL 2022), FLAN (ICLR 2022) and T0 (ICLR 2022).
- LLM Reading List - A paper & resource list of large language models.
- Reasoning using Language Models - Collection of papers and resources on Reasoning using Language Models.
- Chain-of-Thought Hub - Measuring LLMs' Reasoning Performance
- Awesome GPT - A curated list of awesome projects and resources related to GPT, ChatGPT, OpenAI, LLM, and more.
- Awesome GPT-3 - a collection of demos and articles about the OpenAI GPT-3 API.
This project follows the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you find this project helpful, please cite it.
@misc{junwang2023,
author = {Jun Wang and Changyu Hou and Xiaorui Wang and Pengyong Li and Jingjing Gong and Chen Song and Peng Gao and Qi Shen and Guotong Xie},
title = {Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers for Large Language Models Evaluation},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/onejune2018/Awesome-LLM-Eval}},
}
Author's bio:
- Intro: Responsible for AI Platform Algorithm R&D at PA; previously at IBM, PKU, CAS, and ETH
- Research: Graph/CV, DL, LLM, Remote Sensing, etc. Co-first author of the Large Graph Model (MPG) for Drug Discovery
- Honors: First place in several international competitions such as SemEval2022, MIT AI-Cure, VQA2021, TREC2021, and EAD2019
- Homepage: https://onejune2018.github.io/homepage/
- Google Scholar: https://scholar.google.com/citations?user=0Be01PgAAAAJ&hl=en