main page #1

Open
shm007g opened this issue Apr 19, 2023 · 7 comments

Comments

@shm007g
Owner

shm007g commented Apr 19, 2023

track

@shm007g shm007g closed this as completed May 17, 2023
@shm007g shm007g changed the title experiment log main page May 30, 2023
@shm007g shm007g reopened this May 30, 2023
@shm007g
Owner Author

shm007g commented May 30, 2023

  • Provides insights into the latest models, including parameter counts, fine-tuning datasets and techniques, and hardware specifications.
  • Practical guides for LLM alignment post-training, including training datasets, benchmark datasets, and efficient training libraries and techniques; also includes brief notes on pre-trained LLMs.
  • Explores the path from pre-trained models to post-trained models, with interesting findings along the way.

Catalog

Pre-trained Base Models

Simple Version
  • OpenAI: GPT-1, GPT-2, GPT-3, InstructGPT, Code-davinci-002, GPT-3.5, GPT-4(-8k/32k)
  • Anthropic: Claude-v1, Claude Instant
  • Meta: OPT, Galactica, LLaMA
  • huggingface BigScience: BLOOM (176B), BLOOMZ, mT0
  • EleutherAI: GPT-Neo, GPT-J (6B), GPT-NeoX (20B), Pythia
  • TogetherCompute: GPT-JT, RedPajama-7B, RedPajama-INCITE
  • Berkeley: OpenLLaMA
  • MosaicML: MPT-7B, MPT-7B-Instruct/Chat
  • TII: Falcon-7/40B-(instruct)
  • BlinkDL: RWKV-4-Pile, RWKV-4-PilePlus
  • Tsinghua THUDM: GLM-130B, ChatGLM-6B
  • Cerebras: Cerebras-GPT
  • Google: T5, mT5, LaMDA, Pathways, PaLM, UL2, Flan-T5, Flan-UL2, Bard, PaLM-E, PaLM 2, MoE, Switch Transformer, GLaM, ST-MoE, MoE Routing
  • DeepMind: Gopher, Chinchilla, Sparrow
  • Nvidia: Megatron-Turing NLG (530B)
  • AI21 Studio: Jurassic-1, Jurassic-2

A summary of large language models (A Survey of Large Language Models)

LLM Family Tree

  • OpenAI
    • 2018/06, GPT-1 (117M)
    • 2019/02, GPT-2 (1.5B)
    • 2020/06, GPT-3 (175B): ada(350M), babbage(1.3B), curie(6.7B), davinci(175B), detail here
    • 2022/01, InstructGPT-3: text-ada(350M), text-babbage(1.3B), text-curie(6.7B), text-davinci-001(175B)
    • 2022/02, Code-davinci-002
    • GPT-3.5 (175B): text-davinci-002 (2022/03), text-davinci-003 (2022/11), ChatGPT (2022/11), gpt-3.5-turbo (2023/03)
    • 2023/03, GPT-4(-8k/32k)
  • Anthropic
    • Claude-v1: 2023/03, state-of-the-art high-performance model, context window 9k/100k tokens
    • Claude Instant: 2023/03, lighter, less expensive, and much faster option, context window 9k/100k tokens
  • Meta
    • OPT (125M/350M/1.3B/2.7B/6.7B/13B/30B/66B/175B): 2022/03, pre-trained on a mix of the datasets used in RoBERTa, the Pile, and PushShift.io Reddit using metaseq, with 1/7th the carbon footprint of GPT-3; training combines Meta’s open source Fully Sharded Data Parallel (FSDP) API with NVIDIA’s tensor parallel abstraction within Megatron-LM; the data is predominantly English text with a small amount of non-English data via CommonCrawl; released under a noncommercial license.
    • OPT-IML (30B/175B): 2022/12, create OPT-IML Bench, a large benchmark for Instruction MetaLearning (IML) of 2000 NLP tasks; train OPT-IML which are instruction-tuned versions of OPT
    • Galactica (125M/1.3B/6.7B/30B/120B): 2022/11, facebook/galactica models are designed to perform scientific tasks, include prompts in pre-training alongside the general corpora, under a non-commercial CC BY-NC 4.0 license
    • LLaMA (7B/13B/33B/65B): 2023/02, LLaMA 65B/33B trained on 1.4 trillion tokens, LLaMA 7B on one trillion tokens, text chosen from the 20 languages with the most speakers; weights leaked; code under GPL-3.0, weights under a non-commercial license.
  • huggingface BigScience
    • BLOOM (176B): 2022/07/11, a multilingual LLM trained on the ROOTS corpus (a composite collection of 498 Hugging Face datasets), using a 250k-token vocabulary, seq-len 2048, smaller size model search here, released under the commercial-friendly BigScience Responsible AI License.
    • BLOOMZ & mT0: 2022/11, finetune BLOOM & mT5 on our crosslingual task instruction following mixture (xP3), released under commercial friendly bigscience-bloom-rail-1.0 License.
  • EleutherAI
    • The Pile: 2020/12/31, a 300B (deduplicated 207B) token open source English-only language modelling dataset, download here.
    • GPT-Neo (125M/1.3B/2.7B)(Deprecated): 2021/03/21, A set of decoder-only LLMs trained on the Pile, MIT license.
    • GPT-J (6B): 2021/06/04, EleutherAI/gpt-j-6b, English language model trained on the Pile using mesh-transformer-jax library, seq-len 2048, Apache-2.0 license.
    • GPT-NeoX (20B): 2022/02/10, EleutherAI/gpt-neox-20b, English language model trained on the Pile using GPT-NeoX library, seq-len 2048, Apache-2.0 license.
    • Pythia (70M/160M/410M/1B/1.4B/2.8B/6.9B/12B): 2023/02/13, a suite of 8 model sizes on 2 different datasets: the Pile, the Pile deduplication, using gpt-neox library, seq-len 2048, Apache-2.0 license.
  • TogetherCompute
    • GPT-JT (6B): 2022/11/29, A fork of GPT-J-6B, fine-tuned on 3.53 billion tokens with open-source dataset and techniques, outperforms most 100B+ parameter models at classification.
    • RedPajama-Pythia-7B: 2023/04/17, release RedPajama-Data-1T for reproducing "LLaMA" foundation models in a fully open-source way; 40% RedPajama-Data-1T trained RedPajama-Pythia-7B beat Pythia-7B trained on the Pile and StableLM-7B with higher HELM score, still weaker than LLaMA-7B for now; detail see blog1, blog2 and Card.
    • OpenChatKit: 2023/03/10, fine-tuned for chat from EleutherAI’s GPT-NeoX-20B with over OIG-43M instructions dataset; contributing to a growing corpus of open instruction following dataset.
    • RedPajama-INCITE (3B/7B): 2023/05/05, open-source 3B model (base/chat/instruct) trained on 800B tokens and finetuned, the strongest model in its class, bringing LLMs to a wide variety of hardware; the 7B model at 80% of training (800B tokens) beats same-class GPT-J/Pythia/LLaMA on HELM and lm-evaluation-harness; releasing RedPajama v2 with 2T tokens (mixing the Pile dataset into RedPajama, with more code such as the Stack); Apache 2.0 license.
    • Berkeley/OpenLLaMA: open source reproduction of Meta AI’s LLaMA 7B/3B trained on the RedPajama dataset, provide PyTorch and JAX weights, Apache-2.0 license.
  • MosaicML
    • MPT (MosaicML Pretrained Transformer, 7B(6.7B)): 2023/05/05, a GPT-style decoder-only transformer trained from scratch on 1T tokens of text and code (RedPajama, mC4, C4, the Stack Dedup, Semantic Scholar ORC) in 9.5 days at a cost of ~$200k, with ALiBi (handles inputs up to 65k tokens) and other optimization techniques, matches the quality of LLaMA-7B; open source for commercial use, Apache-2.0 License.
    • MPT-7B-Instruct/Chat: finetuning MPT-7B on instruction following dataset and dialogue generation dataset; release mosaicml/dolly_hhrlhf dataset derived from Databricks Dolly-15k and Anthropic’s Helpful and Harmless datasets; CC-By-SA-3.0 (commercially-usable) / CC-By-NC-SA-4.0 (non-commercial use only).
  • TII (Technology Innovation Institute)
    • Falcon-7/40B-(instruct): 2023/05/26, pretrained on 1500/1000B tokens of RefinedWeb (apache-2.0) enhanced with curated corpora, finetuned on a mixture of chat/instruct datasets like Baize, No. 1 at huggingface Open LLM Leaderboard at the end of May 2023; change license to Apache 2.0 on June 01.
  • BlinkDL
    • RWKV-4-Pile (169M/430M/1.5B/3B/7B/14B): 2023/04, RWKV: Reinventing RNNs for the Transformer Era, leverages RNN with a linear attention mechanism, trained on the Pile, infinite seq-len, Weights.
    • RWKV-4-PilePlus (7B/14B): 2023/04, finetuning on [RedPajama + some of Pile v2 = 1.7T tokens].
  • Tsinghua THUDM
    • GLM-130B: 2022/10, An Open Bilingual Pre-Trained Model, support english and chinese, trained on 400B text tokens using GLM library, Apache-2.0 license.
    • ChatGLM-6B: 2023/03, trained with 1T chinese and english tokens, finetuned with instruction following QA and dialogue dataset in chinese language, released under Apache-2.0 license, authorization needed.
  • Cerebras
    • Cerebras-GPT: 2023/03, a family of seven GPT models ranging from 111M to 13B, trained on the EleutherAI Pile dataset using the Chinchilla formula, released under the Apache 2.0 license
  • Google
  • DeepMind
    • 2021/12, Gopher (280B), a SOTA LLM that can do instruction following and dialogue interaction
    • 2022/04, Chinchilla (70B), a 4x smaller model trained on 4x more data (1.4T tokens) that outperforms Gopher
    • 2022/09, Sparrow, Building safer dialogue agents; designed to talk, answer, and search using Google, supports it with evidence
  • Nvidia
    • 2021/10, Megatron-Turing NLG (530B), the largest model trained with Nvidia's novel parallelism techniques
  • AI21 Studio
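
The "Chinchilla formula" mentioned for Cerebras-GPT is often summarized as a rule of thumb of roughly 20 training tokens per parameter. The sketch below only illustrates that arithmetic; the constant 20 and the function name `compute_optimal_tokens` are assumptions for illustration, not values taken from this thread.

```python
# Rule of thumb (assumption): about 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: int, tokens_per_param: int = TOKENS_PER_PARAM) -> int:
    """Return the approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param


if __name__ == "__main__":
    models = [
        ("Cerebras-GPT 111M", 111_000_000),
        ("Cerebras-GPT 13B", 13_000_000_000),
        ("Chinchilla 70B", 70_000_000_000),
    ]
    for name, n_params in models:
        tokens = compute_optimal_tokens(n_params)
        print(f"{name}: ~{tokens / 1e9:.1f}B tokens")
```

Under this rule of thumb a 70B model lands at about 1.4T tokens, which is the regime Chinchilla was trained in.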

Licences

  • Apache 2.0: Allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software under the terms of the license, without concern for royalties.
  • MIT: Similar to Apache 2.0 but shorter and simpler. Also, in contrast to Apache 2.0, does not require stating any significant changes to the original code.
  • CC-BY-SA-4.0: Allows (i) copying and redistributing the material and (ii) remixing, transforming, and building upon the material for any purpose, even commercially. But if you do the latter, you must distribute your contributions under the same license as the original. (Thus, may not be viable for internal teams.)
  • CC-By-NC-SA-4.0: NC for non-commercial.
  • BSD-3-Clause: This version allows unlimited redistribution for any purpose as long as its copyright notices and the license's disclaimers of warranty are maintained.
  • OpenRAIL-M v1: Allows royalty-free access and flexible downstream use and sharing of the model and modifications of it, and comes with a set of use restrictions (see Attachment A)

@shm007g
Owner Author

shm007g commented May 30, 2023

Track of Open LLMs

$\color{red}{\textsf{Refactoring, Coming soon}}$

  • 05/26: Falcon-40B, foundational LLM with 40 billion parameters trained on one trillion tokens, first place at huggingface Open LLM Leaderboard for now, 7B also released (blog, model, Leaderboard)
  • 05/25: BLIP-Diffusion, a BLIP multi-modal LLM pre-trained subject representation enables zero-shot subject-driven image generation, easily extended for novel applications (tweet, blog)
  • 05/24: C-Eval, is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels (tweet, repo)
  • 05/24: Guanaco-QLoRA, 33B/65B model finetuned on a single 24/48GB GPU in only 12/24h with new QLoRA 4-bit quantization, using small but with quality dataset OASST1 (tweet, repo, demo)
  • 05/23: MMS (Massively Multilingual Speech), released by Meta AI, can do speech-to-text and text-to-speech in 1100 languages, with half the word error rate of OpenAI Whisper while covering 11 times more languages. (tweet, blog, repo)
  • 05/22: Stanford AlpacaFarm replicates the RLHF process at a fraction of the time (<24h) and cost (<$200), enabling the research community to advance instruction following research (blog, repo)
  • 05/22: LIMA, Less is More for Alignment (Meta AI), LLaMA 65B + 1000 standard supervised samples = {GPT4, Bard} level performance, without RLHF. (tweet, paper)
  • 05/21: 4-bit QLoRA via bitsandbytes (4-bit base model + LoRA) (tweet)
  • 05/20: InstructBLIP Vicuna-13B, generates text based on text and image inputs, and follows human instructions. (tweet, demo, repo)
  • 05/18: CodeT5+, LLM for code understanding and generation (tweet, blog)
  • 05/18: PKU-Beaver, the first chinese open-source RLHF framework developed by PKU-Alignment team at Peking University. Provide a large human-labeled dataset (up to 1M pairs) including both helpful and harmless preferences to support reproducible RLHF research. (blog, repo, data)
  • 05/17: Tree of Thoughts(TOT), GPT-4 Reasoning is Improved 900% with this new prompting (video, paper, repo)
  • 05/13: LaWGPT, a chinese Law LLM, extend chinese law vocab, pretrained on large corpus of law specialty (repo)
  • 05/10: Multimodal-GPT, a multi-modal LLM Based on the open-source multi-modal model OpenFlamingo support tuning vision and language at same time, using parameter efficient tuning with LoRA (tweet, repo)
  • 05/10: DetGPT, a multi-modal LLM addressing reasoning-based object detection problem, could interpret user instruction and automatically locate the object of interest, with only little part of whole model fine-tuned. (blog, repo)
  • 05/10: SoftVC VITS Singing Voice Conversion, an open-source project developed to allow the developers' favorite anime characters to sing. Popular because it has been used to generate songs in a particular singer's voice. (repo)
  • 05/10: ImageBind, One Embedding Space To Bind Them All (FAIR, Meta AI), learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ with small fine-tune dataset. (tweet, blog, repo)
  • 05/04: TIP(Dual Text-Image Prompting), a DALLE2/StableDiffusion-2 enhanced LLM that can generate coherent and authentic multimodal procedural plans toward a high-level goal (tweet)
  • 05/04: GPTutor, a ChatGPT-powered tool for code explanation (tweet)
  • 05/04: Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings (blog, tweet)
  • 05/03: Modular/Mojo, a new Python-compatible language with a parallelizing compiler that can import Python libraries, combines the usability of Python with the performance of C, unlocking unparalleled programmability of AI hardware and extensibility of AI models. Only Limited notebook released for now. (tweet, blog, doc)
  • 05/01: VPGTrans: Transfer Visual Prompt Generator across LLMs, a multi-modal LLM released by NUS that reduces training cost by 10 times. (blog, repo)
  • 05/01: "Are Emergent Abilities of Large Language Models a Mirage?" alternative explanation for emergent abilities, strong supporting evidence that emergent abilities may not be a fundamental property of scaling AI models. (paper)
  • 05/01: A brief history of LLaMA models (tweet, blog)
  • 05/01: Geoffrey Hinton left Google. IBM says it can replace over 7500 current employees with AI. Chegg's stock price dropped 40%.
  • 04/30: PandaLM, provide reproducible and automated comparisons between different large language models (LLMs). (tweet, repo)
  • 04/30: Otter, a multi-modal chatbot that learns to perform tasks through rich instructions on media content (tweet, repo)
  • 04/30: Linly-ChatFlow, Shenzhen University releases Linly-ChatFlow-7B/13B/33B/65B, fine-tuned from pre-trained Chinese-LLaMA on English and Chinese instruction datasets (repo)
  • 04/29: MLC-LLM, an open framework that brings LLMs directly into a broad class of platforms (iPhone, CUDA, Vulkan, Metal) with GPU acceleration! (tweet, blog, repo)
  • 04/29: Lamini: The LLM engine for rapidly customizing models without spinning up any GPUs (tweet, blog, repo, doc)
  • 04/29: FastChat-T5, a compact and commercial-friendly chatbot, Fine-tuned from Flan-T5, Outperforms Dolly-V2 with 4x fewer parameters (tweet, repo)
  • 04/29: StabilityAI/StableVicuna, Carper AI from StabilityAI family release RLHF-trained version of Vicuna-13B! (tweet, blog, model)
  • 04/29: StabilityAI/DeepFloyd IF, a powerful text-to-image model that can smartly integrate text into images, utilizes T5-XXL-1.1 as text encoder (tweet, blog)
  • 04/29: MosaicML/SD2, Training Stable Diffusion from Scratch for <$50k with MosaicML (tweet, blog)
  • 04/29: gpt4free, use gpt-4/3.5 free from sites (repo)
  • 04/29: OpenRL is an open-source general reinforcement learning research framework that supports training for various tasks such as single-agent, multi-agent, and natural language. Developed based on PyTorch by chinese company 4paradigm (repo)
  • 04/28: Chinese-LLaMA-Plus-7B, re-pretrains LLaMA on a larger (120G) general corpus, fine-tunes with a 4M instruction dataset, uses a bigger LoRA rank for less precision loss, beats the former 13B model on benchmarks (repo)
  • 04/28: AudioGPT, a multi-modal GPT model can understand audio/text/image instruction inputs and generate audio, song, style transfer speech, talking head synthesis video (blog, repo, demo)
  • 04/28: Multimodal-GPT, released by the famous MMLab, build base on open-source multi-modal model OpenFlamingo with visual and language instructions (repo)
  • 04/27: "Speed is all you need", generates a 512 × 512 image with 20 iterations on GPU-equipped mobile devices in under 12 seconds for Stable Diffusion 1.4 without INT8 quantization, with over 50% latency reduction on Samsung S23 Ultra. (paper)
  • 04/27: replit-code-v1-3b, it's a 2.7B parameters LLM trained entirely on code in 10 days, performs 40% better than comparable models on benchmark (tweet, model)
  • 04/26: LaMini-LM, a diverse set of 15 (more coming) mini-sized models (up to 1.5B) distilled from 2.6M instructions, comparable in performance to Alpaca-7B in downstream NLP + human eval (tweet, repo, data)
  • 04/26: huggingChat, a 30B OpenAssistant/oasst-sft-6-llama-30b-xor LLM deployed by huggingface (tweet, site, model)
  • 04/26: LLM+P, takes in a planning problem description, turns it into PDDL, and leverages classical planners to find a solution (tweet, paper, repo)
  • 04/25: NeMo Guardrails, the new toolkit for easily developing trustworthy LLM-based conversational applications (tweet)
  • 04/21: Fudan University (China) releases its 16B LLM named MOSS-003; the MOSS dataset contains ~1.1M self-instruct samples generated by text-davinci-003, including ~300k plugin samples such as text-to-image and equations; fp16 fine-tuning on 2 A100s or 4/8-bit fine-tuning on a single 3090. (repo)
  • 04/21: Phoenix, a new multilingual LLM that achieves competitive performance, vast collection of popular open source dataset (repo)
  • 04/20: UltraChat, an informative and diverse multi-round chat dataset gathered by the THUNLP lab (repo, data)
  • 04/20: replicate ChatGLM with efficient fine-tuning (p-tuning, LoRA, freeze) (repo); langchain support in the langchain-ChatGLM project
  • 04/19: StableLM, 3B/7B LLM from StabilityAI (tweet, blog)
  • 04/18: Semantic Kernel, MSFT release its contextual memory tool like langchain/gptindex (repo)
  • 04/17: LLaVA: Large Language and Vision Assistant, Visual Instruction Tuning (blog, repo, demo)
  • 04/17: MiniGPT-4, multi-modal LLM like GPT4, consists of a vision encoder with a pretrained ViT and Q-Former, a single linear projection layer, and an advanced Vicuna large language model (blog, repo)
  • 04/17: TogetherCompute/RedPajama, reproduce LLaMA with 1.2 trillion tokens (blog, tweet)
  • 04/16: LAION-AI/Open-Assistant, an open-source chat model (includes datasets: a ~161K human-annotated assistant-style conversation corpus, covering 35 different languages and annotated with ~461K quality ratings) (tweet, repo, models)
  • 04/15: WebLLM, an open-source chatbot that brings LLMs like Vicuna directly onto web browsers (tweet, blog, repo)
  • 04/12: Dolly-v2-12b, Databricks release its open source Dolly-v2-12b model, derived from EleutherAI’s Pythia-12b and fine-tuned on a ~15K record instruction corpus generated by Databricks employees, which is open source as well (blog, repo, model)
  • 04/12: DeepSpeed Chat, DeepSpeed from MSFT supports RLHF fine-tuning on affordable hardware (blog)
  • 04/12: Text-to-SQL from self-debugging explanation component (tweet)
  • 04/11: AgentGPT, generative agents were able to simulate human-like behavior in an interactive sandbox (tweet)
  • 04/11: AutoGPT, autonomously achieve whatever goal you set (repo)
  • 04/11: Raven v8 14B released (tweet, model, repo)
  • 04/09: SVDiff, diffusion fine-tune method smaller than LoRA (tweet, repo)
  • 04/09: RPTQ, new 3 bit quantization (repo, paper)
  • 04/08: Wonder Studio, robot beats human at kung fu (tweet)
  • 04/08: chatGDB, chatgpt for GDB (tweet, repo)
  • 04/08: Vicuna-7B, small yet capable (repo); Vicuna shows impressive performance against GPT4 according to the latest paper from MSFT Research (tweet)
  • 04/07: Instruction tuning with GPT4, academic self-instruct guide from microsoft research (tweet, blog, repo, paper)
  • 04/07: ChatPipe, Orchestrating Data Preparation Program by Optimizing Human-ChatGPT Interactions (blog)
  • 04/07: Chinese-LLaMA-Alpaca release its 13B model (tweet)
  • 04/07: MathPrompter, how to chat with the GPT-3 Davinci API and achieve better results on math benchmarks (paper)
  • 04/07: engshell, interact with your shell using english language (tweet)
  • 04/07: a Chinese geek fine-tuned a chatglm-6b model on his WeChat dialogues and blog posts to produce a digital version of himself (tweet)
  • 04/06: StackLLaMA, A hands-on guide to train LLaMA with RLHF, fine-tuned on stack exchange QA data (tweet, blog, demo)
  • 04/06: Arxiv Chat, chat with the latest papers (tweet)
  • 04/06: Firefly, a 1.4B/2.6B chinese chat LLM, finetune on 1.1M multi-task dataset (repo)
  • 04/06: a chinese guide of chatgpt repo
  • 04/06: LamaCleaner, segment anything and inpaint anything (tweet)
  • 04/05: SAM, Meta AI release Segment Anything Model as foundation model for image segmentation, and SA-1B dataset, which is 400x larger than any existing segmentation dataset (tweet)
  • 04/04: a beautiful cli for chatgpt (tweet)
  • 04/04: Baize, fine-tune with LoRA using 100K dialogs ChatGPT self-chat and other opensource dataset, released 7B, 13B and 30B models (repo, tweet, demo, model)
  • 04/03: Koala-13B, fine-tuned from LLaMA on user-shared conversations and open-source datasets, performs similarly to Vicuna (blog, demo, repo)
  • 04/02: LMFlow, train on single 3090 for 5 hours and get your own chatgpt (blog, repo)
  • 04/01: Alpaca-CoT, extend CoT data to Alpaca to boost its reasoning ability, provide gathered datasets (repo)
  • 04/01: Vicuna-13B, An Open-Source Chatbot Impressing GPT-4 with 90% ChatGPT Quality, fine-tune LLaMA on ~70K conversations from ShareGPT (blog, repo, demo, data, gptq-4-bit)
  • 04/01: Twitter's Recommendation Algorithm (repo)
  • 04/01: PolyglotSiri Apple Shortcut, (tweet, repo)
  • 04/01: Official Apple Core ML Stable Diffusion Library, M-series chips beat 4090, (repo, MochiDiffusion, swift-coreml-diffusers)
  • 03/31: BloombergGPT, 50B LLM outperform existing models on financial tasks (tweet)
  • 03/31: HuggingGPT, an interface for LLMs to connect AI models for solving complicated AI tasks (tweet, demo)
  • 03/31: Llama-X (repo)
  • 03/31: GPT4 UI generation (tweet)
  • 03/30: ChatExplore (tweet)
  • 03/30: ColossalChat, from ColossalAI (demo, tweet, medium, repo, serve)
  • 03/30: ChatGLM-6B, from THUDM(Tsinghua University), code and data not release (repo, model)
  • 03/29: Uncle Rabbit, the first conversational holographic AI (tweet, blog)
  • 03/29: chatgpt instead of siri (tweet)
  • 03/29: LLaMA-Adapter, fine-tuning LLaMA with 1.2M learnable parameters in 1 hour on 8 A100 (tweet, repo, demo)
  • 03/28: Chinese-LLaMA-Alpaca, add 20K chinese sentencepiece tokens to vocab and pre-trained LLaMA in 2 steps, fine-tuned LLaMA on a 2M chinese corpus using Alpaca-LoRA, 7B model released, dataset not (repo, tweet, blog, model)
  • 03/28: gpt4all, fine-tune LLaMa using LoRA with ~800k gpt3.5-turbo generations, include clean assistant data including code, stories and dialogue (repo, model, data)
  • 03/24: Dolly, Databricks fine-tune alpaca dataset on gpt-j-6b (repo)
  • 03/22: Alpaca-LoRA-Serve, gradio based chatbot service (tweet, repo)
  • 03/22: Alpaca-LoRA, reproducing the Stanford Alpaca results using low-rank adaptation(LoRA) on RTX4090 and run on a Raspberry Pi 4 (tweet, repo, demo, model, blog, reproduce tweet, zhihu, sina, explain)
  • 03/22: BELLE, fine-tunes BLOOMZ-7B1-mt and LLaMA (7B/13B) on a 1.5M Chinese dataset generated in the Alpaca way (repo, model)
  • 03/17: instruct-gpt-j, NLPCloud team fine-tune GPT-J using Alpaca's dataset (blog, model)
  • 03/13: Stanford Alpaca, fine-tunes LLaMA 7B with a 52K single-turn instruction-following dataset generated from OpenAI’s text-davinci-003 (blog, repo)
  • 03/11: ChatIE, solving Zero-Shot Information Extraction problem by enhancing ChatGPT with CoT prompting, achieve good performance on primary IE benchmarks (repo)
  • prompt engineering guide (blog), openai best practices (blog), prompt perfect (blog), prompt searching (repo), PromptInject (repo), auto prompt engineering (blog)
  • peft: State-of-the-art Parameter-Efficient Fine-Tuning (PEFT) methods (repo)
  • GPTQ-for-LLaMa: 4 bits quantization of LLaMA using GPTQ (repo)
  • llama.cpp: Inference of LLaMA model in pure C/C++, support different hardware platform & models, support 4-bit quantization using ggml format (repo, alpaca.cpp); support python bindings (llama-cpp-python, pyllamacpp, llamacpp-python )
  • llama_index: connect LLM with external data (repo), like langchain (repo)
  • llama-dl: high speed download of LLaMA model (repo(deprecated), model)
  • text-generation-webui: A gradio web UI for deploy LLMs like GPT-J, LLaMA (repo)
  • tldream/lama-cleaner: tiny little diffusion drawing app (repo1, repo2)
  • A1111-Web-UI-Installer: A gradio web UI for deploy stable diffusion models (repo)
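
The Guanaco-QLoRA entry above says 33B/65B models can be fine-tuned on a single 24/48GB GPU with 4-bit quantization. A back-of-the-envelope check of the weight memory alone makes that plausible. This is a minimal sketch: it ignores activations, LoRA adapter weights, optimizer state, and quantization overhead, and the function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Simplifying assumption: every parameter is stored at bits_per_param bits,
    with no overhead for quantization constants or anything else.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for size in (33, 65):
        fp16 = weight_memory_gb(size, 16)
        int4 = weight_memory_gb(size, 4)
        print(f"{size}B: fp16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

At 4 bits the 65B weights take roughly 32.5 GB and the 33B weights roughly 16.5 GB, which leaves headroom on 48 GB and 24 GB cards respectively for the rest of the training state.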

@shm007g
Owner Author

shm007g commented May 30, 2023

Instruction and Conversational Datasets

Pre-training Datasets

  • the Pile
  • RedPajama-Data-1T
  • C4
  • mC4
  • the Stack

Efficient Training Library

@shm007g
Owner Author

shm007g commented May 30, 2023

Open Source Aligned LLMs

| Model | Date | Base | Size (B) | Weight | Data | Licence | Context Len | Demo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dolly-v2 | 2023/04/12 | Pythia | 3/6.9/12 | databricks/dolly-v2-12b | databricks-dolly-15k | Apache-2.0 | 2048 | |
| Dolly-v1-6b | 2023/03/24 | GPT-J | 6 | databricks/dolly-v1-6b | Stanford Alpaca | Apache-2.0 | 2048 | |
| RWKV-4-Raven | 2023/04 | RWKV-4-Pile | 1.5/3/7/14 | BlinkDL/rwkv-4-raven | Alpaca, CodeAlpaca, Guanaco, GPT4All, ShareGPT and more | Apache-2.0 | Infinite | space |
| BLOOMZ & mT0 | 2022/11 | BLOOM/mT5 | 0.56/1.1/1.7/3/7.1/176 | bigscience/bloomz | xP3 | bigscience-bloom-rail-1.0 | 2048 | |
| OpenAssistant | 2023/04/16 | Pythia/LLaMA | 1.4/6.9/12/30 | OpenAssistant | OASST1 | Apache-2.0 | 2048 | site |

Reference

@shm007g
Owner Author

shm007g commented May 30, 2023

Evaluation Benchmark

Star History

Stargazers over time

@shm007g
Owner Author

shm007g commented May 30, 2023

Evaluation Dilemma

  • Benchmark datasets may have been included in pre-training or post-training data, in which case they cannot evaluate the LLM properly
  • LLMs can evolve with specific data, while an evaluation only takes place on one snapshot
  • There are too many benchmarks to run if you want a comprehensive evaluation
  • Generative LLMs always need human evaluation or prompting actions, which costs too much
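
The first point above (benchmark data leaking into training data) is often probed with a simple n-gram overlap check. The sketch below is one rough way to do that, assuming whitespace tokenization and exact n-gram matches; the function names and the default `n=8` are illustrative choices, not something specified in this thread.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lower-cased whitespace-token n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: List[str],
                       corpus_docs: Iterable[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    if not benchmark_items:
        return 0.0
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_ngrams)
    return flagged / len(benchmark_items)


if __name__ == "__main__":
    items = ["the quick brown fox jumps", "an unrelated question here"]
    corpus = ["we saw the quick brown fox yesterday"]
    print(contamination_rate(items, corpus, n=3))
```

A real check would need a proper tokenizer and a scalable index over the training corpus; this version only shows the idea.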

Evaluation in an Affordable Way

Human Exams

  • professional language proficiency exams (PaLM2)
    • Chinese: HSK汉语水平考试
    • Japanese: J-Test日本语检定
    • French: TCF Test...
    • Spanish: DELE C2...
    • German: Goethe-Zertifikat C2
    • Italian: PLIDA C2...
  • professional exams (GPT-4)
    • SAT
    • GRE
    • Medical Knowledge Self-Assessment Program
    • AP Art/Biology/Calculus/Chemistry/English Literature/Environment/Physics/Psychology/Statistics/History/Government...
    • ...
    • Leetcode (Easy/Medium/Hard)
  • https://github.com/microsoft/AGIEval: MSFT released a benchmark derived from 20 official, public, high-standard qualification exams for humans, such as Gaokao and SAT. AGIEval v1.0 contains 20 tasks, including two cloze tasks (Gaokao-Math-Cloze and MATH) and 18 multi-choice question answering tasks (the rest).
  • Recommendation: Chinese HSK, Gaokao, SAT (https://github.com/microsoft/AGIEval)
  • Purpose: test for basic language and knowledge understanding
  • Evaluation Method: zero-shot multiple choice questions (Prompt+Auto), free-text response (Human)
  • Evaluation Metrics: Accuracy/Scores of single/multiple choice questions
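
The "Prompt+Auto" method for multiple choice questions can be sketched as: format a zero-shot prompt, pull the chosen letter out of the model's reply, and compute accuracy. The prompt layout, the regular expression, and all function names below are assumptions for illustration; AGIEval and the exam reports may use different formats.

```python
import re
from typing import List, Optional

LETTERS = "ABCDEFGH"


def format_prompt(question: str, options: List[str]) -> str:
    """Build a zero-shot multiple-choice prompt (layout is an assumption)."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_choice(model_output: str) -> Optional[str]:
    """Return the first standalone capital letter A-H in the reply, if any.

    Crude heuristic: a reply starting with the article "A" would be misread.
    """
    match = re.search(r"\b([A-H])\b", model_output)
    return match.group(1) if match else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Share of predictions that exactly match the gold letters."""
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


if __name__ == "__main__":
    print(format_prompt("2 + 2 = ?", ["3", "4"]))
    replies = ["B. 4", "I think (C)", "no idea"]
    preds = [extract_choice(r) for r in replies]
    print(preds, accuracy(preds, ["B", "A", "D"]))
```

Free-text responses still need human grading, as noted above; only the letter-choice part is automated here.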

Question Answering and Classification

  • English QA and classification tasks(one-shot setting)
    • Open-domain closed-book question answering tasks: TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), and WebQuestions (Berant et al., 2013)
    • Cloze and completion tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), and StoryCloze (Mostafazadeh et al., 2016)
    • Winograd-style tasks: Winograd (Levesque et al., 2012) and WinoGrande (Sakaguchi et al., 2021)
    • Reading comprehension: SQuAD v2 (Rajpurkar et al., 2018) and RACE (Lai et al., 2017)
    • Common sense reasoning: PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018)
    • SuperGLUE (Wang et al., 2019)
    • Natural language inference: Adversarial NLI (ANLI; Nie et al., 2020)
  • Multilingual QA (one-shot and no-context setting): TyDi QA (Clark et al., 2020)
  • Multilingual toxicity classification
    • Toxicity classification with CivilComments
    • Multilingual toxicity classification with Jigsaw Multilingual
  • https://openai.com/research/truthfulqa: TruthfulQA, an OpenAI benchmark for measuring how models mimic human falsehoods (based on misconceptions and biases they may have), which comprises 817 questions that span 38 categories (with a median of 7 questions and a mean of 21.5 questions per category), including health, law, finance, science and politics.
  • https://yonatanbisk.com/piqa/data/: PIQA, 20,000 QA pairs that are either multiple-choice or true/false questions, focusing on everyday physical interaction and common sense reasoning
  • Recommendation: TruthfulQA (https://github.com/sylinrl/TruthfulQA) | PIQA (https://leaderboard.allenai.org/physicaliqa/submissions/get-started)
  • Purpose: QA is a common test bed; TruthfulQA evaluates falsehoods/hallucinations, and PIQA evaluates common sense and is frequently used
  • Evaluation Method: BLEURT, GPT-Judge/Human | One-Shot Prompting, Accuracy (Binary Choice)

Reasoning(Common Sense, Math)

  • representative reasoning datasets in a few-shot setting: WinoGrande (Sakaguchi et al., 2021), ARC-C (Clark et al., 2018), DROP (Dua et al.,2019), StrategyQA (Geva et al., 2021), CommonsenseQA (CSQA; Talmor et al., 2019), XCOPA (Ponti et al., 2020), and BIG-Bench (BB) Hard (Suzgun et al., 2022).
    • Multilingual common sense reasoning: XCOPA
    • BIG-Bench (BB) Hard: 23 tasks out of 200+ on which LLMs performed below the average human, such as multi-step arithmetic problems (multistep_arithmetic)
  • Mathematical reasoning
    • MATH (Hendrycks et al., 2021), which contains 12,500 problems from high school competitions in 7 mathematics subject areas
    • GSM8K (Cobbe et al., 2021), a dataset of 8,500 grade school math word problems
    • MGSM (Shi et al., 2023), a multilingual version of GSM8K with translations of a subset of examples into ten typologically diverse languages.
  • https://github.com/suzgunmirac/BIG-Bench-Hard: 23 challenging BIG-Bench tasks from 200+ BIG-Bench (https://github.com/google/BIG-bench);
  • Recommendation: BIG-Bench-Hard (https://github.com/suzgunmirac/BIG-Bench-Hard, https://github.com/google/BIG-bench/blob/main/bigbench/evaluate_task.py)
  • Purpose: evaluate the ability of an LLM to reason, to combine multiple pieces of information, and to make logical inferences | According to the analysis in Paper2 (Early Experiments), even GPT-4 can only do simple math right now and makes many arithmetic and calculation mistakes on MATH;
  • Evaluation Method: chain-of-thought (CoT)
  • Evaluation Metrics: Accuracy/Score. Example task (boolean_expressions): evaluate the truth value of a random Boolean expression consisting of Boolean constants (True, False) and basic Boolean operators (and, or, not).
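
The Boolean expression task described above is easy to reproduce in miniature: generate a random expression, compute its truth value as the gold answer, and score model replies by exact match. The generator below is a hypothetical stand-in for illustration and does not reproduce the actual BIG-Bench Hard data or scoring script.

```python
import random
from typing import List


def random_bool_expr(rng: random.Random, depth: int = 2) -> str:
    """Build a random expression from True/False and the operators and/or/not."""
    if depth == 0:
        return rng.choice(["True", "False"])
    kind = rng.choice(["not", "and", "or"])
    if kind == "not":
        return f"not ( {random_bool_expr(rng, depth - 1)} )"
    left = random_bool_expr(rng, depth - 1)
    right = random_bool_expr(rng, depth - 1)
    return f"( {left} {kind} {right} )"


def gold_answer(expr: str) -> str:
    """Evaluate the expression; only safe for strings built by random_bool_expr."""
    return str(eval(expr, {"__builtins__": {}}, {}))


def exact_match_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Case-insensitive exact match after stripping whitespace."""
    if not golds:
        return 0.0
    hits = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predictions, golds))
    return hits / len(golds)


if __name__ == "__main__":
    rng = random.Random(0)
    exprs = [random_bool_expr(rng, depth=3) for _ in range(3)]
    golds = [gold_answer(e) for e in exprs]
    for e, g in zip(exprs, golds):
        print(f"{e} is {g}")
```

With chain-of-thought prompting, the final "True"/"False" would first have to be extracted from the model's reasoning before exact match is applied.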

Coding

  • Code Generation: 3 coding datasets: HumanEval (Chen et al., 2021), MBPP (Austin et al., 2021), and ARCADE (Yin et al., 2022)
  • Multilingual Evaluation: BabelCode (Orlanski et al., 2023), which translates HumanEval into a variety of other programming languages including C++, Java, Go, Haskell and Julia.
  • Leetcode questions(100 for each level)
  • https://github.com/openai/human-eval: HumanEval, a docstring-to-code dataset consisting of 164 coding problems that test various aspects of programming logic and proficiency
  • Recommendation: HumanEval (https://github.com/openai/human-eval), most widely used, Python code;
  • Purpose: Code language models are among the most economically significant and widely-deployed LLMs today
  • Evaluation Method: Execution in a robust security sandbox (https://github.com/openai/human-eval, https://github.com/bigcode-project/bigcode-evaluation-harness)
  • Evaluation Metrics: Pass@1, Pass@K
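
Pass@K is usually computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c samples that pass the unit tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch of that formula follows; the function name and argument checks are illustrative and this is not the human-eval repository's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples that passed, k: budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 of them pass the tests.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The benchmark-level score is the mean of this value over all problems.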

Translation/Multi-lingual

  • WMT21 Experimental Setup: automatic metric using BLEURT, human metric using Multidimensional Quality Metrics (MQM) with hired professional translators
  • Recommendation:
  • Purpose:
  • Evaluation Method:
  • Evaluation Metrics:

Natural language generation

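| Test task | Detailed examples | Samples | Chinese Alpaca-7B | Chinese Alpaca-13B | Chinese Alpaca-Plus-7B |
| --- | --- | --- | --- | --- | --- |
| 💯 Overall average | - | 200 | 65.1 | 70.6 | 👍🏻75.3 |
| Knowledge QA | QA.md | 20 | 66 | 74 | 👍🏻80 |
| Open-ended QA | OQA.md | 20 | 👍🏻79 | 74 | 👍🏻78 |
| Numerical calculation and reasoning | REASONING.md | 20 | 31 | 👍🏻50 | 45 |
| Poetry, literature, philosophy | LITERATURE.md | 20 | 68 | 73 | 👍🏻76 |
| Music, sports, entertainment | ENTERTAINMENT.md | 20 | 68 | 74 | 👍🏻79 |
| Letter and article writing | GENERATION.md | 20 | 76 | 👍🏻81 | 👍🏻81 |
| Text translation | TRANSLATION.md | 20 | 76 | 78 | 👍🏻82 |
| Multi-turn dialogue | DIALOGUE.md | 20 | 👍🏻83 | 73 | 👍🏻84 |
| Code programming | CODE.md | 20 | 57 | 👍🏻64 | 59 |
| Ethics and refusal | ETHICS.md | 20 | 47 | 65 | 👍🏻89 |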
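
The overall average row in the table above appears to be the unweighted mean of the ten per-task scores (each task has 20 samples). The short check below recomputes it from the numbers in the table; the dictionary keys are English labels chosen here for readability.

```python
from typing import Dict, List

# Per-task scores in table order: knowledge QA, open-ended QA, reasoning,
# literature, entertainment, writing, translation, dialogue, code, ethics.
SCORES: Dict[str, List[int]] = {
    "Chinese Alpaca-7B": [66, 79, 31, 68, 68, 76, 76, 83, 57, 47],
    "Chinese Alpaca-13B": [74, 74, 50, 73, 74, 81, 78, 73, 64, 65],
    "Chinese Alpaca-Plus-7B": [80, 78, 45, 76, 79, 81, 82, 84, 59, 89],
}


def overall_average(per_task: List[int]) -> float:
    """Unweighted mean over tasks, rounded to one decimal place."""
    return round(sum(per_task) / len(per_task), 1)


if __name__ == "__main__":
    for model, per_task in SCORES.items():
        print(f"{model}: {overall_average(per_task)}")
```

Because every task has the same number of samples, the unweighted mean over tasks equals the mean over all 200 samples.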

LLM Arena (side by side comparison)
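
Side-by-side arenas such as Chatbot Arena (listed earlier in this thread) rank models with Elo ratings computed from pairwise votes. The sketch below shows the standard Elo update for a single comparison; the K-factor of 32 and the starting rating of 1000 are common defaults assumed here, not values confirmed by this thread.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


if __name__ == "__main__":
    ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical models
    # One human vote: model_x's answer was preferred.
    ratings["model_x"], ratings["model_y"] = update_elo(
        ratings["model_x"], ratings["model_y"], score_a=1.0
    )
    print(ratings)
```

The total rating is conserved in each update, so what matters is the gap between models rather than the absolute numbers.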

@shm007g
Owner Author

shm007g commented Jun 1, 2023

<!DOCTYPE html>
<style>
    body {
      background-image: url('https://boson.ai/bg.jpg');
      background-attachment: fixed;
      background-size: cover;
    }

    .header {
        padding-top: 50px;
        margin: auto;
        width: 60%;
        font-family: Arial,Helvetica,sans-serif;
    }


    h1 {
        font-size: 110px;
        color: #fff;
    }
    h2 {
        font-size: 80px;
        color: #aaa;
    }


    </style>

<html>
<body>

    <div class="header">
    <h1>Large Models for All</h1>
    <h2>We're building something big...</h2>
    <h2>Stay tuned! </h2>
    
    <a href="https://github.com/shm007g/LLaMA-Cult-and-More">
        <picture>
            <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=shm007g/LLaMA-Cult-and-More&type=Date&theme=dark" />
            <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=shm007g/LLaMA-Cult-and-More&type=Date" />
            <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=shm007g/LLaMA-Cult-and-More&type=Date" />

        </picture>
    </a>

    <input type="hidden" id="thanks" name="to" value="https://github.com/boson-ai">
    <h2></h2>

    </div>

</body>
</html>
