📢 Proposal: Train RLHF using CarperAI Trlx 🤖
We propose to train a Reinforcement Learning from Human Feedback (RLHF) model using CarperAI Trlx, a distributed training framework designed to fine-tune large language models with reinforcement learning. Our goal is to improve the conversational abilities of language models and create a chatbot that can better engage with humans.
🚀 Methods:
We will train an initial model using supervised fine-tuning, where human AI trainers will provide conversations in which they play both sides: the user and an AI assistant. We will give the trainers access to model-written suggestions to help them compose their responses. We will mix this new dialogue dataset with the InstructGPT dataset, which we will transform into a dialogue format.
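A minimal sketch of what this supervised fine-tuning step could look like with Hugging Face `transformers`; the base model (`gpt2`), the dialogue format, and all hyperparameters below are illustrative assumptions, not fixed choices of this proposal:

```python
# Supervised fine-tuning sketch: causal LM training on trainer-written dialogues.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical trainer-written dialogues, flattened into single training strings.
dialogues = [
    "User: How do I sort a list in Python?\nAssistant: Use sorted(my_list) ...",
]

def tokenize(example):
    out = tokenizer(example["text"], truncation=True,
                    max_length=512, padding="max_length")
    out["labels"] = out["input_ids"].copy()  # causal LM: predict the next token
    return out

dataset = Dataset.from_dict({"text": dialogues}).map(
    tokenize, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-model", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
).train()
```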
To create a reward model for reinforcement learning, we will collect comparison data consisting of two or more model responses ranked by quality. To collect this data, we will take conversations that AI trainers had with the chatbot: we will randomly select a model-written message, sample several alternative completions, and have AI trainers rank them. Using this reward model, we will fine-tune the model with Proximal Policy Optimization, and we will perform several iterations of this process.
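A sketch of the PPO step with trlx. The high-level `trlx.train(...)` call with a `reward_fn` follows the pattern in trlx's own examples, but exact signatures may differ between versions; the reward model path (`reward-model`) and the SFT checkpoint path (`sft-model`) are placeholders standing in for artifacts from the earlier steps:

```python
# PPO fine-tuning sketch with CarperAI trlx.
import torch
import trlx
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_tokenizer = AutoTokenizer.from_pretrained("reward-model")   # hypothetical path
reward_model = AutoModelForSequenceClassification.from_pretrained("reward-model")

def reward_fn(samples, **kwargs):
    # Score each full dialogue with the trained reward model; higher = preferred.
    inputs = rm_tokenizer(samples, return_tensors="pt",
                          padding=True, truncation=True)
    with torch.no_grad():
        scores = reward_model(**inputs).logits.squeeze(-1)
    return scores.tolist()

trainer = trlx.train(
    "sft-model",   # start PPO from the supervised fine-tuned checkpoint
    reward_fn=reward_fn,
    prompts=["User: How do I sort a list in Python?\nAssistant:"],
)
```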
💽 Datasets:
https://huggingface.co/datasets/Anthropic/hh-rlhf
https://huggingface.co/datasets/HuggingFaceH4/helpful-anthropic-raw
https://www.surgehq.ai/datasets/instructgpt-style-dataset
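The Anthropic hh-rlhf dataset already ships as comparison pairs (`chosen` / `rejected` columns), so it plugs directly into reward-model training. Below is a sketch of the standard InstructGPT-style pairwise ranking loss, `-log sigmoid(r_chosen - r_rejected)`, on one such pair; the `gpt2`-based scoring model is a placeholder assumption:

```python
# Pairwise ranking loss sketch for the reward model on hh-rlhf comparison data.
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

data = load_dataset("Anthropic/hh-rlhf", split="train")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

def score(texts):
    # Scalar reward per sequence from the classification head.
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=512)
    return model(**inputs).logits.squeeze(-1)

example = data[0]
r_chosen = score([example["chosen"]])
r_rejected = score([example["rejected"]])

# The reward model learns to rank the preferred response higher.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```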