Is left-padding in PPO strictly necessary? #232
I noticed that RemoteExperienceMaker left-pads the input sequences even when using vllm for generation (OpenRLHF/openrlhf/trainer/ppo_utils/experience_maker.py, line 346 at dcd379a).

I can see that a few lines down, self.actor.process_sequences() assumes this left-padding, as it calculates an action mask in a way that hinges on all the inputs terminating at the same index. Other than that, are there any other parts of the code that assume that inputs are left-padded, or that inputs always terminate at the same index?

If not, I'd like to open a PR that skips the left-padding and calculates the action mask directly - the left-padding can be inefficient, and action-masking this way doesn't generalise to multi-turn conversations.
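To make the setup concrete, here is a minimal sketch of what left-padding the prompts to a common length looks like - a hand-written illustration with an assumed pad id of 0, not the actual code from experience_maker.py:

```python
import torch
import torch.nn.functional as F

pad_token_id = 0  # assumed pad id, for illustration only

# Two prompts of different lengths.
prompts = [torch.tensor([11, 12]), torch.tensor([13, 14, 15, 16])]

# Left-pad every prompt to the length of the longest one, so that all
# prompts end at the same index and generation starts at the same column.
max_len = max(p.size(0) for p in prompts)
batch = torch.stack(
    [F.pad(p, (max_len - p.size(0), 0), value=pad_token_id) for p in prompts]
)
# batch:
# tensor([[ 0,  0, 11, 12],
#         [13, 14, 15, 16]])
```

Because every prompt then terminates at index max_len - 1, everything generated afterwards starts at the same column for the whole batch, which is what lets process_sequences treat the prompt/response boundary as fixed.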
Comments

Yes, this is necessary to reduce the padding of the training samples. Prompt samples with left padding allow us to remove PAD tokens on both sides and then pad dynamically depending on the batch of training samples.
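If I read that right, the dynamic part is something like the following sketch - trim PAD from both ends of each stored sample, then re-pad only to the longest sample in the current training batch (the function name and pad id are assumptions, not the actual OpenRLHF code):

```python
import torch
import torch.nn.functional as F

pad_token_id = 0  # assumed pad id, for illustration only

def dynamic_repad(rows: torch.Tensor) -> torch.Tensor:
    # Strip PAD from both ends of each row (assumes the pad id does not
    # occur among the real tokens), then right-pad only to the longest
    # trimmed row in this batch.
    trimmed = []
    for row in rows:
        keep = (row != pad_token_id).nonzero(as_tuple=True)[0]
        start, end = keep[0].item(), keep[-1].item() + 1
        trimmed.append(row[start:end])
    max_len = max(t.size(0) for t in trimmed)
    return torch.stack(
        [F.pad(t, (0, max_len - t.size(0)), value=pad_token_id) for t in trimmed]
    )

rows = torch.tensor([
    [0, 0, 11, 12, 21, 22, 23, 0],
    [0, 0, 0, 13, 14, 24, 0, 0],
])
print(dynamic_repad(rows))
# tensor([[11, 12, 21, 22, 23],
#         [13, 14, 24,  0,  0]])
```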
Ah, you mean …
This will lead to a lot of pads in the middle.
Ah, no, to be clear, what I mean is the following. Currently the prompts are left-padded and the responses right-padded, so there are pads on both ends of every packed sequence. What I have in mind is instead to pack each prompt directly against its response and pad only on the right (see the sketch below). So, less padding overall, and no padding in the middle. The only thing that is now a little different is that the index where the prompt stops and the response starts isn't the same for each sequence - will that break anything?
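For concreteness, a sketch of the proposed layout and of how the action mask could be computed from per-sequence prompt lengths - all tensors and names here are made up for illustration:

```python
import torch

pad_token_id = 0  # assumed pad id, for illustration only

# Proposed layout: each prompt packed directly against its response,
# with padding only on the right - no pads in the middle.
sequences = torch.tensor([
    [11, 12, 21, 22, 23, 0, 0],   # prompt [11, 12], response [21, 22, 23]
    [13, 14, 15, 16, 24, 25, 0],  # prompt [13..16], response [24, 25]
])
prompt_lens = torch.tensor([2, 4])  # the boundary now differs per sequence

# Action mask: True exactly on response tokens, i.e. after the
# per-sequence prompt boundary and before the right padding.
positions = torch.arange(sequences.size(1)).unsqueeze(0)  # shape (1, T)
action_mask = (positions >= prompt_lens.unsqueeze(1)) & (sequences != pad_token_id)
# tensor([[False, False,  True,  True,  True, False, False],
#         [False, False, False, False,  True,  True, False]])
```

Anything that currently assumes a fixed boundary (like the fixed-slice logic in process_sequences) would then have to take prompt_lens as an input instead.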
they are …
That's not what I am proposing though! What I mean is, if I return it without the pads in the middle from …