Feeding the model separate examples instead of one continuous block of text #17

Open
CupOfGeo opened this issue Oct 26, 2021 · 1 comment

@CupOfGeo

Hello, I'm interested in adding this feature: a function in text2csv.py that takes a folder of texts, and then in run_clm.py padding and truncating the examples instead of using the group_text function.

@CupOfGeo

I'm using songs for my data. The line and newline spacing is important, and I would like the songs to stay separate during fine-tuning so the end of one song isn't the start of another.
I have it create the CSVs so that each row is a song, but when group_text is applied it concatenates them all and makes blocks of 1024 tokens. I'm looking into adding DataCollatorWithPadding, but I'm not having much luck at the moment.
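To make the pad-and-truncate idea concrete, here is a minimal sketch in plain Python of what would replace the grouping step: each tokenized song becomes exactly one row of fixed length. The function name `pad_and_truncate`, the small `block_size`, and the choice of 50256 as the padding id are assumptions for illustration and are not taken from run_clm.py. The label value -100 is the index that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_and_truncate(
    examples: List[List[int]],
    block_size: int,
    pad_token_id: int,
) -> Dict[str, List[List[int]]]:
    """Turn each tokenized example into exactly one row of length block_size.

    Hypothetical helper: unlike a grouping step, examples are never concatenated.
    """
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
    }
    for ids in examples:
        ids = ids[:block_size]            # truncate examples that are too long
        n_pad = block_size - len(ids)     # padding needed for short examples
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Padded positions get IGNORE_INDEX so they are excluded from the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch


# Two toy "songs" given as token ids; 50256 as pad id is an assumption.
songs = [[11, 12, 13], [21, 22, 23, 24, 25, 26]]
batch = pad_and_truncate(songs, block_size=4, pad_token_id=50256)
```

The trade-off is that anything past `block_size` in a long song is dropped, and short songs spend part of each row on padding.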

I also noticed that it uses <|endoftext|> as both bos_token and eos_token. I'm wondering how that affects things, and whether what I'm doing is even needed or whether I should just put these tokens between my examples.
From the config.json in the model:
"bos_token_id": 50256,
"embed_dropout": 0,
"eos_token_id": 50256,

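The alternative raised above, keeping the concatenation but putting the end-of-text token between examples, could be sketched as follows. This is plain Python for illustration only: `join_with_eos` is a hypothetical name, and dropping the incomplete final block is an assumption about how a grouping step typically behaves, not a description of the group_text function in this repository. The id 50256 comes from the config.json excerpt, where it serves as both bos_token_id and eos_token_id, so a single token per boundary is enough.

```python
from typing import List

EOS_TOKEN_ID = 50256  # from the config.json excerpt; also the bos_token_id


def join_with_eos(examples: List[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate examples with an EOS after each one, then cut fixed-size blocks.

    Hypothetical helper: the trailing partial block is dropped (an assumption).
    """
    stream: List[int] = []
    for ids in examples:
        stream.extend(ids)
        stream.append(EOS_TOKEN_ID)  # marks where one example ends
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Three toy "songs" given as token ids.
songs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
blocks = join_with_eos(songs, block_size=4)
```

With this approach a block can still contain the end of one song and the start of the next, but the boundary is marked explicitly by the separator token, and no positions are spent on padding.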