Regarding adding subtitles, I still have the following questions:
If subtitles are not used for training and nothing else in the model architecture or design is changed, in other words, if a video is represented only by the sequence of <image_i> tokens, can the model understand long videos? Does it have needle-in-a-haystack ability, or can it answer detailed questions about an hour-long video?
If subtitles are not added, there is obviously an order-of-magnitude difference between the number of input visual tokens and the number of text tokens. Will such an imbalance degrade the model's performance?
After training with subtitles, can the model run inference on videos that have no subtitles? If so, how should inference be run, and how should the subtitle input be set up?
Thanks.
Thank you for your great work.