First of all congrats on the paper and thanks for providing the code!
In the paper, under 'Zero-shot language-based multi-modal joint retrieval', you mention that combining multiple embeddings improves performance. I am specifically referring to the sentence:
'Similar trends have been observed in other modalities, where each modality has the potential to enhance the performance when combined with other modalities.'
However, the paper does not clarify how the embeddings for the different modalities are actually combined. If, for instance, the input modalities are text, audio, video, and depth, the model would produce an individual embedding for each modality. How do you then combine these embeddings to obtain the results you report?
Do you simply average the different embeddings?
Thanks in advance,
Anthony Mendil.
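One plausible combination scheme the question hints at is to average the per-modality embeddings and re-normalize. The sketch below is a hypothetical illustration of that idea, not the authors' confirmed implementation; `fuse_embeddings` and the toy dimensions are assumptions.

```python
import numpy as np

def fuse_embeddings(embeddings):
    """Average per-modality embeddings for one sample and re-normalize.

    Hypothetical sketch: the paper does not spell out its combination
    method, so this is only one reasonable interpretation.
    """
    # Each entry: one modality's L2-normalized embedding of the same sample.
    stacked = np.stack(embeddings, axis=0)
    fused = stacked.mean(axis=0)
    # Re-normalize so cosine similarities against other embeddings
    # (e.g. text queries) remain on the unit sphere.
    return fused / np.linalg.norm(fused)

# Toy example: four modalities (text, audio, video, depth), embedding dim 8.
rng = np.random.default_rng(0)
modal_embs = [e / np.linalg.norm(e) for e in rng.normal(size=(4, 8))]
joint = fuse_embeddings(modal_embs)
```

The re-normalization step matters only if downstream scoring assumes unit-norm vectors; a plain mean already preserves the ranking behaviour, as discussed further below in the thread.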
Is the code for this available? I cannot seem to locate it in the repository. If not, could you perhaps provide it? For example, for the Infrared+RGB -> Text task.
And is there a specific reason to average the logits rather than the produced embeddings of the modalities directly? Especially since, for the retrieval task, no logits are computed if I understand correctly. How would this be done without the logits?
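For retrieval, the dot-product similarity scores between query and gallery embeddings play the role of logits. A minimal NumPy sketch (names such as `E_ir`, `E_rgb`, and the toy shapes are illustrative, not taken from the repository) suggesting that, under dot-product similarity, averaging the logits and averaging the embeddings are equivalent:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical query embeddings for 5 samples in two modalities
# (infrared and RGB) and a gallery of 10 text embeddings, dim 16.
E_ir = l2norm(rng.normal(size=(5, 16)))
E_rgb = l2norm(rng.normal(size=(5, 16)))
T = l2norm(rng.normal(size=(10, 16)))

# (a) Average the two modalities' similarity scores ("logits").
avg_logits = (E_ir @ T.T + E_rgb @ T.T) / 2

# (b) Average the embeddings first, then score once.
fused = (E_ir + E_rgb) / 2
fused_logits = fused @ T.T

# With dot-product similarity the two are algebraically identical,
assert np.allclose(avg_logits, fused_logits)

# and re-normalizing the fused embedding only rescales each query's
# scores by a positive constant, leaving the retrieval ranking intact.
renorm_logits = l2norm(fused) @ T.T
assert (avg_logits.argsort(-1) == renorm_logits.argsort(-1)).all()
```

So, at least in this simplified setting, the choice between averaging logits and averaging embeddings should not change retrieval rankings; whether the authors' pipeline matches this setting is still worth confirming.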