Dear authors, I'm reading your paper Adapting Large Language Models via Reading Comprehension and have a couple of questions.
Could you please tell me how to evaluate your biomedicine/finance/law AdaptLLM models? I know I can probably evaluate the PubMedQA benchmark with lm-evaluation-harness, but how do I evaluate the other datasets, such as ChemProt, ConvFinQA, and so on?
Another question: there seems to be some repetition in the datasets. For example, the first three items in the ChemProt test set appear to have the same content, even though they are not strictly identical. Are they repetitions? Do we need to remove them?
You can implement the evaluation code using the lm-eval-harness framework. We have provided pre-templatized input instructions and output completions for each domain-specific task on Hugging Face:
For multiple-choice tasks (including RCT, ChemProt, MQP, USMLE, PubMedQA, Headline, FPB, FiQA_SA, SCOTUS, CaseHold, UnfairToS), you can follow any multiple-choice task (e.g., SIQA) in lm-eval-harness. A helpful guideline is available here.
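To make this concrete, here is a minimal sketch of what a ChemProt task definition might look like, modeled on the SIQA task and assuming the pre-v0.4 Python task API of lm-eval-harness; the dataset path and the field names (`input`, `options`, `gold_index`) are hypothetical and should be adapted to the schema of the released Hugging Face datasets:

```python
# Minimal sketch of a multiple-choice task for lm-evaluation-harness,
# patterned on SIQA-style tasks in the pre-v0.4 Python task API.
# DATASET_PATH and the document field names below are assumptions;
# check them against the actual AdaptLLM dataset on Hugging Face.
from lm_eval.base import MultipleChoiceTask


class ChemProt(MultipleChoiceTask):
    VERSION = 0
    DATASET_PATH = "AdaptLLM/ChemProt"  # hypothetical HF dataset path

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return False

    def has_test_docs(self):
        return True

    def test_docs(self):
        return map(self._process_doc, self.dataset["test"])

    def _process_doc(self, doc):
        # MultipleChoiceTask expects the query string, the candidate
        # completions, and the index of the correct choice; the harness
        # then scores each choice by log-likelihood.
        return {
            "query": doc["input"],      # assumed field name
            "choices": doc["options"],  # assumed field name
            "gold": doc["gold_index"],  # assumed field name
        }

    def doc_to_text(self, doc):
        return doc["query"]
```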
For text completion tasks (including ConvFinQA, NER), you can follow the example of text completion tasks like SQuADv2.
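Likewise, here is a rough sketch of a ConvFinQA-style completion task under the same assumed pre-v0.4 API, again with a hypothetical dataset path and field names (`input`, `answer`) and a simple exact-match metric standing in for whatever metric the paper actually reports:

```python
# Minimal sketch of a text-completion task, patterned after the
# SQuADv2 task in pre-v0.4 lm-evaluation-harness. Dataset path,
# field names, and the exact-match metric are assumptions.
from lm_eval.base import Task, rf
from lm_eval.metrics import mean


class ConvFinQA(Task):
    VERSION = 0
    DATASET_PATH = "AdaptLLM/ConvFinQA"  # hypothetical HF dataset path

    def has_training_docs(self):
        return False

    def has_validation_docs(self):
        return False

    def has_test_docs(self):
        return True

    def test_docs(self):
        return self.dataset["test"]

    def doc_to_text(self, doc):
        return doc["input"]

    def doc_to_target(self, doc):
        return " " + doc["answer"]

    def construct_requests(self, doc, ctx):
        # Generate greedily until a newline. Depending on the harness
        # version, the stop argument may instead need the dict form
        # {"until": ["\n"]}.
        return rf.greedy_until(ctx, ["\n"])

    def process_results(self, doc, results):
        # Score the generated completion; plain exact match is used
        # here only as a simple stand-in metric.
        pred = results[0].strip()
        return {"em": float(pred == doc["answer"].strip())}

    def aggregation(self):
        return {"em": mean}

    def higher_is_better(self):
        return {"em": True}
```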
Regarding the repetition in the datasets:
Thank you for your careful review! The raw ChemProt dataset we used is from the DAPT repository. We had not noticed this issue before, but the repetition is likely acceptable for our experiments. Therefore, you do not need to remove the repetitions.