
[AdaptLLM] How to evaluate the models' performance? #223

Open
yiyiwwang opened this issue May 17, 2024 · 1 comment

Comments

@yiyiwwang

Dear authors, I'm reading your paper Adapting Large Language Models via Reading Comprehension and have a couple of questions.

Could you please tell me how to evaluate your biomedicine/finance/law AdaptLLM models? I know I can probably evaluate the PubMedQA benchmark with lm-evaluation-harness, but how should I evaluate the other datasets, such as ChemProt, ConvFinQA, and so on?

My other question is that there seems to be some repetition in the datasets. For example, the first three items in the ChemProt test set look like this:
[screenshot: first three items of the ChemProt test set]
The contents appear to be nearly the same, although they are not identical. Are they duplicates? Do we need to remove them?

@cdxeve
Contributor

cdxeve commented May 18, 2024

Hi,

Regarding the evaluation

You can implement the evaluation code using the lm-eval-harness framework. We have provided pre-templatized input instructions and output completions for each domain-specific task on Hugging Face:

Biomedicine tasks
Finance tasks
Law tasks

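For reference, here is a minimal sketch of how one might load and inspect one of these pre-templatized task sets with the Hugging Face `datasets` library. The repository id `AdaptLLM/medicine-tasks` and the config name `ChemProt` are assumptions inferred from the links above, so please check the dataset cards for the exact identifiers and field names:

```python
# Minimal sketch: inspect one of the pre-templatized AdaptLLM task sets.
# NOTE: the repo id "AdaptLLM/medicine-tasks" and the config "ChemProt" are
# assumptions; verify the exact names on the Hugging Face dataset cards.
from datasets import load_dataset

ds = load_dataset("AdaptLLM/medicine-tasks", "ChemProt", split="test")
print(ds)     # dataset size and column names
print(ds[0])  # first templatized example (input instruction + expected completion)
```
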
For multiple-choice tasks (including RCT, ChemProt, MQP, USMLE, PubMedQA, Headline, FPB, FiQA_SA, SCOTUS, CaseHold, UnfairToS), you can follow any multiple-choice task (e.g., SIQA) in lm-eval-harness. A helpful guideline is available here.
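
Conceptually, multiple-choice tasks in lm-eval-harness are scored by computing the log-likelihood of each candidate completion conditioned on the prompt and picking the highest-scoring choice (the harness also reports a length-normalized variant). Below is a rough, self-contained sketch of that idea using transformers; it is not the harness's actual code, and the model id, prompt, and choices are placeholders:

```python
# Rough sketch of log-likelihood scoring for one multiple-choice example,
# similar in spirit to how lm-eval-harness handles such tasks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/medicine-LLM"  # assumed model id; replace with the model you evaluate
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: ...\nAnswer:"      # placeholder for a pre-templatized input instruction
choices = [" option A", " option B"]   # placeholder candidate completions

scores = []
for choice in choices:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, t] predicts token t+1, so shift by one and sum log-probs
    # over the choice tokens only (assumes the prompt tokenization is
    # unchanged by concatenation; good enough for a sketch).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target_ids = full_ids[0, 1:]
    token_log_probs = log_probs.gather(1, target_ids.unsqueeze(-1)).squeeze(-1)
    num_choice_tokens = full_ids.shape[1] - prompt_ids.shape[1]
    scores.append(token_log_probs[-num_choice_tokens:].sum().item())

pred = scores.index(max(scores))
print(f"Predicted choice index: {pred}")
```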

For text completion tasks (including ConvFinQA, NER), you can follow the example of text completion tasks like SQuADv2.
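
For the generation-style tasks, the usual pattern is to generate a completion for each templatized input and compare it against the gold answer with a string metric such as exact match or F1 (the exact metric for ConvFinQA/NER may differ, so please check the task definitions). A rough sketch, again with a placeholder model id and toy data:

```python
# Rough sketch of generation-based evaluation with a simple exact-match metric.
# The metric actually used for ConvFinQA / NER may differ; this is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/finance-LLM"  # assumed model id; replace with the model you evaluate
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def generate_answer(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0, inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Toy exact-match evaluation over placeholder (input, gold) pairs.
examples = [
    ("Question: ...\nAnswer:", "gold answer 1"),
    ("Question: ...\nAnswer:", "gold answer 2"),
]
correct = sum(generate_answer(p) == g for p, g in examples)
print(f"Exact match: {correct / len(examples):.2%}")
```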

Regarding the repetition in the datasets

Thank you for your careful review! The raw ChemProt dataset we used is from the DAPT repository. We had not noticed this issue before, but the repetition is likely acceptable for our experiments. Therefore, you do not need to remove the repetitions.
