
[AdaptLLM] How to evaluate the models' performance? #223

Open
yiyiwwang opened this issue May 17, 2024 · 1 comment

Comments

@yiyiwwang

Dear authors, I'm reading your paper Adapting Large Language Models via Reading Comprehension and have a couple of questions.

Could you please tell me how to evaluate your biomedicine/finance/law AdaptLLM models? I know I can probably evaluate the PubMedQA benchmark with lm-evaluation-harness, but how should I evaluate the other datasets, such as ChemProt, ConvFinQA, and so on?

My other question is that there seems to be some repetition in the datasets. For example, the first three items in the ChemProt test set look like this:
[screenshot: first three items of the ChemProt test set]
The contents appear to be nearly the same, although they are not identical. Are they duplicates? Do we need to remove them?

@cdxeve
Contributor

cdxeve commented May 18, 2024

Hi,

Regarding the evaluation

You can implement the evaluation code using the lm-eval-harness framework. We have provided pre-templatized input instructions and output completions for each domain-specific task on Hugging Face:

Biomedicine tasks
Finance tasks
Law tasks

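For reference, here is a minimal sketch of how one might load and inspect one of these pre-templatized task sets with the Hugging Face `datasets` library. The repository id `AdaptLLM/medicine-tasks` and the config name `ChemProt` are assumptions inferred from the links above, so please check the dataset cards for the exact identifiers and field names:

```python
# Minimal sketch: inspect one of the pre-templatized AdaptLLM task sets.
# NOTE: the repo id "AdaptLLM/medicine-tasks" and the config "ChemProt" are
# assumptions; verify the exact names on the Hugging Face dataset cards.
from datasets import load_dataset

ds = load_dataset("AdaptLLM/medicine-tasks", "ChemProt", split="test")
print(ds)     # dataset size and column names
print(ds[0])  # first templatized example (input instruction + expected completion)
```
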
For multiple-choice tasks (including RCT, ChemProt, MQP, USMLE, PubMedQA, Headline, FPB, FiQA_SA, SCOTUS, CaseHold, UnfairToS), you can follow any multiple-choice task (e.g., SIQA) in lm-eval-harness. A helpful guideline is available here.
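
Conceptually, multiple-choice tasks in lm-eval-harness are scored by computing the log-likelihood of each candidate completion conditioned on the prompt and picking the highest-scoring choice (the harness also reports a length-normalized variant). Below is a rough, self-contained sketch of that idea using transformers; it is not the harness's actual code, and the model id, prompt, and choices are placeholders:

```python
# Rough sketch of log-likelihood scoring for one multiple-choice example,
# similar in spirit to how lm-eval-harness handles such tasks.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/medicine-LLM"  # assumed model id; replace with the model you evaluate
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Question: ...\nAnswer:"      # placeholder for a pre-templatized input instruction
choices = [" option A", " option B"]   # placeholder candidate completions

scores = []
for choice in choices:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, t] predicts token t+1, so shift by one and sum log-probs
    # over the choice tokens only (assumes the prompt tokenization is
    # unchanged by concatenation; good enough for a sketch).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target_ids = full_ids[0, 1:]
    token_log_probs = log_probs.gather(1, target_ids.unsqueeze(-1)).squeeze(-1)
    num_choice_tokens = full_ids.shape[1] - prompt_ids.shape[1]
    scores.append(token_log_probs[-num_choice_tokens:].sum().item())

pred = scores.index(max(scores))
print(f"Predicted choice index: {pred}")
```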

For text completion tasks (including ConvFinQA, NER), you can follow the example of text completion tasks like SQuADv2.
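
For the generation-style tasks, the usual pattern is to generate a completion for each templatized input and compare it against the gold answer with a string metric such as exact match or F1 (the exact metric for ConvFinQA/NER may differ, so please check the task definitions). A rough sketch, again with a placeholder model id and toy data:

```python
# Rough sketch of generation-based evaluation with a simple exact-match metric.
# The metric actually used for ConvFinQA / NER may differ; this is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AdaptLLM/finance-LLM"  # assumed model id; replace with the model you evaluate
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def generate_answer(prompt: str, max_new_tokens: int = 64) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the tokens generated after the prompt.
    new_tokens = output_ids[0, inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

# Toy exact-match evaluation over placeholder (input, gold) pairs.
examples = [
    ("Question: ...\nAnswer:", "gold answer 1"),
    ("Question: ...\nAnswer:", "gold answer 2"),
]
correct = sum(generate_answer(p) == g for p, g in examples)
print(f"Exact match: {correct / len(examples):.2%}")
```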

Regarding the repetition in the datasets

Thank you for your careful review! The raw ChemProt dataset we used is from the DAPT repository. We had not noticed this issue before, but the repetition is likely acceptable for our experiments. Therefore, you do not need to remove the repetitions.
