[Question] How to have no preset values sent into .compute() #570

Open · alvations opened this issue Apr 8, 2024 · 0 comments

alvations commented Apr 8, 2024
We have a use case, https://huggingface.co/spaces/alvations/llm_harness_mistral_arc/blob/main/llm_harness_mistral_arc.py, where the evaluate.Metric takes no preset feature input types. Our llm_harness_mistral_arc/llm_harness_mistral_arc.py looks something like this:

import evaluate
import datasets
import lm_eval

# Placeholders for the docstring constants defined elsewhere in the Space.
_DESCRIPTION = ""
_KWARGS_DESCRIPTION = ""


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # TODO: Specify the evaluate.EvaluationModuleInfo object.
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # This defines the format of each prediction and reference.
            features={},
        )

    def _compute(self, pretrained=None, tasks=[]):
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks,
            num_fewshot=0,
        )
        results = {}
        for task in outputs['results']:
            results[task] = {'acc': outputs['results'][task]['acc,none'],
                             'acc_norm': outputs['results'][task]['acc_norm,none']}
        return results

And our expected user behaviour is something like, [in]:

import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])

And the expected output, as per our tests.py (https://huggingface.co/spaces/alvations/llm_harness_mistral_arc/blob/main/tests.py), is [out]:

{'arc_easy': {'acc': 0.8131313131313131, 'acc_norm': 0.7680976430976431}}

But evaluate.Metric.compute() somehow expects a default batch, and module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"]) throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-bd94e5882ca5> in <cell line: 1>()
----> 1 module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2",
      2                tasks=["arc_easy"])

2 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in _get_all_cache_files(self)
    309         if self.num_process == 1:
    310             if self.cache_file_name is None:
--> 311                 raise ValueError(
    312                     "Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` "
    313                     "at least once before calling `compute`."

ValueError: Evaluation module cache file doesn't exist. Please make sure that you call `add` or `add_batch` at least once before calling `compute`.

Q: Is it possible for .compute() to expect no features at all?
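
(One workaround we've been considering, sketched below, is to override compute() itself so that the add/add_batch and cache-file bookkeeping is never touched. We're not sure whether bypassing the Arrow writer like this is an intended use of the API, so treat it only as a sketch.)

import evaluate
import lm_eval


class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # Same empty-features MetricInfo as in the snippet above.
        return evaluate.MetricInfo(
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            features={},
        )

    def compute(self, pretrained=None, tasks=None, **kwargs):
        # Skip EvaluationModule.compute()'s add_batch / cache-file machinery
        # and call our own _compute() directly with plain keyword arguments.
        return self._compute(pretrained=pretrained, tasks=tasks or [])

    def _compute(self, pretrained=None, tasks=None):
        # Same lm_eval call as in the snippet above.
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks or [],
            num_fewshot=0,
        )
        return {
            task: {
                "acc": outputs["results"][task]["acc,none"],
                "acc_norm": outputs["results"][task]["acc_norm,none"],
            }
            for task in outputs["results"]
        }

With that override, module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"]) goes straight to _compute(), but it gives up the usual add/add_batch workflow entirely, which is why we'd prefer a supported way to declare "no features".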

I've also tried this, but somehow evaluate.Metric.compute is still looking for some sort of predictions variable:

@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class llm_harness_mistral_arc(evaluate.Metric):
    def _info(self):
        # TODO: Specify the evaluate.EvaluationModuleInfo object.
        return evaluate.MetricInfo(
            # This is the description that will appear on the modules page.
            module_type="metric",
            description="",
            citation="",
            inputs_description="",
            # This defines the format of each prediction and reference.
            features=[
                datasets.Features(
                    {
                        "pretrained": datasets.Value("string", id="sequence"),
                        "tasks": datasets.Sequence(datasets.Value("string", id="sequence"), id="tasks"),
                    }
                )
            ],
        )

    def _compute(self, pretrained, tasks):
        outputs = lm_eval.simple_evaluate(
            model="hf",
            model_args={"pretrained": pretrained},
            tasks=tasks,
            num_fewshot=0,
        )
        results = {}
        for task in outputs['results']:
            results[task] = {'acc': outputs['results'][task]['acc,none'],
                             'acc_norm': outputs['results'][task]['acc_norm,none']}
        return results

then:

import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")
module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2", tasks=["arc_easy"])

[out]:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-36-bd94e5882ca5> in <cell line: 1>()
----> 1 module.compute(pretrained="mistralai/Mistral-7B-Instruct-v0.2",
      2                tasks=["arc_easy"])

3 frames
/usr/local/lib/python3.10/dist-packages/evaluate/module.py in _infer_feature_from_example(self, example)
    606             f"Predictions and/or references don't match the expected format.\n"
    607             f"Expected format:\n{feature_strings},\n"
--> 608             f"Input predictions: {summarize_if_long_list(example['predictions'])},\n"
    609             f"Input references: {summarize_if_long_list(example['references'])}"
    610         )

KeyError: 'predictions'
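
(If I'm reading the traceback right, the KeyError itself comes from the error-message formatting in _infer_feature_from_example, which assumes predictions/references keys; the underlying mismatch seems to be that pretrained/tasks are routed through add_batch, which wants batched, per-example lists matching the declared features. A sketch of what the call might then have to look like, under that assumption:)

import evaluate

module = evaluate.load("alvations/llm_harness_mistral_arc")

# Hypothetical call shape: one "example" per row, so every declared feature is
# passed as a list. _compute() would then receive those lists back
# (pretrained=["mistralai/Mistral-7B-Instruct-v0.2"], tasks=[["arc_easy"]])
# and would need to unpack them instead of taking scalars.
module.compute(
    pretrained=["mistralai/Mistral-7B-Instruct-v0.2"],
    tasks=[["arc_easy"]],
)

That seems rather awkward for a metric whose inputs are a model name and a task list rather than per-example predictions, hence the question above.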