Run Langchain Evaluations on data in Langfuse: Why is the prompt not considered, and could this lead to evaluation flaws? #1649
Unanswered
pengpengIlove
asked this question in Support
Replies: 1 comment 1 reply
-
hi @pengpengIlove, happy to help.
-
Text Description:
In reviewing the introduction section of the tool, we noticed that only the input and output parameters are mentioned, with no information about the prompt content. Why is this the case, and could this omission lead to flaws in the evaluation process? Resolving this is urgent for our company.
Code Section:
The provided code defines a function execute_eval_and_score() that iterates through a collection named generations, performing an evaluation for each element. The evaluation criteria are determined by the EVAL_TYPES dictionary, excluding the type named "hallucination". For each criterion, it uses the get_evaluator_for_key function to obtain the corresponding evaluator and calls its evaluate_strings method to evaluate the output of each generation. The evaluation results are printed, and the scores along with the reasoning are logged via the langfuse.score method, as shown in the sketch below.
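For reference, here is a minimal sketch of what the described function could look like. It assumes the Langfuse v2 Python SDK (which exposes langfuse.score) and LangChain's built-in criteria evaluators; the contents of EVAL_TYPES, the body of get_evaluator_for_key, and the fields on each generation object (input, output, trace_id, id) are assumptions for illustration, not the exact cookbook code.

```python
from langfuse import Langfuse
from langchain.evaluation import load_evaluator

# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set in the environment.
langfuse = Langfuse()

# Hypothetical criteria toggles; "hallucination" is excluded from this loop.
EVAL_TYPES = {"conciseness": True, "relevance": True, "hallucination": True}


def get_evaluator_for_key(key: str):
    # Assumption: a LangChain criteria evaluator is loaded per key
    # (uses the default LLM, so an OPENAI_API_KEY would also be needed).
    return load_evaluator("criteria", criteria=key)


def execute_eval_and_score(generations):
    for generation in generations:
        # Every enabled criterion except "hallucination".
        criteria = [
            k for k, enabled in EVAL_TYPES.items()
            if enabled and k != "hallucination"
        ]
        for criterion in criteria:
            evaluator = get_evaluator_for_key(criterion)
            # Only the generation's input and output are passed to the
            # evaluator; the prompt template itself never reaches it,
            # which is the omission the question asks about.
            result = evaluator.evaluate_strings(
                prediction=str(generation.output),
                input=str(generation.input),
            )
            print(result)
            # Log the score and reasoning back to Langfuse.
            langfuse.score(
                name=criterion,
                trace_id=generation.trace_id,
                observation_id=generation.id,
                value=result["score"],
                comment=result["reasoning"],
            )
```

Note that evaluate_strings receives only the generation's input and output; since the prompt template is not part of the call, a criterion that depends on the prompt's instructions cannot take them into account.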