the difference of your bleu and sacrebleu #558

Open

cooper12121 opened this issue Mar 7, 2024 · 1 comment

@cooper12121
What is the difference between your package's BLEU implementation and sacrebleu's? I computed the score both ways and got different results. The input is Chinese, and I passed sacrebleu's zh tokenizer.
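
For context, here is a minimal sketch of the comparison I mean, with placeholder Chinese sentences (not my real data); it assumes evaluate's bleu accepts a tokenizer callable (its docs list a `tokenizer` argument) and that sacrebleu's zh tokenizer is importable as below:

```python
import evaluate
from sacrebleu.metrics import BLEU
from sacrebleu.tokenizers.tokenizer_zh import TokenizerZh

# Placeholder Chinese example.
predictions = ["猫坐在垫子上。"]
references = [["猫坐在垫子上。"]]

# evaluate: pass sacrebleu's zh tokenizer as the tokenizer callable.
hf_bleu = evaluate.load("bleu")
print(hf_bleu.compute(predictions=predictions, references=references,
                      tokenizer=TokenizerZh()))

# sacrebleu: built-in zh tokenizer; note its [num_refs][num_sentences] reference layout.
sb_bleu = BLEU(tokenize="zh")
print(sb_bleu.corpus_score(predictions, [[refs[0] for refs in references]]))
```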

@shenxiangzhuang

I believe there are some differences between this implementation and sacrebleu's. Actually, testing with English shows the same problem.

evaluate

```python
import evaluate

predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"],
]

bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references, smooth=False, max_order=4)
print(results)
```

got results:

```
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
```
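
As far as I can tell, one concrete difference is the brevity-penalty bookkeeping: evaluate wraps the Google nmt compute_bleu script, which counts the shortest reference per sentence (hence reference_length = 6 above: min(4, 3) + 3), while sacrebleu follows mteval and picks the reference length closest to each hypothesis. A minimal sketch of the two conventions, using hypothetical helpers rather than either library's API:

```python
# Hypothetical helpers contrasting the two reference-length conventions;
# token counts are taken from the example above.
def ref_len_shortest(hyp_len, ref_lens):
    # nmt/evaluate style: always count the shortest reference
    return min(ref_lens)

def ref_len_closest(hyp_len, ref_lens):
    # sacrebleu/mteval style: the length closest to the hypothesis,
    # ties broken in favor of the shorter reference
    return min(ref_lens, key=lambda n: (abs(n - hyp_len), n))

hyp_lens = [4, 3]         # "hello there general kenobi", "foo bar foobar"
ref_lens = [[4, 3], [3]]  # per-sentence reference token counts

print(sum(ref_len_shortest(h, r) for h, r in zip(hyp_lens, ref_lens)))  # 6
print(sum(ref_len_closest(h, r) for h, r in zip(hyp_lens, ref_lens)))   # 7
```

That alone changes the length ratio (7/6 vs 7/7) and therefore when the brevity penalty kicks in.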

sacrebleu

```python
from sacrebleu.metrics import BLEU

predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"],
]

bleu = BLEU(smooth_method="none", max_ngram_order=4, tokenize='13a')
results = bleu.corpus_score(predictions, references)
print(results)
```

got results:

```
BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 4 ref_len = 4)
```
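
Note hyp_len = 4 here: that is the token count of the first hypothesis alone, which suggests only one sentence pair was actually scored. sacrebleu's corpus_score expects references transposed relative to evaluate, i.e. shape [num_refs][num_sentences], so the nested list above is read as two reference streams of unequal length and, in the version used here apparently, truncated to a single sentence instead of raising an error. A sketch of the call I believe is equivalent, padding the missing second reference with None (which sacrebleu 2.x accepts for a variable number of references, if I read the docs right):

```python
from sacrebleu.metrics import BLEU

predictions = ["hello there general kenobi", "foo bar foobar"]
# Transposed layout: references[j][i] is the j-th reference for sentence i.
# The second sentence has only one reference, so the second slot is padded
# with None (assumed to be supported in sacrebleu 2.x).
references = [
    ["hello there general kenobi", "foo bar foobar"],
    ["hello there !", None],
]

bleu = BLEU(smooth_method="none", max_ngram_order=4, tokenize='13a')
print(bleu.corpus_score(predictions, references))
```

Scored this way, both sentences count toward hyp_len, and whatever gap remains against evaluate's numbers should come down to implementation details such as the reference-length convention sketched above.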
