May 7, 2026
BLEU Score - Machine Translation Metric.
How the BLEU score is used to evaluate machine translations. An old but fundamental metric.
Pieter Bruegel the Elder (1563), The Tower of Babel.
A fundamental metric commonly used in machine translation evaluation is the BLEU score. It was introduced at IBM in 2002 by Papineni et al.
In 2002, as translation tools were being developed (Google Translate, for example, launched in 2006), scientists wanted a way to evaluate whether a machine translation from one language to another was “accurate” compared to how a human would perform the task.
Human evaluation was expensive, so Papineni et al. developed a metric that was simple and accurate, with scores between 0 and 1 (higher is better): the BLEU score.
Given a French Sentence: Je mange une pomme
A human gives a “perfect” translation in English (Reference): I am eating an apple.
A computer gives a translation in English (Candidate): I am an eating apple.
How do we know how well the machine did? Normally a human would evaluate the machine's output and give it a rating on its translation performance. The BLEU metric automates that rating as follows:
It compares the Candidate to the Reference using n-grams (runs of n consecutive words), specifically 1-grams, 2-grams, 3-grams, and 4-grams.
Here are all the 1-grams in the candidate: {“I”, “am”, “an”, “eating”, “apple”}
Here are all the 2-grams in the candidate: {“I am”, “am an”, “an eating”, “eating apple”}
Here are all the 3-grams in the candidate: {“I am an”, “am an eating”, “an eating apple”}
Here are all the 4-grams in the candidate: {“I am an eating”, “am an eating apple”}
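The lists above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the sentence is tokenized by splitting on whitespace; the helper name `ngrams` is my own and not something from the paper.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order of appearance."""
    # range() is empty when the sentence has fewer than n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


candidate = "I am an eating apple".split()

for n in range(1, 5):
    # Join each tuple back into a string so the output is easy to read
    print(n, [" ".join(gram) for gram in ngrams(candidate, n)])
```

A sentence with k words has k - n + 1 n-grams, which is why the lists shrink from 5 entries to 2 as n grows.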
Next, we count how many of the candidate's n-grams (for n = 1 to 4) also appear in the reference:
1-grams: all 5 match = 5/5 = 1.0
2-grams: 1 match (“I am”) = 1/4 = 0.25
3-grams: 0 match = 0/3 = 0.0
4-grams: 0 match = 0/2 = 0.0
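Counting matches by hand is error-prone, so here is a rough sketch for checking the numbers. It assumes whitespace tokenization and a single reference; `ngrams` and `modified_precision` are names I made up for illustration. The `min(...)` call is the clipping step mentioned under Additional Comments.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return None  # the candidate is too short to have any n-grams
    # Clipping: an n-gram is credited at most as often as the reference has it
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


candidate = "I am an eating apple".split()
reference = "I am eating an apple".split()

for n in range(1, 5):
    print(f"{n}-gram precision:", modified_precision(candidate, reference, n))
```

Returning `None` when there are no n-grams is just my choice for this sketch; it keeps "no n-grams" distinct from "no matches".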
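This is expressed in the following formulation, where n is the n-gram order, C is the candidate, Count(g) is the number of times the n-gram g occurs in C, and Count_clip(g) is that count capped at the number of times g occurs in the reference:
$$p_n = \frac{\sum_{g \in \text{n-grams}(C)} \text{Count}_{\text{clip}}(g)}{\sum_{g \in \text{n-grams}(C)} \text{Count}(g)}$$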
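Finally, we calculate the geometric mean of the four precisions and multiply it by the Brevity Penalty, BP (I’ll explain this later).
This is expressed as the following formulation, with N = 4 and uniform weights $w_n = 1/N$:
$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$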
We express the formulation in log-sum form instead of product form because multiplying many small precisions together risks floating-point underflow, while summing their logarithms is easier for computers to calculate accurately.
So in this case, the BLEU score is 0: the 3-gram and 4-gram precisions are 0, and a geometric mean with any zero factor is 0. That means the computer did a terrible job translating.
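To make the log-sum form concrete, here is a small sketch. It assumes uniform weights (1/N each) and treats any zero precision as an immediate score of 0, since log(0) is undefined. The function name `bleu_from_precisions` and the sample precision values are hypothetical, picked only for illustration.

```python
import math


def bleu_from_precisions(precisions, brevity_penalty=1.0):
    """Geometric mean of the precisions, in log-sum form, times the penalty."""
    if not precisions or any(p == 0 for p in precisions):
        return 0.0  # one zero precision zeroes the whole geometric mean
    weight = 1.0 / len(precisions)  # uniform weights
    log_mean = sum(weight * math.log(p) for p in precisions)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical precisions, only to exercise the function
print(bleu_from_precisions([0.9, 0.7, 0.5, 0.3]))
# A zero anywhere in the list collapses the score to 0
print(bleu_from_precisions([0.9, 0.7, 0.0, 0.0]))
```

Library implementations usually offer smoothing options so that a single zero precision does not wipe out the score for short sentences; this sketch leaves that out on purpose.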
So now that you have an idea of how this score is calculated, let’s explain the Brevity Penalty (BP). Let’s assume the Candidate is just “I am”. If you were to go through the same process, you would get the following:
1-grams: all 2 match = 2/2 = 1.0
2-grams: 1 match = 1/1 = 1.0
3-grams: No 3-grams
4-grams: No 4-grams
Now calculate the geometric mean over the n-gram orders that exist (1-grams and 2-grams):
$$(1.0 \times 1.0)^{1/2} = 1.0$$
It gets a BLEU score of 1, the highest possible score, which seems really wrong. How can “I am” be a really good English translation of “Je mange une pomme”? As you can see, a machine can trick the BLEU score into scoring really highly by giving really short answers, such as one-word candidates.
To balance this out, Papineni et al. came up with a formula to punish candidates that are shorter than the reference.
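Basically, if the candidate is shorter than the reference, the Brevity Penalty will be equal to $e^{1 - r/c}$ ($r$ = reference length, $c$ = candidate length, both counted in words); otherwise, we don’t apply a penalty and set $\text{BP} = 1$. For the candidate “I am” ($c = 2$, $r = 5$), $\text{BP} = e^{1 - 5/2} \approx 0.22$, so its score drops from 1 to about 0.22. I’m not sure why it used that specific scaling method for short candidates, but I’m sure you can find it in the paper.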
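As a quick sketch of the penalty, assuming lengths are counted in words and there is a single reference (the function name `brevity_penalty` and the empty-candidate guard are my own additions):

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Return 1.0 for long-enough candidates, exp(1 - r/c) for short ones."""
    if candidate_len == 0:
        return 0.0  # guard for this sketch: an empty candidate scores nothing
    if candidate_len >= reference_len:
        return 1.0  # no penalty
    return math.exp(1.0 - reference_len / candidate_len)


print(brevity_penalty(2, 5))  # "I am" vs. the 5-word reference: about 0.223
print(brevity_penalty(5, 5))  # same length: 1.0
```

The shorter the candidate gets relative to the reference, the faster the penalty shrinks toward 0, so a one-word candidate against a 5-word reference would be multiplied by exp(-4).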
Additional Comments
If an n-gram appears more often in the candidate than in the reference, the extra occurrences are not counted as matches (this is known as Clipping, and you can find details about it in the paper). This matters when the candidate or the reference contains the same word more than once.
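Here is a small sketch of why clipping matters, using a candidate that just repeats one word (adapted from an example in the Papineni et al. paper, with only one of its references and everything lowercased). The variable names are mine.

```python
from collections import Counter

candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

cand_counts = Counter(candidate)
ref_counts = Counter(reference)

# Without clipping: every "the" counts as a match
unclipped = sum(count for word, count in cand_counts.items()
                if word in ref_counts)

# With clipping: "the" is credited at most twice, as in the reference
clipped = sum(min(count, ref_counts[word])
              for word, count in cand_counts.items())

print(unclipped / len(candidate))  # 7/7 = 1.0
print(clipped / len(candidate))    # 2/7, about 0.286
```

Without clipping, a candidate made of one common word would get a perfect 1-gram precision.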
Why do we use 1- to 4-grams, and not more? We can, but most implementations stop at 4-grams since that gives a good enough answer. With longer n-grams, there is a high chance that a lot of BLEU scores will be 0.
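What is the base-e exponential, summation, $w_n$, and logarithm? $\exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$ with $w_n = 1/N$ is just another way to write the geometric mean $\left(\prod_{n=1}^{N} p_n\right)^{1/N}$. If it is still confusing, you can look up “Log Sum Form for Geometric Mean” and that might help clarify things.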
References
Papineni, K. et al. (2002) BLEU: A method for automatic evaluation of machine translation, Aclanthology.org.
Built with Next.js and Tailwind CSS. Made with ❤️ by Justin Zhang.