Introduction to BLEU in Python
BLEU is a metric to quantify the effectiveness of Machine Translation (MT). It stands for BiLingual Evaluation Understudy $^{[1]}$. In essence, it addresses the problem of comparing a machine-generated translation against several different human reference translations produced by different annotators. Let’s start by using BLEU from the NLTK translation package $^{[2]}$.
Assume we have translation references, generated by 4 annotators:
- this is a ship
- it is ship
- ship it is
- a ship, it is # Yoda style
and our MT system generates:
it is ship
Then we ask: how good is this translation, given all the different references by the human annotators?
import warnings
warnings.filterwarnings('ignore')
import nltk.translate.bleu_score as bleu
# split will tokenize the sentence into a list of words (unigrams)
reference = [
'this is a ship'.split(),
'it is ship'.split(),
'ship it is'.split(),
'a ship, it is'.split() # master Yoda
]
translation = 'it is ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
BLEU score: 1.0
What does it mean?
The sentence_bleu
method produces a continuous number in $[0, 1]$. A BLEU score of $1.0$ means that the machine output exactly matches one of the human reference translations, which quantitatively indicates a good translation. Let’s take another translation example
translation = 'it is a ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
BLEU score: 0.8408964152537145
“it is a ship” should also be categorized as a good translation, right? However, since no reference says exactly this, it gets a lower score.
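One way to see what is going on is to look at individual n-gram orders. NLTK’s sentence_bleu accepts a weights argument, one weight per n-gram order; here is a minimal sketch, reusing the reference and translation variables above, that scores with unigrams only:
# unigram-only BLEU: every word of 'it is a ship' occurs in some reference,
# so the unigram precision is perfect and the score comes out as 1.0
print('BLEU-1 score: {}'.format(
    bleu.sentence_bleu(reference, translation, weights=(1, 0, 0, 0))
))
With the default weights of 0.25 per order, bigram, trigram and 4-gram precision are mixed in as well, and that is where “it is a ship” loses points.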
The Problem with the BLEU Score
Let’s look at the following examples
translation = 'it'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
translation = 'it it it it it it it'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
translation = 'it a b c d e f g h i j k l m n'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
translation = 'ship ship ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
translation = 'it ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))
BLEU score: 0.1353352832366127
BLEU score: 0.6147881529512643
BLEU score: 0.6042750794713536
BLEU score: 0.7598356856515925
BLEU score: 0.6065306597126334
These are bad translations, yet they still get such high BLEU scores! Why is that? Let’s step back a little and look at how the BLEU score is calculated.
BLEU From Scratch
Looking at the BLEU paper $^{[1]}$ and Andrew Ng’s video $^{[2]}$, we know that we should split our implementation into components.
import numpy as np
from nltk import ngrams
from collections import Counter
Counting Word Vectors
Matching n-gram counts between the hypothesis and one or all of the references is an intuitive and simple way to compare a translation hypothesis against its candidate references, so we will start from here.
def count_ngram(unigram, ngram=1):
    """
    Return
    -----
    counter: dict, containing ngram as key, and count as value
    """
    return Counter(ngrams(unigram, ngram))
test_unigram = 'this is is a ship'.split()
res = count_ngram(test_unigram)
assert res[('is',)] == 2
assert res[('a',)] == 1
print('res: {}'.format(res))
res: Counter({('is',): 2, ('this',): 1, ('ship',): 1, ('a',): 1})
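The same helper handles higher-order n-grams through its ngram argument; here is a quick check of the bigram case on the same sentence:
res2 = count_ngram(test_unigram, ngram=2)
assert res2[('is', 'is')] == 1
assert res2[('a', 'ship')] == 1
print('bigram res: {}'.format(res2))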
Count Clip
Now we know that we could have several references for one hypothesis. Counting n-grams in the hypothesis alone will not be enough, because each reference has its own n-gram count vector. We want to clip the counts obtained from the hypothesis against a count vector that represents all of the references we have.
One way to construct such a count vector from the references is, for each n-gram, to take the maximum number of times it occurs in any single reference, instead of summing or averaging. We call this $\text{Max\_Ref\_Count}$. Now we want to “clip” the count of each n-gram from the translation:
\[Count_{clip} = \min(Count, \text{Max\_Ref\_Count})\]
def count_clip_ngram(translation_u, list_of_reference_u, ngram=1):
"""
Return
----
"""
res = dict()
# retrieve hypothesis counts
ct_translation_u = count_ngram(translation_u, ngram=ngram)
# retrieve translation candidate counts
for reference_u in list_of_reference_u:
ct_reference_u = count_ngram(reference_u, ngram=ngram)
for k in ct_reference_u:
if k in res:
res[k] = max(ct_reference_u[k], res[k])
else:
res[k] = ct_reference_u[k]
return {
k: min(ct_translation_u.get(k, 0), res.get(k, 0))
for k in ct_translation_u
}
translation = 'is some'
references = [
'this is a test',
'this is is is test'
]
res = count_clip_ngram(
translation.split(),
list(map(lambda ref: ref.split(), references))
)
assert res[('is',)] == 1
assert res[('some',)] == 0
print('result from count clip: {}'.format(res))
result from count clip: {('is',): 1, ('some',): 0}
Modified Precision
The modified precision ($p_n$, where $n$ is the n-gram order) is defined by
\[p_n = \frac{ \sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram}) }{ \sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}') }\]
or in plain English:
the sum of the clipped n-gram counts (matching the translation against the references) divided by the sum of the n-gram counts of the translation itself.
def modified_precision(translation_u, list_of_reference_u, ngram=1):
    ct_clip = count_clip_ngram(translation_u, list_of_reference_u, ngram)
    ct = count_ngram(translation_u, ngram)
    return sum(ct_clip.values()) / float(max(sum(ct.values()), 1))
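As a quick sanity check we can reproduce the classic example from the BLEU paper $^{[1]}$: a candidate consisting only of the word “the” against two references (lowercased here for simple token matching). The clipped count is 2 while the plain count is 7, so $p_1 = 2/7$.
trans_the = 'the the the the the the the'.split()
refs_the = [
    'the cat is on the mat'.split(),
    'there is a cat on the mat'.split()
]
p1 = modified_precision(trans_the, refs_the, ngram=1)
assert abs(p1 - 2 / 7.0) < 1e-9
print('modified unigram precision: {}'.format(p1))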
Brevity Penalty
Now we want to reward a translation whose length matches one of the references, neither shorter nor longer. Since the reference lengths may differ, we find and use the closest (effective) reference length. This is where the Brevity Penalty ($BP$) comes in, with \(BP \in [0, 1]\),
and we define the $BP$ function as
\[BP = \begin{cases} 1 & \text{if} \space c > r \\ e^{(1-r/c)} & \text{if} \space c \leq r \end{cases}\]where
$c$: length of candidate translation
$r$: effective reference length
We don’t penalize when $c$ is larger than $r$, because a too-long candidate is already penalized by the modified precision. To be exact, when a relevant word occurs more often than it should, or when extra irrelevant words appear, the denominator of the modified precision grows, pulling the precision (and thus the BLEU score) towards 0.
We now show the implementation of $BP$ in code.
def closest_ref_length(translation_u, list_of_reference_u):
    """
    determine the closest reference length from translation length
    """
    len_trans = len(translation_u)
    closest_ref_idx = np.argmin([abs(len(x) - len_trans) for x in list_of_reference_u])
    return len(list_of_reference_u[closest_ref_idx])
def brevity_penalty(translation_u, list_of_reference_u):
    """
    compute the brevity penalty BP from the candidate length
    and the closest (effective) reference length
    """
    c = len(translation_u)
    r = closest_ref_length(translation_u, list_of_reference_u)
    if c > r:
        return 1
    else:
        return np.exp(1 - float(r) / c)
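A quick check, reusing the tokenized reference list from the NLTK example at the top: for the one-word hypothesis “it” the closest reference has length 3, so $BP = e^{1 - 3/1} = e^{-2} \approx 0.135$, which matches the NLTK score we saw for that hypothesis earlier; there the unigram precision was perfect, so the brevity penalty did essentially all the work.
# one-word hypothesis against the references defined in the first example
bp = brevity_penalty('it'.split(), reference)
print('brevity penalty: {}'.format(bp))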
The BLEU Score
Calculating BLEU after defining the steps above is straightforward!
\[BLEU = BP \cdot \exp\left(\sum^{N}_{n=1} w_n \log p_n\right)\]
where:
$N$: the maximum n-gram order, usually 4.
$w_n$: the weight for each modified precision. By default $1/N$ is used, so 0.25 for each n-gram order.
$p_n$: the modified precision for each n-gram order
def bleu_score(translation_u, list_of_reference_u, W=[0.25 for x in range(4)]):
    bp = brevity_penalty(translation_u, list_of_reference_u)
    modified_precisions = [
        modified_precision(translation_u, list_of_reference_u, ngram=ngram)
        for ngram, _ in enumerate(W, start=1)
    ]
    # skip zero precisions instead of taking log(0)
    score = np.sum([
        wn * np.log(modified_precisions[i]) if modified_precisions[i] != 0 else 0
        for i, wn in enumerate(W)
    ])
    return bp * np.exp(score)
assert bleu_score(
    translation.split(),
    list(map(lambda ref: ref.split(), references)),
) == bleu.sentence_bleu(
    list(map(lambda ref: ref.split(), references)),
    translation.split(),
)
print('All is good!')
All is good!
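To close the loop, we can re-score a few of the hypotheses from the problem section with our own implementation (assuming the tokenized reference list from the first example is still in scope); the numbers should line up with the NLTK scores we saw there.
for hyp in ['it', 'ship ship ship', 'it is ship']:
    print('{!r}: {}'.format(hyp, bleu_score(hyp.split(), reference)))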
Conclusion
We have seen how to construct the BLEU score over several different human references and use it to evaluate a hypothesis translation, and it did a good job of scoring the hypothesis in this kind of setting.
BLEU can still assign a relatively good score to a bad translation. In one of Philipp Koehn et al.’s works $^{[4]}$, they explain the phenomena that make BLEU flawed. Nevertheless, it remains a widely used metric for evaluating machine translation performance to this day.