Introduction to BLEU in Python

BLEU is a metric to quantify effectiveness of an Machine Translation (MT). It stands for BiLingual Evaluation Understudy $^{[1]}$. In general it solves the problem of different human translation references by different annotators when comparing to machine generated translation. Let’s start from using BLEU in NLTK translation package $^{[2]}$

Assume we have translation references, generated by 4 annotators:

this is a ship
it is ship
ship it is
a ship, it is # Yoda style

yoda

and our MT system generate:

it is ship

Then we asked: how good is this? Given all difference references by human annotators?

import warnings
warnings.filterwarnings('ignore')

import nltk.translate.bleu_score as bleu

# split will transform the sentence into 1-gram
reference = [
    'this is a ship'.split(),
    'it is ship'.split(),
    'ship it is'.split(),
    'a ship, it is'.split() # master Yoda
]

translation = 'it is ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))

BLEU score: 1.0

What does it means?

the sentence_bleu method produces a continuous number from $[0, 1]$. BLEU score $1.0$ means that the machine output has exactly matched one of referenced human translation and it quantitatively shows that it is a good translation. Let’s take another translation example

translation = 'it is a ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))

BLEU score: 0.8408964152537145

“it is a ship” in English should also be categorized as a good translation right? However since there are no referenced said exactly like this, it got lower score.

The Problem of BLEU score

Let’s again look at these example

translation = 'it'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))

translation = 'it it it it it it it'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))


translation = 'it a b c d e f g h i j k l m n'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))

translation = 'ship ship ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))

translation = 'it ship'.split()
print('BLEU score: {}'.format(bleu.sentence_bleu(reference, translation)))

BLEU score: 0.1353352832366127
BLEU score: 0.6147881529512643
BLEU score: 0.6042750794713536
BLEU score: 0.7598356856515925
BLEU score: 0.6065306597126334

escalated

these are bad translations, but would result such a high BLEU score! Why is that? Let’s go back a little on how to calculate BLEU score

BLEU From Scratch

Looking at BLEU paper $^{[1]}$ and Andrew Ng’s video $^{[2]}$ we know that we should split our implementations into components

import numpy as np
from nltk import ngrams
from collections import Counter

Counting Word Vectors

Matching word vector count with all candidate or one of the candidate is an intuitive and simple way to match translation candidate / reference with hypothesis so we will start from here.

def count_ngram(unigram, ngram=1):
    """
    Return
    -----
    counter: dict, containing ngram as key, and count as value
    """
    return Counter(ngrams(unigram, ngram))

test_unigram = 'this is is a ship'.split()
res = count_ngram(test_unigram)

assert res[('is',)] == 2
assert res[('a',)] == 1

print('res: {}'.format(res))

res: Counter({('is',): 2, ('this',): 1, ('ship',): 1, ('a',): 1})

Count Clip

Now we know that we could have several translation candidate for 1 hypothesis. Counting word vector will not be enough as we could have different word vector for each of translation candidate / reference. We wanted to normalize this word vector that we have obtained with word vector that will represent all references that we have.

One way to construct word vector from references is to find the maximum number of ngram occurs in on each reference instead of summing or averaging. We call this $\text{Max_Ref_Count}$. Now we want to “clip” count for each gram from translation.

\[Count_{clip} = min(Count, \text{Max_Ref_Count})\]

def count_clip_ngram(translation_u, list_of_reference_u, ngram=1):
    """
    Return
    ----
    
    """
    res = dict()
    # retrieve hypothesis counts
    ct_translation_u = count_ngram(translation_u, ngram=ngram)
    
    # retrieve translation candidate counts
    for reference_u in list_of_reference_u:
        ct_reference_u = count_ngram(reference_u, ngram=ngram)
        for k in ct_reference_u:
            if k in res:
                res[k] = max(ct_reference_u[k], res[k])
            else:
                res[k] = ct_reference_u[k]
                

    return {
        k: min(ct_translation_u.get(k, 0), res.get(k, 0)) 
        for k in ct_translation_u
    }

translation = 'is some'
references = [
    'this is a test',
    'this is is is test'
]

res = count_clip_ngram(
    translation.split(), 
    list(map(lambda ref: ref.split(), references))
)


assert res[('is',)] == 1
assert res[('some',)] == 0

print('result from count clip: {}'.format(res))

result from count clip: {('is',): 1, ('some',): 0}

Modified Precision

The modified precission ($p_n$, where $n$ is $n$-gram) is defined by

\[p_n = \frac{ \sum_{C \space \in \space \{Candidates\}} \sum_{n\text{-gram} \space \in \space C} Count_{clip}(n\text{-gram)} }{ \sum_{C' \space \in \space \{Candidates\}} \sum_{n\text{-gram}' \space \in \space C'} Count(n\text{-gram')} }\]

or in english language:

sum of all count clipped of n-gram of (translation and references) divided by sum of count n-gram from translation

def modified_precision(translation_u, list_of_reference_u, ngram=1):
    ct_clip = count_clip_ngram(translation_u, list_of_reference_u, ngram)
    ct = count_ngram(translation_u, ngram)
    
    return sum(ct_clip.values()) / float(max(sum(ct.values()), 1))

Brevity Penalty

Now we wanted to show that we will reward translation that matches one of reference candidate length, not shorted and not longer. Since references length may be difference, we could still find and use the closest/effective reference length. Here the term Brevity Penalty ($BP$) comes $BP \space \in \space [0, 1]$

and we make $BP$ function to be

\[BP = \begin{cases} 1 & \text{if} \space c > r \\ e^{(1-r/c)} & \text{if} \space c \leq r \end{cases}\]

where

$c$: length of candidate translation

$r$: effective reference length

We don’t penalize if $c$ larger than candidate translation because it has already penalized by modified precision. To be exact, because in modified precision when duplicate but relevant word occurs more than it supposed to or when the number of irrelevant word occurs more, it will results in larger denominator thus pulling the $BP$ towards 0.

We now want to show implementation of $BP$ in code.

def closest_ref_length(translation_u, list_of_reference_u):
    """
    determine the closest reference length from translation length
    """
    len_trans = len(translation_u)
    closest_ref_idx = np.argmin([abs(len(x) - len_trans) for x in list_of_reference_u])
    return len(list_of_reference_u[closest_ref_idx])
    

def brevity_penalty(translation_u, list_of_reference_u):
    """
    """
    c = len(translation_u)
    r = closest_ref_length(translation_u, list_of_reference_u)
    
    if c > r:
        return 1
    else:
        return np.exp(1 - float(r)/c)

The Bleu Score

Calculating BLEU after defining steps above is straightforward!

\[BLEU = BP.\exp(\sum^{N}_{n=1} w_n \log p_n)\]

where:

$N$: maximum number of $n$-gram. Usually 4.

$w_n$: weight for each modified precision. By default use $1/N$, so 0.25 each gram.

$p_n$: modified precision for each gram

def bleu_score(translation_u, list_of_reference_u, W=[0.25 for x in range(4)]):
    bp = brevity_penalty(translation_u, list_of_reference_u)
    modified_precisions = [
        modified_precision(translation_u, list_of_reference_u, ngram=ngram)
        for ngram, _ in enumerate(W,start=1)
    ]
    score = np.sum([
        wn * np.log(modified_precisions[i]) if modified_precisions[i] != 0 else 0 for i, wn in enumerate(W)
    ])
    
    return bp * np.exp(score)

assert bleu_score(
    translation.split(),
    list(map(lambda ref: ref.split(), references)),
) == bleu.sentence_bleu(
    list(map(lambda ref: ref.split(), references)),
    translation.split(),
)

print('All is good!')

All is good!

Conclusion

We have understand how to construct BLEU score for difference human reference and evaluate a hypothesis translation and we have seen that it did good job in scoring the hypothesis translation with this kind of setting.

BLEU might produce relatively good score to bad translation. In one of Phillip Koehn et al$^{[4]}$ work, they have explained phenomenons that makes BLEU flawed. However, it still is a good metric, used to evaluate machine translation performance up until now.