Virtebi Algorithm and Hidden Markov Model - Part 2

I’ve implemented virtebi algorithm and explain the advantage from naive approach at last post. Now it’s time to look at another use case example: the Part of Speech Tagging!

POS Tag

Part of Speech Tag (POS Tag / Grammatical Tag) is a part of natural language processing task. The main problem is

“given a sequence of word, what are the postags for these words?”.

Example of POS Tag

If you understand this writing, I’m pretty sure you have heard categorization of words, like: noun, verb, adjective, etc. Well, those are POS Tag! But then we now have more POS Tag then you have teached in English! (you can see here)

For example this word:

**"I like sushi"**

could be broken down into:

I	like	sushi
PRON	VERB	NOUN

Where,

PRON = pronoun

VERB = verb

NOUN = noun

So how to answer the real question here?

One way to model on how to get the answer, is by: \(P(pos\space tag \space sequence \space | \space words \space sequence) = \prod_{i=1}^{I}{P(tag\space |\space word_{w})}\)

Hidden Markov Model using Pomegranate

We can impelement this model with Hidden Markov Model. For this experiment, I will use pomegranate library instead of developing on our own code like on the post before. It will enable us to construct the model faster and with more intuitive definition. I will walk you through the process.

import pandas as pd
# I will use nltk library with brown corpus as example
import nltk
import nltk.corpus as corp

import pomegranate as pm

# download the corpus from nltk repository
if not corp.brown:
    nltk.download('brown')

print('Brown corpus has {:,} tagged words / unigram'.format(len(corp.brown.tagged_words())))

Brown corpus has 1,161,192 tagged words / unigram

# see sneak peak from tagged words
corp.brown.tagged_words()[:10]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN')]

# sampling for the first 5000 tokens just for demo sake
sample = corp.brown.tagged_words()[:5000]
print('We have sample: {} tokens'.format(len(sample)))

We have sample: 5000 tokens

def construct_discrete_distributions_per_tag(tagged_tokens):
    """
    input:
    -------
    tagged_tokens: list of tagged tokens in form like:
            [('The', 'AT'),
             ('Fulton', 'NP-TL'),
             ('County', 'NN-TL'),
             ('Grand', 'JJ-TL'),
             ('Jury', 'NN-TL'),
             ('said', 'VBD'),
             ('Friday', 'NR'),
             ('an', 'AT'),
             ('investigation', 'NN'),
             ('of', 'IN')]
    
    This function will generate initial probability for each tag
    """
    tag_probs = dict()
    for token, tag in tagged_tokens:
        if tag not in tag_probs:
            tag_probs[tag] = dict()
            tag_probs[tag]['count_tag'] = dict()
            tag_probs[tag]['occurence'] = 1
        else:
            tag_probs[tag]['occurence'] += 1
            
        if token not in tag_probs[tag]['count_tag']:
            tag_probs[tag]['count_tag'][token] = 1
        else:
            tag_probs[tag]['count_tag'][token] += 1
            
    for tag in tag_probs:
        tag_probs[tag]['probs'] = dict()
        for token in tag_probs[tag]['count_tag']:
            tag_probs[tag]['probs'][token] = float(tag_probs[tag]['count_tag'][token]) / float(tag_probs[tag]['occurence'])
    
    return tag_probs            

def construct_transition_probabilities_per_tag(tagged_tokens):
    """
    input:
    -------
    tagged_tokens: list of tagged tokens in form like:
            [('The', 'AT'),
             ('Fulton', 'NP-TL'),
             ('County', 'NN-TL'),
             ('Grand', 'JJ-TL'),
             ('Jury', 'NN-TL'),
             ('said', 'VBD'),
             ('Friday', 'NR'),
             ('an', 'AT'),
             ('investigation', 'NN'),
             ('of', 'IN')]
             
    This function will generate the emission matrix / probabilities for each tag
        
    """
    transition_probs = dict()
    for i in range(len(tagged_tokens)):
        current_tag = tagged_tokens[i][1]
        if current_tag not in transition_probs:
            transition_probs[current_tag] = dict()
            transition_probs[current_tag]['occurence'] = 0
            transition_probs[current_tag]['count_transition'] = dict()
        
        # evaluate previous tag
        if i > 0:
            previous_tag = tagged_tokens[i-1][1]
            pt = tagged_tokens[i-1][0]
            transition_probs[previous_tag]['occurence'] += 1
            
            # special case for <start> tag
            if pt == '.':
                if '<start>' not in transition_probs:
                    transition_probs['<start>'] = dict()
                    transition_probs['<start>']['occurence'] = 0
                    transition_probs['<start>']['count_transition'] = dict()
                if current_tag not in transition_probs['<start>']['count_transition']:
                    transition_probs['<start>']['count_transition'][current_tag] = 0
                    
                transition_probs['<start>']['count_transition'][current_tag] += 1
                transition_probs['<start>']['occurence'] += 1
                    
            
            #init
            if current_tag not in transition_probs[previous_tag]['count_transition']:
                transition_probs[previous_tag]['count_transition'][current_tag] = 0
                
            transition_probs[previous_tag]['count_transition'][current_tag] += 1
    for tag in transition_probs:
        transition_probs[tag]['probs'] = dict()
        for transit_tag in transition_probs[tag]['count_transition']:
            transition_probs[tag]['probs'][transit_tag] = \
                float(transition_probs[tag]['count_transition'][transit_tag]) / float(transition_probs[tag]['occurence'])
    
    return transition_probs

def build_hmm_model(token_dist, transition_dist, model_name='hmm-tagger'):
    state_dict = dict()
    for token in token_dist:
        state_dict[token] = \
            pm.State(
                pm.DiscreteDistribution(
                    token_dist[token]['probs']
                )
                , name=token
            )
            
    model = pm.HiddenMarkovModel(model_name)
    model.add_states(list(state_dict.values()))
    
    # initialization for starting tokens
    for token, prob in transition_dist['.']['probs'].items():
        model.add_transition(state_dict[token], model.end, prob)
        
    for token, prob in transition_dist['<start>']['probs'].items():
        model.add_transition(model.start, state_dict[token], prob)
    
    transition_dist_list = list(transition_dist.items())
    for i in range(1, len(transition_dist_list)):
        ptoken, pmeta = transition_dist_list[i]
        if ptoken != '.' and ptoken != '<start>':
            for ctoken, cprob in pmeta['probs'].items():
                
                model.add_transition(
                    state_dict[ptoken],
                    state_dict[ctoken],
                    cprob
                )

        
    return model, state_dict

from sklearn.base import BaseEstimator, ClassifierMixin 

class HmmTaggerModel(BaseEstimator, ClassifierMixin):
    """
    POS Tagger with Hmm Model
    """
    def __init__(self):
        self._inner_model = None
        self._tag_dist = None
        self._transition_dist = None
        self._state_dict = None
    
    def fit(self, X, y=None):
        """
        expecting X as list of tokens, while y is list of POS tag
        """
        combined = list(zip(X, y))
        self._tag_dist = construct_discrete_distributions_per_tag(combined)
        self._transition_dist = construct_transition_probabilities_per_tag(combined)
        
        self._inner_model, _ = build_hmm_model(self._tag_dist, self._transition_dist)
        self._inner_model.bake()
        
    
    def predict(self, X, y=None):
        """
        expecting X as list of tokens
        """
        return [state.name for i, state in self._inner_model.viterbi(X)[1]][1:-1]

def accuracy(ypred, ytrue):
    total = len(ytrue)
    correct = 0
    for pred, true in zip(ypred, ytrue):
        if ypred == ytrue:
            correct += 1
    
    return correct / total

model = HmmTaggerModel()
model.fit(map(lambda x: x[0], sample), map(lambda x: x[1], sample))

import numpy as np

st_idx = [0, 68]
end_idx = [10, 72]

list_acc = list()
for st, end in zip(st_idx, end_idx):
    actual_words = list(map(lambda x: x[0], sample[st:end]))
    actual = list(map(lambda x: x[1], sample[st:end]))
    predicted = model.predict(
            list(map(lambda x: x[0], sample[st:end]))
        )
    print('actual words: {}'.format(' '.join(actual_words)))
    print('actual tags: {}'.format(' '.join(actual)))
    print('predicted tags: {}'.format(' '.join(predicted)))
    print('============================================================')
    list_acc.append(accuracy(actual, predicted))
mean_acc = np.mean(list_acc)

print('mean accuracy: {}'.format(mean_acc))

actual words: The Fulton County Grand Jury said Friday an investigation of
actual tags: AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN
predicted tags: AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN
============================================================
actual words: The September-October term jury
actual tags: AT NP NN NN
predicted tags: AT NP NN NN
============================================================
mean accuracy: 1.0

That went well! Now lets try for bigger corpuses! I wil use 500,000 words from the brown corpus.

larger_sample = corp.brown.tagged_words()[:500000]
model = HmmTaggerModel()
model.fit(
    map(lambda x: x[0], larger_sample), 
    map(lambda x: x[1], larger_sample)
)

st_idx = [0, 68, 103]
end_idx = [10, 72, 118]

list_acc = list()
for st, end in zip(st_idx, end_idx):
    actual_words = list(map(lambda x: x[0], sample[st:end]))
    actual = list(map(lambda x: x[1], sample[st:end]))
    predicted = model.predict(
            list(map(lambda x: x[0], sample[st:end]))
        )
    print('actual words: {}'.format(' '.join(actual_words)))
    print('actual tags: {}'.format(' '.join(actual)))
    print('predicted tags: {}'.format(' '.join(predicted)))
    print('============================================================')
    list_acc.append(accuracy(actual, predicted))
mean_acc = np.mean(list_acc)

print('mean accuracy: {}'.format(mean_acc))

actual words: The Fulton County Grand Jury said Friday an investigation of
actual tags: AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN
predicted tags: AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN
============================================================
actual words: The September-October term jury
actual tags: AT NP NN NN
predicted tags: AT NP NN NN
============================================================
actual words: `` Only a relative handful of such reports was received '' , the jury said
actual tags: `` RB AT JJ NN IN JJ NNS BEDZ VBN '' , AT NN VBD
predicted tags: `` RB AT JJ NN IN JJ NNS BEDZ VBN '' , AT NN VBD
============================================================
mean accuracy: 1.0

It does make a good model!! Though it takes more time for larger model.

Is This a Perfect Model?

But now you might wonder, is this the perfect model for POS tagging? Well congratulation! you asked a good question and the answer is No! It is not the perfect model. I have only showed you working case, but actually there will be problems:

Out Of Vocabulary (OOV), where there are no matching word between input and training data.
Training data without exhaustive positioning of tagging. This simple model will not be able to adjust it self if, let’s say there is a word tagged as NP in the beginning of sentence, but this has never happened in training data.
Lack of preprocessing, such as lower casing, punctuation removal, etc will make the model not focused enough into predicting POS tag.

Conclussion

Even though HMM will produces a fairly good model for POS Tagging, but you need to watch for disadvantages for using this model.

References:

https://en.wikipedia.org/wiki/Brown_Corpus
https://pomegranate.readthedocs.io/en/latest/HiddenMarkovModel.html#hiddenmarkovmodel

Written on July 29, 2018

ariepratama.github.io

Personal Blog