
Synopsis

This document provides the background and describes the approach used to implement the application “Next Word Prediction: Stupid Backoff”.

The application was developed as the capstone project for the Data Science Specialization, a cycle of courses offered on Coursera by Johns Hopkins University. The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, in mobile devices to provide suggestions as the user enters text, or as part of a spelling-correction module in a text editor.

You may find the following links useful:

Model

Our application predicts the next word of a text, given its prefix, by using a Markov chain model simplified to n-grams.

The Markov chain model assumes that in a natural-language sentence the probability of each word depends only on the words preceding it. The n-gram model simplifies the Markov chain model by assuming that each word depends only on a fixed number of immediately preceding words, thus ignoring long-range dependencies. The n-gram model combines simplicity with acceptable prediction quality, making it the model of choice for many Natural Language Processing (NLP) applications.
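Formally, an n-gram model of order \(n\) approximates the probability of a word given its entire history by the probability given only the \(n-1\) immediately preceding words:

\[
P(w_i | w_1, w_2, \ldots, w_{i-1}) \approx P(w_i | w_{i-n+1}, \ldots, w_{i-1})
\]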

Implementation

Coursera provides a training text corpus, HC Corpora (1). The corpus contains texts in several languages collected from various sources on the Web, including blogs, news web sites, and Twitter. The English corpus consists of approximately 4.2 million lines of text.

We split the English corpus into 3 parts: training (60%), testing (20%), and validation (20%). The training part was used to train the prediction algorithm, the testing part to optimize meta-parameters, and the validation part for the final validation of the algorithm.

The training corpus was used to build 1- to 5-grams. Each n-gram was further split into an (n-1)-gram prefix (empty for 1-grams) and a single-word suffix. Our model attempts to predict the last word (the suffix) based on the previous 4 words (the prefix). For each n-gram \(w_1,w_2,\ldots,w_{n}\) we store in a lookup table the prefix \(w_1,w_2,\ldots,w_{n-1}\), the suffix \(w_n\), as well as \(P_{ML}(w_n | w_1,\ldots,w_{n-1})\), the maximum likelihood estimate of the conditional probability of the suffix \(w_n\) given the prefix \(w_1,\ldots,w_{n-1}\).
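Here the maximum likelihood estimate is simply the ratio of counts observed in the training corpus, where \(c(\cdot)\) denotes the number of occurrences of an n-gram:

\[
P_{ML}(w_n | w_1,\ldots,w_{n-1}) = \frac{c(w_1,\ldots,w_{n-1},w_n)}{c(w_1,\ldots,w_{n-1})}
\]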

The prediction algorithm consists of two steps: choosing candidates, and scoring the candidates to select the best matches.

We decided to use a very simple candidate-selection algorithm: we choose all available candidates, proceeding from 5-grams to 1-grams and skipping suffixes we have already collected. Given the prefix \(w_1,w_2,\ldots,w_{n-1}\), we choose all 5-grams starting with the 4-gram prefix \(w_{n-4},w_{n-3},w_{n-2},w_{n-1}\), all 4-grams starting with the 3-gram prefix \(w_{n-3},w_{n-2},w_{n-1}\), and so on.
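The sketch below illustrates this candidate-selection step. It assumes, purely for illustration, that the n-gram tables are stored as a list `ngrams` of data frames, one per n-gram order, each with character columns `prefix` (space-separated words, empty for 1-grams) and `suffix`; these names do not necessarily match the application's actual data structures.

```r
# Collect candidate suffixes, walking from 5-grams down to 1-grams.
choose_candidates <- function(ngrams, words) {
    candidates <- character(0)
    for (n in 5:1) {
        # The prefix of an n-gram consists of the last (n - 1) input words
        # (empty for 1-grams).
        prefix <- paste(tail(words, n - 1), collapse = " ")
        matches <- ngrams[[n]][ngrams[[n]]$prefix == prefix, "suffix"]
        # Skip suffixes already collected from higher-order n-grams.
        candidates <- c(candidates, setdiff(matches, candidates))
    }
    candidates
}

# Example: candidates following the prefix "thank you very much for".
# choose_candidates(ngrams, c("thank", "you", "very", "much", "for"))
```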

The scoring algorithm selects the best N matches from all candidates by calculating a numerical score for each candidate and choosing the N candidates with the top scores. The score may be a probability, but it may also be a different kind of numerical quantity, as is the case for the Stupid Backoff algorithm.

The Stupid Backoff algorithm is a highly efficient scoring algorithm proposed in 2007 by Thorsten Brants et al. (2). On large data sets the algorithm gives scores close to those of the Kneser-Ney algorithm, but it is significantly faster. The Stupid Backoff algorithm returns not probabilities but relative scores of words (they do not sum to 1), which is sufficient for our purposes.

The Stupid Backoff algorithm is described by the following formula:

\(SB(w_n|w_1, w_2,\ldots,w_{n-1}) =
\begin{cases}
P_{ML}(w_n|w_1,\ldots,w_{n-1}), & \text{if the n-gram } w_1,\ldots,w_n \text{ was observed in the training data} \\
\lambda \cdot SB(w_n|w_2,\ldots,w_{n-1}), & \text{otherwise}
\end{cases}\)

where \(P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1})\) is the maximum likelihood estimate of the conditional probability of the suffix \(w_n\) given the prefix \(w_1,\ldots,w_{n-1}\). The authors of the Stupid Backoff algorithm recommend using \(\lambda = 0.4\).

In other words, first we attempt to look up the n-gram in the table for the largest n available. If the n-gram is found, then the score of the last word is the maximum likelihood estimate of the conditional probability of the last word given the prefix. Otherwise, we back off (hence the algorithm's name) to the table with (n-1)-grams, do the same calculation, and multiply the result by \(\lambda = 0.4\). If the shortened prefix is not found either, the recursion goes deeper, terminating at 1-grams.
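The following sketch shows how this recursion might look, reusing the illustrative `ngrams` structure from above with an additional column `p_ml` holding the stored maximum likelihood estimate. It is a simplified illustration under those assumptions, not the application's actual implementation.

```r
# Stupid Backoff score of a candidate `word` given the input `words`.
# Starts at the highest n-gram order and backs off with factor
# lambda = 0.4 when the (prefix, word) pair is not found.
stupid_backoff <- function(ngrams, words, word, n = 5, lambda = 0.4) {
    prefix <- paste(tail(words, n - 1), collapse = " ")
    tbl <- ngrams[[n]]
    hit <- tbl[tbl$prefix == prefix & tbl$suffix == word, ]
    if (nrow(hit) > 0) {
        hit$p_ml[1]
    } else if (n > 1) {
        # Back off to the (n - 1)-gram table and penalize by lambda.
        lambda * stupid_backoff(ngrams, words, word, n - 1, lambda)
    } else {
        0  # Unknown word: not found even in the 1-gram table.
    }
}
```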

Optimization

In our implementation of the Stupid Backoff algorithm we have applied several optimizations to reduce the memory usage and latency.

By using all of the optimization techniques mentioned above, we were able to reduce the memory required to hold the n-gram tables to 53.8 MiB (12 MiB in the compressed RDS format). The average time required to predict the 10 top-scoring candidates was under 20 ms on our hardware.

Extensions

We added to the base algorithm a simple optional extension which predicts the word the user is currently typing. As long as the last typed character is not a space, the algorithm assumes that the user is continuing to type the current word, and predicts that word. Only after a space character is typed does the algorithm predict the next word.

The extension integrates easily into the candidate-selection step. Given the prefix \(w_1,w_2,\ldots,w_{n-1}\) and a partially typed word \(w^\prime_n\), we choose all 5-gram candidates which start with the prefix \(w_1,w_2,\ldots,w_{n-1}\) and whose 5th word starts with the word prefix \(w^\prime_n\), and similarly for 4- to 1-grams.
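Under the same illustrative data layout as in the sketches above, the extension reduces to one extra filter on the collected candidates:

```r
# Keep only candidates whose suffix starts with the partially typed word.
choose_candidates_partial <- function(ngrams, words, partial) {
    candidates <- choose_candidates(ngrams, words)
    candidates[startsWith(candidates, partial)]
}

# Example: the user has typed "I would like to th" (last word unfinished).
# choose_candidates_partial(ngrams, c("I", "would", "like", "to"), "th")
```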

Some words may be missing from our dictionary, and the user may misspell some words. If we cannot find any candidates using our n-grams, we attempt to predict the next word from the partially entered word by applying the spelling-correction algorithm provided by the R package hunspell. For example, if the user enters the misspelled word “mashine”, we propose the spell-corrected word “machine”.
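The spelling suggestions can be obtained with the package's `hunspell_suggest()` function; the snippet below is a minimal illustration of this fallback.

```r
library(hunspell)

# Ask hunspell for spelling suggestions for the partially entered word;
# the returned suggestions (e.g. "machine" for "mashine") are then used
# as prediction candidates.
suggestions <- hunspell_suggest("mashine")[[1]]
head(suggestions)
```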

Validation

As described in the Implementation section, we reserved 20% of the data for an out-of-sample validation test. The test was done as follows:

The chart below shows the results of the algorithm validation. For example, for blogs our algorithm correctly predicted the next word in 15.42% of cases; the correct word was among the top 3 predictions in 25.43% of cases, and among the top 5 in 30.50% of cases.

The following table shows the mean quality of our prediction algorithm (the percentage of cases in which the correct word was among the top 1, top 3, and top 5 predictions), as well as the 95% confidence intervals.

Prediction precision (validation)

| Source | Word correctly predicted, mean % | 95% conf. int. | Word in Top 3, mean % | 95% conf. int. | Word in Top 5, mean % | 95% conf. int. |
|---|---|---|---|---|---|---|
| Aggregated | 15.40 | (15.15, 15.65) | 25.00 | (24.73, 25.26) | 29.86 | (29.58, 30.14) |
| Blogs | 15.42 | (15.19, 15.66) | 25.43 | (25.13, 25.73) | 30.50 | (30.20, 30.81) |
| News | 16.16 | (15.91, 16.41) | 25.67 | (25.41, 25.94) | 30.28 | (30.00, 30.55) |
| Twitter | 14.77 | (14.55, 14.98) | 24.18 | (23.95, 24.41) | 29.06 | (28.81, 29.31) |
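As an illustration of how a 95% confidence interval for such a proportion can be computed in R (not necessarily the exact procedure we used), with purely hypothetical counts:

```r
# Purely illustrative numbers: suppose the correct word was predicted
# in 1540 out of 10000 validation cases (15.40%).
n_correct <- 1540
n_cases <- 10000

# 95% confidence interval for the proportion of correct predictions.
prop.test(n_correct, n_cases, conf.level = 0.95)$conf.int
```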

As the table demonstrates, our predictions are most precise for news and less precise for blogs and Twitter. This is to be expected: news texts use more formal language in which common expressions tend to repeat, whereas in blogs, and especially on Twitter, authors often prefer brevity to correctness.

Application

A Shiny application using our prediction algorithm is available online: https://serdioa.shinyapps.io/sb-predict.

The application GUI provides the following elements:

References

(1) HC Corpora provided by corpora.epizy.com. About the corpora. Download the corpora.

(2) Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. 2007. Large Language Models in Machine Translation. https://www.aclweb.org/anthology/D07-1090.