Synopsis
This document provides the background and describes the approach used to implement the application “Next Word Prediction: Stupid Backoff”.
The application was developed as the capstone project for the Data Science Specialization, a cycle of courses offered on Coursera by Johns Hopkins University. The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, in mobile devices to provide suggestions as the user enters text, or as a part of a spelling correction module in a text editor.
You may find the following links useful:
Model
We are going to predict the next word in a text, given a prefix, by using a Markov Chain model simplified to n-grams.
The Markov Chain model assumes that in a natural language sentence the probability of each word depends only on the preceding words. The n-gram model simplifies the Markov Chain model by considering each word to depend only on the previous N words, thus ignoring long-range dependencies. The n-gram model combines simplicity with an acceptable prediction quality, making it the model of choice for many Natural Language Processing (NLP) applications.
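To make this concrete, the n-gram simplification factorizes the probability of a sentence as follows (standard formulation, with each word conditioned on the \(N\) preceding words):

\[P(w_1,\ldots,w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-N},\ldots,w_{i-1})\]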
Downloading and splitting the corpus
Before we start discussing the application in detail, a short general remark. We decided not to include the R script sources as part of this document, but to keep them in separate script files which we reference. Moreover, many of the steps required to prepare the data and tune meta-parameters take many hours to run, so it would be impossible to re-run all steps each time we re-render this document. R Markdown provides some facilities for caching intermediate results, but they are not adequate for our task, since we were not able to keep all intermediate results in memory at once. Hence our approach was to process the data piecewise, explicitly caching intermediate results.
Coursera provides the training text corpora HC Corpora (1). The corpora contain texts in several languages collected from various sources on the Web, including blogs, news web sites and Twitter. The English corpus consists of approximately 4.2 million lines of text.
We start by downloading and unzipping the text corpora. The downloaded data is stored in the directory “cache”.
source("include/download.R")
download.data()
The downloaded zip file contains corpora in several languages: English, German, Russian and Finnish. In our project we use only the English corpus.
The corpus for each language, including English, contains 3 files with content obtained from different sources: news, blogs and Twitter.
As the first step, we split each file of the English corpus into 3 parts:
- Training set (60%): used to build and train the algorithm.
- Testing set (20%): used to test the algorithm during its development and to tune meta-parameters. This set may be used more than once.
- Validation set (20%): used for the final validation and estimation of the out-of-sample prediction quality. This set will be used only once.
source("include/split.R")
set.seed(20190530)
split.all()
As a sanity check, we count the number of lines in each source file, as well as in the partial files produced by the split.
| Set | Blogs: Rows | Blogs: % | News: Rows | News: % | Twitter: Rows | Twitter: % |
|---|---:|---:|---:|---:|---:|---:|
| Training | 539572 | 59.99991 | 606145 | 59.99998 | 1416088 | 59.99997 |
| Testing | 179858 | 20.00004 | 202048 | 19.99996 | 472030 | 20.00002 |
| Validation | 179858 | 20.00004 | 202049 | 20.00006 | 472030 | 20.00002 |
| Total | 899288 | 100.00000 | 1010242 | 100.00000 | 2360148 | 100.00000 |
| Control (expected to be 0) | 0 | NA | 0 | NA | 0 | NA |
As the table shows, we have split the data into subsets as intended.
Cleaning and preprocessing the corpus
We decided to split the text into sentences and not to attempt predicting words across sentence borders. We may still use sentence information to improve the prediction of the first word, because the frequency of a word at the start of a sentence may be very different from its average frequency.
We apply the following cleaning and pre-processing steps:
Remove URLs, e-mail addresses and Twitter hashtags.
Remove underscore characters, which are sometimes used for emphasis or to mask a profanity.
Add missing space characters around punctuation characters in preparation for removing punctuation altogether. For example, the corpus contains the sentence “I had the best day yesterday,it was like two kids in a candy store”. Note that the sentence does not contain a space character after the comma, so when punctuation characters are removed, the words “yesterday,it” would be merged into the non-existent word “yesterdayit”. Adding the missing space characters prevents such errors.
Remove punctuation characters.
Replace common short forms such as “I’m” with full forms such as “I am”. The list of replacements is available in the file replacements.txt.
Remove words with non-Latin characters. The corpus contains some words and even whole sentences in languages which are written using non-Latin alphabets, such as Greek or Russian. We keep only words with standard and extended Latin characters (that is, Latin characters with various accents), but exclude words with any other characters.
Prepend the special token STOS (Start-of-Sentence). When building n-grams, this token allows us to distinguish n-grams which appear at the start of a sentence.
Natural language processing has a notion of stop words, that is, words which occur very often and are often filtered out during processing. It is not obvious at this step whether we should include or ignore stop words for the purposes of our task, so we decided to continue with both approaches in parallel. Thus our pre-processing function has one more optional step which removes stop words.
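For illustration, a minimal sketch of the kind of regex-based cleaning described above could look as follows. This is a simplified, hypothetical helper using base R; the actual implementation lives in include/preprocess.R and handles more cases (short forms, non-Latin words, stop words).

# Hypothetical cleaning sketch, not the project's actual code.
clean.sentence <- function(x) {
  x <- gsub("(https?://|www\\.)\\S+", " ", x)  # remove URLs
  x <- gsub("\\S+@\\S+", " ", x)               # remove e-mail addresses
  x <- gsub("#\\S+", " ", x)                   # remove Twitter hashtags
  x <- gsub("_", " ", x)                       # remove underscores
  x <- gsub("([,.!?;:])", " \\1 ", x)          # add spaces around punctuation
  x <- gsub("[[:punct:]]", " ", x)             # remove punctuation
  x <- gsub("\\s+", " ", trimws(x))            # normalize whitespace
  paste("STOS", tolower(x))                    # prepend the start-of-sentence token
}
clean.sentence("I had the best day yesterday,it was like two kids in a candy store")
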
source("include/preprocess.R")
preprocess.all(removeStopwords = FALSE)
preprocess.all(removeStopwords = TRUE)
On our hardware (Intel i5-7600K) the commands above run for several minutes.
Building n-grams
In the previous section we pre-processed the corpus: removed URLs and e-mail addresses, got rid of the punctuation, etc. Our next task is to build n-grams and the corresponding frequency tables.
An important question is which library to use for processing and analyzing the corpus, as R provides several alternatives. Initially we attempted to use the library tm, but quickly found that it is very memory-hungry, and an attempt to build bi- or trigrams for a large corpus is not practical. After some googling we decided to use the library quanteda instead.
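As a minimal illustration of the quanteda workflow (a toy sketch on a single sentence, not the project's actual pipeline in include/ngram.build.R):

library(quanteda)
# Tokenize one cleaned sentence and extract its 3-grams.
toks <- tokens("STOS i had the best day yesterday")
tokens_ngrams(toks, n = 3, concatenator = " ")
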
When building n-grams, we apply the following principles:
We build 1-grams to 5-grams.
In each n-gram we distinguish the prefix, that is the first (n-1) words, from the suffix, that is the last word. As a special case, 1-grams have an empty prefix (NA). We treat words in the prefix and the suffix differently, as described below.
For the prefix, we keep only stems of words. For stemming we use the function SnowballC::wordStem(). Keeping only stems reduces the size of our tables and allows us to match word patterns which otherwise might be missed due to slightly different word forms (see the short example after this list).
For the suffix (the last word of an n-gram) we keep the word “as is”, without any transformations.
We build the list of the \(2^{16}-2\) top-frequency stems and keep only those stems in n-gram prefixes. All other stems, which appear less often, are replaced in prefixes by the special token UNK (Unknown). The number \(2^{16}-2\) allows us to represent each stem which may appear in an n-gram prefix, as well as the 2 special tokens STOS (Start-of-Sentence) and UNK (Unknown), with a number between \(0\) and \(2^{16}-1\). The reasons for this will be explained in the section Binary encoding for stems.
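To illustrate the stemming of prefix words, different forms of the same word collapse to a common stem:

# Different word forms reduce to the common stem "answer".
SnowballC::wordStem(c("answers", "answering", "answered"))
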
source("include/ngram.build.R")
for (n in 1:5) {
ngram.freq.cache(n, removeStopwords = FALSE)
ngram.freq.cache(n, removeStopwords = TRUE)
}
On our hardware (Intel i5-7600K) the commands above run for approximately 6 hours.
As a result of this step, for 1- to 5-grams we obtain frequency tables with the columns Term (the n-gram \(w_1,\ldots,w_n\)) and Freq (the number of times \(c(w_1,\ldots,w_n)\) this n-gram appears in the aggregated corpus).
For the purpose of our task, that is prediction of the next word, we require the frequency table to be transformed. For each n-gram \(w_1,\ldots,w_n\), the transformed table shall contain:
- Prefix: the first \(n-1\) words of the n-gram, that is \(w_1,\ldots,w_{n-1}\). 1-grams have an empty prefix (NA).
- Suffix: the last word of the n-gram, that is \(w_n\).
- Prob: the maximum likelihood estimate of the conditional probability of the Suffix \(w_n\) given the Prefix \(w_1,\ldots,w_{n-1}\):
\[P_{ML}(w_n | w_1,\ldots,w_{n-1}) = \frac{c(w_1,\ldots,w_n)}
{\sum c(w_1,\ldots,w_{n-1},\bullet)}\]
The sum in the denominator is over all n-grams sharing the same Prefix \(w_1,\ldots,w_{n-1}\).
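A minimal sketch of this transformation, assuming the frequency table is a data.table named ngrams with columns Prefix, Suffix and Freq (the project's actual code is in include/ngram.build.R):

library(data.table)
# Toy frequency table; in reality this is built from the corpus.
ngrams <- data.table(Prefix = c("at", "at", "in"),
                     Suffix = c("the", "a", "the"),
                     Freq   = c(2035, 1000, 500))
# Maximum likelihood estimate of P(Suffix | Prefix): the share of each suffix
# among all n-grams with the same prefix.
ngrams[, Prob := Freq / sum(Freq), by = Prefix]
ngrams
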
source("include/ngram.build.R")
for (n in 1:5) {
ngram.extended.cache(n, removeStopwords = FALSE)
ngram.extended.cache(n, removeStopwords = TRUE)
}
On our hardware (Intel i5-7600K) the commands above run for a bit more than 1 hour and use slightly over 30 GB of memory.
The following table shows the number of rows and the memory usage of the n-gram tables with and without stop words.
| N | With stop words: Rows | With stop words: Size, MiB | Without stop words: Rows | Without stop words: Size, MiB |
|---|---:|---:|---:|---:|
| 1 | 524,351 | 42.0 | 523,757 | 42.0 |
| 2 | 8,240,587 | 222.3 | 11,449,151 | 295.7 |
| 3 | 25,403,862 | 972.1 | 23,256,283 | 1,063.5 |
| 4 | 38,589,792 | 2,413.9 | 22,993,832 | 1,941.9 |
| 5 | 42,597,132 | 3,614.9 | 19,821,557 | 1,955.6 |
| Total: | | 7,265.2 | | 5,298.7 |
As the table shows, the tables are very large and do not satisfy the requirements for a Shiny application (max. 1 GB). Besides, with very large tables we may expect a high latency. In the next sections we will reduce the size of the n-gram tables.
Binary encoding for stems
As we have found in the previous section, our n-gram tables are very large. A very significant part of the memory usage is due to the prefixes. We decided to reduce the memory usage by using a binary encoding for the prefixes.
Although R has the data type raw, which works directly with bytes, the storage of this data type is very inefficient, and in our experience it is very inconvenient to use as a column in a data frame. We decided to encode stems using the R data types integer (4 bytes) and numeric (8 bytes) instead.
Theoretically, Huffman coding could have been used to compress the data and reduce the storage size, but in our case it is not practical: since our prefixes contain up to 4 stems, and we want to encode \(2^{16}\) tokens, even with Huffman coding we would not be able to compress each prefix under 4 bytes, so we would have to use 8-byte storage anyway. To simplify the implementation, we opted for a simple non-compressing encoding instead.
Each of the \(2^{16}-2\) top-frequency stems, as well as the 2 special tokens STOS (Start-of-Sentence) and UNK (Unknown), is assigned a code from \(0\) to \(2^{16}-1\), which may be represented as 2 bytes.
For 1-grams we do not require any encoding at all, because the prefix is always NA. Remember, the prefix is the first (n-1) words, that is \(0\) words for 1-grams.
For 2-grams the prefix contains just 1 stem, so the encoding is simply the code of the stem (a 4-byte integer).
For 3-grams the prefix contains 2 stems. We transform their codes to 2 bytes each, concatenate these bytes into a 4-byte sequence, and interpret the sequence as a 4-byte integer.
For 4-grams the prefix contains 3 stems. We transform their codes to 2 bytes each, concatenate these bytes together with two 0x00 bytes into an 8-byte sequence, and interpret the sequence as an 8-byte numeric.
For 5-grams the prefix contains 4 stems. We use the same approach as for 4-grams, except that we do not need to add 0x00 bytes, since the prefix codes already give us 8 bytes.
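Before walking through the project's own encoding functions, here is a minimal self-contained sketch of the byte-packing idea for a 2-stem prefix, using writeBin() and readBin(). These are hypothetical helpers for illustration only; the project's functions are defined in include/ngram.encode.R.

# Pack two 16-bit stem codes into one 4-byte integer (little-endian),
# and unpack them again.
pack.codes.2 <- function(code1, code2) {
  bytes <- writeBin(as.integer(c(code1, code2)), raw(), size = 2, endian = "big")
  readBin(bytes, "integer", size = 4, endian = "little")
}
unpack.codes.2 <- function(x) {
  bytes <- writeBin(as.integer(x), raw(), size = 4, endian = "little")
  readBin(bytes, "integer", size = 2, n = 2, endian = "big", signed = FALSE)
}
pack.codes.2(2L, 619L)                  # codes of the stems "the" and "answer"
unpack.codes.2(pack.codes.2(2L, 619L))  # recovers 2 and 619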
Let us walk through several examples. We will encode the 4-gram “the answer to that”, as well as its 1-, 2- and 3-word prefixes.
First, let us lookup codes for each stem in our 4-gram:
source("include/ngram.encode.R")
dict.stems <- dict.stems.cache()
dict.hash <- dict.hash.cache()
dict.hash[[c("the", "answer", "to", "that")]]
## [1] 2 619 3 11
The corresponding hexadecimal codes are 0x0002, 0x026B, 0x0003, 0x000B.
We start by encoding and decoding the first word, “the”. For a 1-gram the encoded value is just an integer equal to the code, that is, the encoded value is 2.
enc.1 <- tokens.encode.1(c("the"), dict.hash)
enc.1
## [1] 2
tokens.decode.1(enc.1, dict.stems)
## [1] "the"
We continue with the 2-gram “the answer”. The concatenated hexadecimal codes are 0x0002026B. Interpreted as an integer (little-endian), that gives us 1795293696.
enc.2 <- tokens.encode.2(c("the", "answer"), dict.hash)
enc.2
## [1] 1795293696
tokens.decode.2(enc.2, dict.stems)
## [1] "the" "answer"
Next we encode the 3-gram “the answer to”. We concatenate the hexadecimal codes and append two 0x00 bytes to get 8 bytes in total, obtaining 0x0002026B00030000. Interpreted as a numeric (little-endian), that gives us 1.6305797604007e-311. Note that the decimal notation is not precise enough: to correctly decode the value, we require the exact value which preserves every single bit.
enc.3 <- tokens.encode.3(c("the", "answer", "to"), dict.hash)
enc.3
## [1] 1.63058e-311
tokens.decode.3(enc.3, dict.stems)
## [1] "the" "answer" "to"
Finally, we encode the 4-gram “the answer to that”. We concatenate the hexadecimal codes to obtain the 8-byte sequence 0x0002026B0003000B, and interpret it as a numeric (little-endian): 1.0663795695223075e-255. The same note about the decimal notation as for the 3-gram applies here as well.
enc.4 <- tokens.encode.4(c("the", "answer", "to", "that"), dict.hash)
enc.4
## [1] 1.06638e-255
tokens.decode.4(enc.4, dict.stems)
## [1] "the" "answer" "to" "that"
We will use the binary encoding of prefixes in the next section to reduce the size of n-gram tables.
Optimizing n-gram tables
In the section Building n-grams we built n-gram tables and found that they are too large. In the previous section Binary encoding for stems we developed a method to encode prefixes of n-grams either as an integer or as a numeric, depending on the size of the prefix.
Let us apply the encoding to the prefixes of our n-gram tables. We also use this opportunity to remove the column Prefix from 1-grams, since it is always NA anyway.
source("include/ngram.optimize.R")
for (n in 1:5) {
ngram.optimize.prefix.cache(n, removeStopwords = FALSE)
ngram.optimize.prefix.cache(n, removeStopwords = TRUE)
}
On our hardware (Intel i5-7600K) the commands above run for a bit less than 4 hours.
The following table shows the size of the n-gram tables with encoded prefixes, compared to the original tables. The improvement is more pronounced for larger n. For instance, the size of the 5-gram table with stop words is reduced from 3614.9 MiB to 999.6 MiB, or to just 27.65% of the former size.
With stop words:

| N | Rows | Original Size, MiB | Optimized Size, MiB | Optimized vs. Original, % |
|---|---:|---:|---:|---:|
| 1 | 524,351 | 42.0 | 38.0 | 90.48 |
| 2 | 8,240,587 | 222.3 | 187.2 | 84.21 |
| 3 | 25,403,862 | 972.1 | 512.0 | 52.67 |
| 4 | 38,589,792 | 2,413.9 | 909.1 | 37.66 |
| 5 | 42,597,132 | 3,614.9 | 999.6 | 27.65 |
| Total: | | 7,265.2 | 2,645.9 | 36.42 |

Without stop words:

| N | Rows | Original Size, MiB | Optimized Size, MiB | Optimized vs. Original, % |
|---|---:|---:|---:|---:|
| 1 | 523,757 | 42.0 | 38.0 | 90.48 |
| 2 | 11,449,151 | 295.7 | 248.4 | 84.00 |
| 3 | 23,256,283 | 1,063.5 | 470.3 | 44.22 |
| 4 | 22,993,832 | 1,941.9 | 550.5 | 28.35 |
| 5 | 19,821,557 | 1,955.6 | 475.6 | 24.32 |
| Total: | | 5,298.7 | 1,782.8 | 33.65 |
Another opportunity for reducing the required memory is the column Prob, which contains the maximum likelihood estimate of the conditional probability of the Suffix given the Prefix. Instead of storing it in a numeric column (8 bytes per value), we may squeeze it into an integer (4 bytes) using a scaled-log approach. The easiest way to explain this approach is with an example:
- There are 330,145 2-grams starting with the word “at”. In 2,035 cases the word “at” is followed by the word “the”, giving the 2-gram “at the”. We calculate the maximum likelihood estimate of the probability of the suffix “the” given the prefix “at” as
\[P_{ML}(\text{the} | \text{at}) = \frac{c(\text{at the})}
{\sum c(\text{at } \bullet)} = \frac{2035}{330145} \approx 0.006163958\]
We calculate the natural logarithm of the value: \(\log(P_{ML}(\text{the} | \text{at})) =\) -5.0890361.
We multiply the logarithm by \(10^6\) and take the integer part of the result. Multiplying by \(10^6\) ensures that we preserve 6 digits after the decimal point of the original logarithm. In our example we get -5089036. This is the value we store in our n-gram table as an integer, saving 4 bytes.
When we require the original value, we invert the transformation, dividing the stored value by \(10^6\) and taking the exponent. Continuing our example, exp(-5089036 / \(10^6\)) = 0.006163959.
By rounding to an integer we lose some precision, but the difference is just 0.0000000008, or 0.00000013% of the original value, which is negligible for our purposes.
At the price of a very slight precision loss, we can store our values as an integer (4 bytes per value) instead of a numeric (8 bytes).
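A minimal sketch of this scaled-log encoding (assumed helper names for illustration; the actual implementation is part of the scripts run below):

# Encode a probability as a scaled-log 4-byte integer and decode it back.
prob.encode <- function(p) as.integer(trunc(log(p) * 1e6))
prob.decode <- function(x) exp(x / 1e6)

p <- 2035 / 330145           # P_ML("the" | "at") from the example above
prob.encode(p)               # -5089036
prob.decode(prob.encode(p))  # 0.006163959, negligibly off the original
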
for (n in 1:5) {
ngram.optimize.prob.cache(n, removeStopwords = FALSE)
ngram.optimize.prob.cache(n, removeStopwords = TRUE)
}
On our hardware (Intel i5-7600K) the commands above run for approximately 5 minutes.
The following table shows the size of the n-gram tables with probabilities optimized for low memory consumption, compared to the previous version with encoded prefixes. The size reduction is not as significant as in the previous step. For instance, the total size of the n-gram tables with stop words is reduced from 2645.9 MiB to 2205.9 MiB, or to 83.37% of the former size.
With stop words:

| N | Rows | Original Size, MiB | Optimized Size, MiB | Optimized vs. Original, % |
|---|---:|---:|---:|---:|
| 1 | 524,351 | 38.0 | 36.0 | 94.74 |
| 2 | 8,240,587 | 187.2 | 155.8 | 83.23 |
| 3 | 25,403,862 | 512.0 | 415.1 | 81.07 |
| 4 | 38,589,792 | 909.1 | 761.9 | 83.81 |
| 5 | 42,597,132 | 999.6 | 837.1 | 83.74 |
| Total: | | 2,645.9 | 2,205.9 | 83.37 |

Without stop words:

| N | Rows | Original Size, MiB | Optimized Size, MiB | Optimized vs. Original, % |
|---|---:|---:|---:|---:|
| 1 | 523,757 | 38.0 | 36.0 | 94.74 |
| 2 | 11,449,151 | 248.4 | 204.7 | 82.41 |
| 3 | 23,256,283 | 470.3 | 381.6 | 81.14 |
| 4 | 22,993,832 | 550.5 | 462.7 | 84.05 |
| 5 | 19,821,557 | 475.6 | 400.0 | 84.10 |
| Total: | | 1,782.8 | 1,485.0 | 83.30 |
The total memory usage is still above our maximum allowed size of 1 GB. To achieve it, we have to find other ways to reduce the n-gram tables. But before we do that, let us discuss our prediction algorithm and run some tests.
Stupid Backoff algorithm
The prediction algorithm used by our application consists of 2 steps: choosing candidates, and scoring candidates to select the best matches.
We decided to use a very simple selection algorithm: we choose all available candidates, proceeding from 5-grams to 1-grams and skipping suffixes which we have already collected. More precisely, given the input text \(w_1, w_2,\ldots,w_{n-1}\), we proceed as follows:
Choose 5-gram candidates which start with the 4-gram prefix \(w_{n-4}, w_{n-3}, w_{n-2}, w_{n-1}\).
Choose 4-gram candidates which start with the 3-gram prefix \(w_{n-3}, w_{n-2}, w_{n-1}\), except those whose suffix already appeared among the chosen 5-grams.
The same for 3- and 2-grams.
Our implementation allows choosing the number of candidates to show. If at this step we have found fewer candidates than requested, we choose the missing candidates from the most frequent 1-grams (remember, 1-grams do not have a prefix).
Note that we do not stop once we have chosen the required number of candidates, but continue to n-grams with lower n. Suppose that we were requested to predict the 3 most probable words, and there are exactly 3 5-grams with the appropriate prefix. If we stopped choosing candidates at this moment, we might miss a 4-gram which produces a prediction with a higher score than some of the 5-grams.
The scoring algorithm selects the best N matches from all candidates by calculating a numerical score for each candidate and choosing the N candidates with the top scores. The score may be a probability, but it may also be a different kind of numerical quantifier, as is the case for the Stupid Backoff algorithm.
The Stupid Backoff algorithm is a highly efficient scoring algorithm proposed in 2007 by Thorsten Brants et al. (2). On large data sets the algorithm gives scores close to the Kneser-Ney algorithm, but is significantly faster. The Stupid Backoff algorithm returns not probabilities but relative scores of words (they do not sum to 1), which is sufficient for our purposes.
The Stupid Backoff algorithm is described by the following formula:
\[SB(w_n|w_1, w_2,\ldots,w_{n-1}) = \begin{cases}
P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1}) \text{ if } P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1}) > 0
\\
\lambda \; SB(w_n|w_2,\ldots,w_{n-1}) \text{ otherwise}
\end{cases}\]
where \(P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1})\) is the maximum likelihood estimate of the conditional probability of the word \(w_n\) given the prefix \(w_1, w_2,\ldots,w_{n-1}\). The authors of the Stupid Backoff algorithm recommend using \(\lambda = 0.4\).
In other words, first we attempt to look up the n-gram in the table for the largest n available. If the n-gram is found, then the score of the last word is the maximum likelihood estimate of the conditional probability of the last word given the prefix. Otherwise, we back off (hence the algorithm's name) to the table with (n-1)-grams, do the same calculation and multiply the result by \(\lambda = 0.4\). If the shortened prefix is not found either, the recursion goes deeper, terminating at 1-grams.
Note that our n-gram tables already contain the required maximum likelihood estimates (MLE), so the calculation of the score is trivial:
If the chosen candidate is from the 5-gram table, return its MLE as the score.
If the chosen candidate is from the 4-gram table, return its MLE multiplied by \(\lambda = 0.4\).
If the chosen candidate is from the 3-gram table, return its MLE multiplied by \(\lambda^2 = 0.4^2 = 0.16\).
The same for 2- and 1-grams, each time multiplying by \(\lambda\) one more time.
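The following sketch illustrates the combined candidate selection and scoring described above. It assumes a simplified data layout (a list of data frames indexed by n, each with character columns Prefix and Suffix and a numeric column Prob), omits the 1-gram fallback, and is not the project's actual implementation.

# Illustrative Stupid Backoff prediction (simplified sketch).
lambda <- 0.4
sb.predict <- function(words, tables, max.n = 5, top = 10) {
  found <- data.frame(Suffix = character(0), Score = numeric(0),
                      stringsAsFactors = FALSE)
  for (n in max.n:2) {
    if (length(words) < n - 1) next
    prefix <- paste(tail(words, n - 1), collapse = " ")
    # Keep only candidates whose suffix we have not collected yet.
    hits <- tables[[n]][tables[[n]]$Prefix == prefix &
                          !(tables[[n]]$Suffix %in% found$Suffix), ]
    if (nrow(hits) > 0) {
      # Candidates found at level n get the MLE scaled by lambda^(5 - n).
      found <- rbind(found, data.frame(Suffix = hits$Suffix,
                                       Score = hits$Prob * lambda^(max.n - n),
                                       stringsAsFactors = FALSE))
    }
  }
  head(found[order(-found$Score), ], top)
}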
To better understand the algorithm, we will walk through an example. We want to predict the next word after the sequence “you pay to get”.
- We look up the 4-word prefix “you pay to get” in the table with 5-grams. We find 3 matches there; the score for each of them equals the probability:
| Prefix | Suffix | Probability | Score |
|---|---|---:|---:|
| you pay to get | in | 0.50 | 0.50 |
| you pay to get | published | 0.25 | 0.25 |
| you pay to get | there | 0.25 | 0.25 |
We add all these matches to the list of candidates.
- We look up the 3-word prefix “pay to get” in the table with 4-grams. We find 22 matches; the score for each of them equals the probability multiplied by \(\lambda = 0.4\). The table below lists the top 5 candidates.
| Prefix | Suffix | Probability | Score |
|---|---|---:|---:|
| pay to get | in | 0.1430 | 0.05720 |
| pay to get | the | 0.1140 | 0.04560 |
| pay to get | a | 0.0857 | 0.03428 |
| pay to get | my | 0.0857 | 0.03428 |
| pay to get | their | 0.0571 | 0.02284 |
The first suffix, “in”, is already in our list, since we have already found it in the table with 5-grams. The next suffix, “the”, is not in our list yet, so we add it. We continue the same way for all 4-grams, adding suffixes which are not in our list yet.
- Although we have already found 22 candidates, we continue to 3-grams, because it could happen that some 3-gram candidate gets a score high enough to make it into the top list. We look up the 2-word prefix “to get” in the table with 3-grams. We find more than 3000 matches; the score for each of them equals the probability multiplied by \(\lambda^2 = 0.4^2 = 0.16\). The table below lists the top 5 of them.
| Prefix | Suffix | Probability | Score |
|---|---|---:|---:|
| to get | a | 0.0858 | 0.013728 |
| to get | the | 0.0700 | 0.011200 |
| to get | to | 0.0508 | 0.008128 |
| to get | out | 0.0338 | 0.005408 |
| to get | back | 0.0334 | 0.005344 |
We add to our list the candidates which are not present in it yet.
We repeat the same step with 2-grams, looking up the 1-word prefix “get”. We find nearly 10,000 matches; the score for each of them equals the probability multiplied by \(\lambda^3 = 0.4^3 = 0.064\). We add to our list the candidates which are not present in it yet.
If we already have enough candidates (which is the case, unless we want to find more than 10,000 top-score candidates), we do not need to get context-free candidates from the 1-gram table.
Suppose our goal is to find the top 10 candidates. We sort our list by score in descending order and select the top 10 rows:
| Prefix | Suffix | Probability | Score |
|---|---|---:|---:|
| you pay to get | in | 0.50000 | 0.5000 |
| you pay to get | published | 0.25000 | 0.2500 |
| you pay to get | there | 0.25000 | 0.2500 |
| pay to get | the | 0.11420 | 0.0457 |
| pay to get | a | 0.08571 | 0.0342 |
| pay to get | my | 0.08570 | 0.0342 |
| pay to get | their | 0.05710 | 0.0228 |
| pay to get | out | 0.05710 | 0.0228 |
| pay to get | inside | 0.02850 | 0.0114 |
| pay to get | past | 0.02857 | 0.0114 |
Extension of the algorithm
We added to the base algorithm a simple optional extension which allows predicting the word the user is currently typing. As long as the last typed character is not a space, the algorithm assumes that the user is still typing the current word and predicts its completion. Only after a space character is typed does the algorithm predict the next word.
The extension integrates easily into the step where we choose candidate words. Given the prefix \(w_1,w_2,\ldots,w_{n-1}\) and a partially typed word \(w^\prime_n\), we choose all 5-gram candidates which start with the prefix \(w_1,w_2,\ldots,w_{n-1}\) and whose 5th word starts with the word prefix \(w^\prime_n\), and similarly for 4- to 1-grams.
Some words may be missing from our dictionary, and the user may misspell some words. If we cannot find any candidates using our n-grams, we attempt to predict the next word from the partially entered word by applying a spelling correction algorithm provided by the R package hunspell. For example, if the user enters the misspelled word “mashine”, we propose the spell-corrected word “machine”.
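For illustration, the spelling-correction fallback can be reproduced with the hunspell package directly (assuming the package and its default English dictionary are installed; the project uses its own wrapper around it):

library(hunspell)
# Suggest corrections for a misspelled, partially entered word;
# the first suggestion for "mashine" is typically "machine".
hunspell_suggest("mashine")[[1]]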
Testing the algorithm
Now we are ready to test our algorithm for predicting the next word, using the 20% of the data we have set aside for testing. We run tests for each data source, as well as for the aggregated data.
Choose 100,000 random sample sentences from each source (blogs, news, Twitter). Split each selection into 100 batches of 1,000 sentences each.
Create an aggregated test set of 100,000 sentences by choosing 1/3 of the sentences from each source. Split the aggregated test set into batches as well.
Choose a random word in each sentence, but not the very first word. Use the part of the sentence before the selected word as a prefix, and attempt to predict the selected word.
Run the prediction algorithm for all samples, predicting the top 5 candidates.
For each batch of 1,000 sentences, calculate the percentage of cases when the word actually present in the sentence was among the top 1, top 3 or top 5 predicted candidates.
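A minimal sketch of how a single test sample could be formed from a sentence (a hypothetical helper; the actual logic is in include/predict.test.R):

# Pick a random word (never the first one) as the prediction target,
# and use the preceding words as the prefix.
make.test.case <- function(sentence) {
  words <- strsplit(sentence, "\\s+")[[1]]
  i <- sample.int(length(words) - 1, 1) + 1
  list(prefix = paste(words[seq_len(i - 1)], collapse = " "),
       target = words[i])
}
make.test.case("i had the best day yesterday")
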
source("include/predict.test.R")
for (source in c("blogs", "news", "twitter", "all")) {
set.seed(12345)
predict.test.sb.cache(source, "testing", n.samples = 100000,
threshold = 1, removeStopwords = TRUE)
set.seed(12345)
predict.test.sb.cache(source, "testing", n.samples = 100000,
threshold = 1, removeStopwords = FALSE)
}
The meaning of the parameter “threshold” will be explained later, when we optimize the meta-parameters of the prediction algorithm. For now it is enough to say that “threshold = 1” uses the complete n-gram tables we have built before.
On our hardware (Intel i5-7600K) the commands above run for a bit more than 7 hours.
The chart below shows the precision of our prediction algorithm for the aggregated data sources, as well as separately for blogs, news and Twitter, with and without stop words. The first chart column shows the percentage of correct predictions (our top candidate is exactly the word present in the original text). The second and third chart columns show the percentage of cases when the right word was among our Top-3 or Top-5 predicted candidates.

Prediction precision, with stop words

| Source | Correctly predicted: Mean, % | Correctly predicted: 95% CI | In Top 3: Mean, % | In Top 3: 95% CI | In Top 5: Mean, % | In Top 5: 95% CI |
|---|---:|---|---:|---|---:|---|
| Aggregated | 16.45 | (16.20, 16.70) | 26.34 | (26.06, 26.62) | 31.41 | (31.11, 31.70) |
| Blogs | 15.40 | (15.19, 15.62) | 25.58 | (25.30, 25.86) | 30.80 | (30.53, 31.08) |
| News | 17.51 | (17.28, 17.74) | 27.50 | (27.22, 27.78) | 32.56 | (32.27, 32.85) |
| Twitter | 15.69 | (15.42, 15.95) | 25.35 | (25.06, 25.63) | 30.19 | (29.89, 30.49) |
Prediction precision, without stop words

| Source | Correctly predicted: Mean, % | Correctly predicted: 95% CI | In Top 3: Mean, % | In Top 3: 95% CI | In Top 5: Mean, % | In Top 5: 95% CI |
|---|---:|---|---:|---|---:|---|
| Aggregated | 4.00 | (3.88, 4.13) | 6.13 | (5.98, 6.29) | 7.29 | (7.12, 7.46) |
| Blogs | 2.73 | (2.61, 2.85) | 4.45 | (4.31, 4.59) | 5.44 | (5.28, 5.60) |
| News | 5.16 | (5.02, 5.31) | 7.60 | (7.43, 7.77) | 8.87 | (8.69, 9.04) |
| Twitter | 3.92 | (3.78, 4.06) | 6.10 | (5.93, 6.27) | 7.31 | (7.13, 7.49) |
The first important insight from the chart and tables above is that the prediction based on n-grams with stop words outperforms the same algorithm based on n-grams without stop words by more than 10%. From now on we continue development and tuning only using n-grams with stop words.
It is interesting to note that our algorithm predicts the next word best for sentences from the news source. Blogs and Twitter are very close, with Twitter slightly better for the exact match and slightly worse for the Top-3 and Top-5 predictions. The likely explanation is that news tends to use the same expressions more often than the less formal blogs and Twitter.
Now that we know our algorithm generally works, it is time to optimize it.
Optimizing meta-parameters
Until now we have used all collected n-grams to predict the next word using the Stupid Backoff algorithm. As we have calculated, even after all previous optimizations our n-gram tables are too large to be used in a Shiny application. Could we remove some n-grams without losing much prediction power? We are going to answer the question in this section.
Our n-gram tables contain all collected n-grams, but are all of them equally useful for our prediction algorithm? For example, the 5-gram “at the end of the” appears in our corpus 2181 times, whereas the 5-gram “can’t go anywhere with them” appears only once. Intuitively, the first n-gram is much more useful, because the probability of encountering it again in other texts is much higher than the probability of finding a rare n-gram in another text.
In the Milestone Report we also found that rare n-grams comprise a very large part of the tables for large n. For example, there are 524,351 distinct 1-grams, and 264,033 (50.35%) appear just once. For 5-grams it looks even more dramatic: our corpus contains 42,597,132 distinct 5-grams, and 40,730,276 (95.62%) appear just once. Removing from our tables the entries which appear just once could significantly reduce the memory usage.
We say that an n-gram table has a threshold \(m\) if we keep in the table only the entries which appear in the corpus at least \(m\) times and remove the rest. For example, an n-gram table with threshold 3 contains only n-grams which appear at least 3 times. Obviously, the complete table with all available n-grams is a table with threshold 1.
To analyze how removing low-frequency n-grams affects the precision of our prediction algorithm, we build n-gram tables with thresholds 2 to 6, and run our prediction algorithm with these tables. Our approach is the same as in the previous section: for each source and threshold we process 100 batches of 1,000 samples each.
source("include/predict.test.R")
for (source in c("blogs", "news", "twitter", "all")) {
for (thres in 2:6) {
set.seed(12345)
predict.test.sb.cache(source, "testing", n.samples = 100000,
threshold = thres, removeStopwords = TRUE)
set.seed(12345)
predict.test.sb.cache(source, "testing", n.samples = 100000,
threshold = thres, removeStopwords = FALSE)
}
}
On our hardware (Intel i5-7600K) the commands above run for a bit more than 32 hours.
The chart and table below show how our prediction quality changes as we increase the threshold.

Prediction precision (%) vs. threshold

| Threshold | Correct: Aggr. | Correct: Blogs | Correct: News | Correct: Twitter | Top 3: Aggr. | Top 3: Blogs | Top 3: News | Top 3: Twitter | Top 5: Aggr. | Top 5: Blogs | Top 5: News | Top 5: Twitter |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 16.45 | 15.40 | 17.51 | 15.69 | 26.34 | 25.58 | 27.50 | 30.19 | 31.41 | 30.80 | 32.56 | 30.19 |
| 2 | 16.95 | 16.42 | 17.66 | 16.13 | 26.98 | 26.74 | 27.73 | 30.69 | 31.89 | 31.95 | 32.49 | 30.69 |
| 3 | 16.63 | 16.25 | 17.31 | 15.68 | 26.52 | 26.45 | 27.13 | 30.05 | 31.35 | 31.59 | 31.79 | 30.05 |
| 4 | 16.24 | 16.06 | 16.85 | 15.27 | 25.99 | 26.07 | 26.57 | 29.54 | 30.81 | 31.17 | 31.20 | 29.54 |
| 5 | 15.97 | 15.82 | 16.51 | 14.99 | 25.70 | 25.80 | 26.19 | 29.14 | 30.48 | 30.83 | 30.78 | 29.14 |
| 6 | 15.73 | 15.63 | 16.22 | 14.73 | 25.30 | 25.51 | 25.82 | 28.82 | 30.09 | 30.53 | 30.38 | 28.82 |
| Thres. 6 vs. best | -1.22 | -0.79 | -1.43 | -1.41 | -1.67 | -1.22 | -1.91 | -1.87 | -1.80 | -1.42 | -2.17 | -1.87 |
The table above shows how the precision of our prediction algorithm changes depending on the threshold. In each column the highest value is the best achieved precision of the prediction. The last line compares the precision with threshold 6 to the best precision in the same column.
From the chart and the table we may conclude the following:
In most cases we achieve the highest precision with threshold 2, that is, keeping only n-grams which appear in the corpus at least 2 times. The intuitive explanation is that the Stupid Backoff algorithm scores longer n-grams significantly higher than shorter n-grams, without taking the absolute frequency into account. A 5-gram which appears just once could be a typo or an extremely rare expression, but the Stupid Backoff algorithm will often assign it a high score. Therefore, removing n-grams which appear just once improves the precision of the prediction.
As we increase the threshold further, the precision of our prediction is reduced, but not very dramatically. If instead of aiming for the best precision we use threshold 6, we lose on average about 1.5%.
Now let us check the size of the n-gram tables in memory for various thresholds. The following chart and table show the impact on table sizes of gradually increasing the threshold from 1 (include all n-grams) to 6 (include only n-grams which appear in the corpus at least 6 times).

Size of n-gram tables in memory

| Threshold | 1-grams: Size, MiB | 1-grams: % of Max | 2-grams: Size, MiB | 2-grams: % of Max | 3-grams: Size, MiB | 3-grams: % of Max | 4-grams: Size, MiB | 4-grams: % of Max | 5-grams: Size, MiB | 5-grams: % of Max | Total: Size, MiB | Total: % of Max |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 36.0 | 100.00 | 155.8 | 100.00 | 415.1 | 100.00 | 761.9 | 100.00 | 837.1 | 100.00 | 2205.9 | 100.00 |
| 2 | 17.7 | 49.17 | 49.2 | 31.58 | 74.8 | 18.02 | 70.0 | 9.19 | 38.1 | 4.55 | 249.8 | 11.32 |
| 3 | 13.1 | 36.39 | 30.7 | 19.70 | 39.0 | 9.40 | 30.2 | 3.96 | 13.6 | 1.62 | 126.6 | 5.74 |
| 4 | 10.9 | 30.28 | 22.9 | 14.70 | 26.3 | 6.34 | 18.5 | 2.43 | 7.7 | 0.92 | 86.3 | 3.91 |
| 5 | 9.5 | 26.39 | 18.5 | 11.87 | 19.8 | 4.77 | 13.1 | 1.72 | 5.1 | 0.61 | 66.0 | 2.99 |
| 6 | 8.5 | 23.61 | 15.7 | 10.08 | 15.8 | 3.81 | 10.0 | 1.31 | 3.8 | 0.45 | 53.8 | 2.44 |
As the chart and the table demonstrate, increasing the threshold dramatically reduces the required memory. As we increase the threshold from 1 to 6, the size reduction is less for 1-grams (from 36 MiB to 8.5 MiB, that is to 23.61% of the original), but is pretty spectacular for 5-grams (from 837.1 MiB to 3.8 MiB, that is to 0.45% of the original). The total required memory goes down from 2205.9 MiB to 53.8 MiB, that is to 2.44% of the original.
If we use threshold 6, that is, include only n-grams which appear in the corpus at least 6 times, the precision of our prediction is reduced by approximately 1.5%, but the required memory falls to just 2.44% of the original and is acceptable for a Shiny application. This looks like a good deal, so we decided to use the n-gram tables with threshold 6.
Algorithm speed
A note on the speed of the algorithm. For each value of the threshold we processed 100,000 samples for each of the sources (blogs, news and Twitter), as well as an additional 100,000 samples aggregated from all sources, for 400,000 samples in total.

Average duration of prediction vs. threshold

| Threshold | Duration, hours | No. of samples | Duration per sample, ms |
|---|---|---:|---:|
| 1 | 03:49:52 | 400,000 | 34.5 |
| 2 | 02:42:17 | 400,000 | 24.3 |
| 3 | 02:35:16 | 400,000 | 23.3 |
| 4 | 02:22:02 | 400,000 | 21.3 |
| 5 | 02:13:27 | 400,000 | 20.0 |
| 6 | 02:04:35 | 400,000 | 18.7 |
As the chart and table show, increasing the threshold from 1 to 6 reduces the average prediction duration from 34.5 ms to 18.7 ms, or by 45.80%.
Validation
As the last step, we are going to validate our prediction algorithm on out-of-sample data, that is, on the data set we have saved for validation earlier. Our testing approach is the same as before: 100 batches of 1,000 samples each for each corpus, as well as for the aggregated corpora.
source("include/predict.test.R")
for (source in c("blogs", "news", "twitter", "all")) {
set.seed(12345)
predict.test.sb.cache(source, "validation", n.samples = 100000,
threshold = 6, removeStopwords = TRUE)
set.seed(12345)
predict.test.sb.cache(source, "validation", n.samples = 100000,
threshold = 6, removeStopwords = FALSE)
}
On our hardware (Intel i5-7600K) the commands above run for approximately 2.5 hours.

Prediction precision (validation)

| Source | Correctly predicted: Mean, % | Correctly predicted: 95% CI | In Top 3: Mean, % | In Top 3: 95% CI | In Top 5: Mean, % | In Top 5: 95% CI |
|---|---:|---|---:|---|---:|---|
| Aggregated | 15.40 | (15.15, 15.65) | 25.00 | (24.73, 25.26) | 29.86 | (29.58, 30.14) |
| Blogs | 15.42 | (15.19, 15.66) | 25.43 | (25.13, 25.73) | 30.50 | (30.20, 30.81) |
| News | 16.16 | (15.91, 16.41) | 25.67 | (25.41, 25.94) | 30.28 | (30.00, 30.55) |
| Twitter | 14.77 | (14.55, 14.98) | 24.18 | (23.95, 24.41) | 29.06 | (28.81, 29.31) |
The out-of-sample validation results are very close to those observed before, indicating that the model is not overfitted.
Application
A Shiny application using our prediction algorithm is available online: https://serdioa.shinyapps.io/sb-predict.
The application GUI provides the following elements:
Text area: enter the text here, and the predicted words will appear on the chart below. The chart shows multiple candidates ordered by score, providing a visual clue as to how probable each candidate is. When “Predict partially entered words” is activated, you have to type a space character to predict the next word; otherwise an ending of the current word is predicted.
Number of predictions: choose from 1 to 10 candidates to predict.
Predict partially entered words: select the checkbox to activate the extension which predicts partially entered words.
Random sample: populate the text area with a random sample from a selection of over 1000 prepared sample texts.

- The application shows predictions as you type, displaying a chart with ranks of predicted words.

- The prediction itself takes less than 20 ms; most of the observed delay is due to the Shiny framework and network latency.