Content

Synopsis

This document provides the background and describes the approach used to implement the application “Next Word Prediction: Stupid Backoff”.

The application was developed as a capstone project for the Data Science Specialization, a cycle of courses offered on Coursera by Johns Hopkins University. The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, in mobile devices to provide suggestions as the user enters text, or as a part of a spelling correction module in a text editor.

You may find the following links useful:

Model

We are going to predict the next word in a text given a prefix by using a Markov chain model simplified to n-grams.

The Markov chain model assumes that in a natural-language sentence the probability of each word depends only on the preceding words. The n-gram model simplifies the Markov chain model by considering each word to depend only on a fixed number of preceding words (the N-1 preceding words for an N-gram model), thus ignoring long-range dependencies. The n-gram model combines simplicity with acceptable prediction quality, making it the model of choice for many Natural Language Processing (NLP) applications.
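
Formally, the n-gram model approximates the conditional probability of a word given its entire history by conditioning only on the last N-1 words:

\[P(w_i | w_1,\ldots,w_{i-1}) \approx P(w_i | w_{i-N+1},\ldots,w_{i-1})\]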

Downloading and splitting the corpus

Before we start discussing the application in detail, a short general remark. We decided not to include the R script sources in this document, but to keep them in separate script files which we reference. Moreover, many of the steps required to prepare the data and tune meta-parameters take many hours to run, so it would be impossible to re-run all of them each time this document is re-rendered. R Markdown provides some facilities for caching intermediate results, but they are not adequate for our task, since we were not able to keep all intermediate results in memory at once. Hence our approach was to process the data piecewise, explicitly caching intermediate results.

Coursera provides a training text corpus, HC Corpora (1). The corpora contain texts in several languages collected from various sources on the Web, including blogs, news web sites and Twitter. The English corpus consists of approximately 4.2 million lines of text.

We start by downloading and unzipping the text corpora. The downloaded data is stored in the directory “cache”.

source("include/download.R")
download.data()

The downloaded zip file contains corpora in several languages: English, German, Russian and Finnish. In our project we use only the English corpus.

The corpus for each language, including English, contains 3 files with content obtained from different sources: news, blogs and Twitter.

As the first step, we split each file of the English corpus into 3 parts: training (60%), testing (20%) and validation (20%):

source("include/split.R")
set.seed(20190530)
split.all()

As a sanity check, we count the number of lines in each source file, as well as in the partial files produced by the split.

                             Blogs                   News                    Twitter
                             Rows       %            Rows       %            Rows       %
Training                     539572     59.99991     606145     59.99998     1416088    59.99997
Testing                      179858     20.00004     202048     19.99996     472030     20.00002
Validation                   179858     20.00004     202049     20.00006     472030     20.00002
Total                        899288     100.00000    1010242    100.00000    2360148    100.00000
Control (expected to be 0)   0          NA           0          NA           0          NA

As the table shows, we have split the data into subsets as intended.

Cleaning and preprocessing the corpus

We decided to split the text into sentences and not to attempt predicting words across sentence boundaries. We may still use information about sentence boundaries to improve the prediction of the first word, because the frequency of a word at the beginning of a sentence may be very different from its average frequency.
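
A naive illustration of splitting text into sentences is shown below; the real preprocessing in include/preprocess.R is more careful (abbreviations, quotes and similar cases need special handling), so this is only a sketch.

# Split on whitespace that follows sentence-ending punctuation.
text <- "This is one sentence. And here is another one."
unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
## [1] "This is one sentence."    "And here is another one."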

We apply the following cleaning and pre-processing steps:

Natural language processing has a notion of stop words, that is, words which occur very often and are often filtered out during processing. At this step it is not obvious whether we should include or ignore stop words for the purposes of our task, so we decided to continue with both approaches in parallel. Thus our pre-processing function has an additional optional step which removes stop words.

source("include/preprocess.R")
preprocess.all(removeStopwords = FALSE)
preprocess.all(removeStopwords = TRUE)

On our hardware (Intel i5-7600K) the commands above run for several minutes.
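
For reference, the optional stop-word removal step can be sketched as follows (an illustration only, using the stopwords package; the actual implementation is in include/preprocess.R and may differ):

# Remove English stop words from a tokenized sentence.
words <- c("there", "is", "a", "house", "in", "new", "orleans")
words[!words %in% stopwords::stopwords("en")]
# Leaves only the content words: "house", "new", "orleans".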

Building n-grams

In the previous section we pre-processed the corpora: removed URLs and e-mail addresses, got rid of punctuation, etc. Our next task is to build n-grams and the corresponding frequency tables.

An important question is which library to use for processing and analyzing the corpora, as R provides several alternatives. Initially we attempted to use the library tm, but quickly found that it is very memory-hungry, and an attempt to build bi- or trigrams for a large corpus is not practical. After some googling we decided to use the library quanteda instead.

When building n-grams, we apply the following principles:

source("include/ngram.build.R")
for (n in 1:5) {
  ngram.freq.cache(n, removeStopwords = FALSE)
  ngram.freq.cache(n, removeStopwords = TRUE)
}

On our hardware (Intel i5-7600K) the commands above run for approximately 6 hours.

As a result of this step, for 1- to 5-grams we obtain frequency tables with the columns Term (the n-gram \(w_1,\ldots,w_n\)) and Freq (the number of times \(c(w_1,\ldots,w_n)\) this n-gram appears in the aggregated corpus).
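
The frequency tables themselves can be built with quanteda roughly as follows; this is a simplified sketch, the project code in include/ngram.build.R processes the corpus in chunks.

library(quanteda)

toks <- tokens(c("to be or not to be", "to be is to do"))
trigrams <- tokens_ngrams(toks, n = 3, concatenator = " ")
freq <- colSums(dfm(trigrams))
head(sort(freq, decreasing = TRUE), 3)
# A named vector of counts, equivalent to the Term / Freq table described above.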

For the purpose of our task, that is prediction of the next word, we require the frequency table to be transformed. For each n-gram \(w_1,\ldots,w_n\), the transformed table shall contain:

\[P_{ML}(w_n | w_1,\ldots,w_{n-1}) = \frac{c(w_1,\ldots,w_n)} {\sum c(w_1,\ldots,w_{n-1},\bullet)}\]

The sum in the denominator is over all n-grams sharing the same Prefix \(w_1,\ldots,w_{n-1}\).
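
For each n-gram, the transformed table stores the Prefix, the Suffix (the last word) and the probability Prob computed as above. The transformation can be sketched with data.table; this is an assumption for illustration, the actual implementation is in include/ngram.build.R.

library(data.table)

# Toy frequency table: Prefix = first n-1 words, Suffix = last word.
ngrams <- data.table(
  Prefix = c("at the", "at the", "at the", "in the"),
  Suffix = c("end",    "top",    "beginning", "morning"),
  Freq   = c(30L, 15L, 5L, 7L)
)

# Maximum likelihood estimate of P(Suffix | Prefix): divide each count
# by the total count of all n-grams sharing the same prefix.
ngrams[, Prob := Freq / sum(Freq), by = Prefix]
ngrams[Prefix == "at the"]
# Prob column: 0.6, 0.3, 0.1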

source("include/ngram.build.R")
for (n in 1:5) {
  ngram.extended.cache(n, removeStopwords = FALSE)
  ngram.extended.cache(n, removeStopwords = TRUE)
}

On our hardware (Intel i5-7600K) the commands above run for a bit more than 1 hour and use slightly over 30 GB of memory.

The following table shows the number of rows and the memory usage of the n-gram tables with and without stop words.

        With stop words             Without stop words
N       Rows          Size, MiB     Rows          Size, MiB
1       524,351       42.0          523,757       42.0
2       8,240,587     222.3         11,449,151    295.7
3       25,403,862    972.1         23,256,283    1,063.5
4       38,589,792    2,413.9       22,993,832    1,941.9
5       42,597,132    3,614.9       19,821,557    1,955.6
Total:                7,265.2                     5,298.7

As the table shows, the tables are very large and do not satisfy the requirements for a Shiny application (max. 1 GB). Besides, with very large tables we may expect high latency. In the next sections we will reduce the size of the n-gram tables.

Binary encoding for stems

As we have found in the previous section, our n-gram tables are very large. A very significant part of the memory usage is due to the prefixes. We decided to reduce the memory usage by using a binary encoding for the prefixes.

Although R has the data type raw, which works directly with bytes, its storage is very inefficient, and in our experience it is very inconvenient to use as a column in a data frame. Instead, we decided to encode stems using the R data types integer (4 bytes) and numeric (8 bytes).

Theoretically, Huffman coding could have been used to compress the data and reduce the storage size, but in our case it is not practical: since our prefixes contain up to 4 stems, and we want to encode \(2^{16}\) tokens, even Huffman encoding would not compress each prefix to under 4 bytes, so we would have to use 8-byte storage anyway. To simplify the implementation, we decided on a simple non-compressing encoding instead.

Let us walk through several examples. We will encode the 4-gram “the answer to that”, as well as its 1-, 2- and 3-word prefixes.

First, let us lookup codes for each stem in our 4-gram:

source("include/ngram.encode.R")
dict.stems <- dict.stems.cache()
dict.hash <- dict.hash.cache()
dict.hash[[c("the", "answer", "to", "that")]]
## [1]   2 619   3  11

The corresponding hexadecimal codes are 0x0002, 0x026B, 0x0003, 0x000B.

We start by encoding and decoding the first word, “the”. For a 1-gram the encoded value is simply an integer equal to the stem code, that is, 2.

enc.1 <- tokens.encode.1(c("the"), dict.hash)
enc.1
## [1] 2
tokens.decode.1(enc.1, dict.stems)
## [1] "the"

We continue with the 2-gram “the answer”. Concatenated hexadecimal codes are 0x0002026B. Interpreted as an integer (little-endian), that gives us 1795293696.

enc.2 <- tokens.encode.2(c("the", "answer"), dict.hash)
enc.2
## [1] 1795293696
tokens.decode.2(enc.2, dict.stems)
## [1] "the"    "answer"

Next we encode the 3-gram “the answer to”. We concatenate the hexadecimal codes and append two 0x00 bytes to get 8 bytes in total, obtaining 0x0002026B00030000. Interpreted as a numeric (little-endian), that gives us 1.6305797604007e-311. Note that the decimal notation is not precise enough: to correctly decode the value, one needs the exact binary representation, preserving every single bit.

enc.3 <- tokens.encode.3(c("the", "answer", "to"), dict.hash)
enc.3
## [1] 1.63058e-311
tokens.decode.3(enc.3, dict.stems)
## [1] "the"    "answer" "to"

Finally, we encode the 4-gram “the answer to that”. We concatenate the hexadecimal codes to obtain the 8-byte sequence 0x0002026B0003000B, and interpret it as a numeric (little-endian): 1.0663795695223075e-255. The same note about the decimal notation as for the 3-gram applies here as well.

enc.4 <- tokens.encode.4(c("the", "answer", "to", "that"), dict.hash)
enc.4
## [1] 1.06638e-255
tokens.decode.4(enc.4, dict.stems)
## [1] "the"    "answer" "to"     "that"

We will use the binary encoding of prefixes in the next section to reduce the size of n-gram tables.

Optimizing n-gram tables

In the section Building n-grams we built n-gram tables and found that they are too large. In the previous section, Binary encoding for stems, we developed a method to encode prefixes of n-grams either as integer or as numeric, depending on the size of the prefix.

Let us apply the encoding to prefixes of our n-gram tables. We also use this opportunity to remove the column Prefix from 1-grams since it is always NA anyway.

source("include/ngram.optimize.R")
for (n in 1:5) {
  ngram.optimize.prefix.cache(n, removeStopwords = FALSE)
  ngram.optimize.prefix.cache(n, removeStopwords = TRUE)
}

On our hardware (Intel i5-7600K) the commands above run for a bit less than 4 hours.

The following table shows the size of the n-gram tables with encoded prefixes compared to the original tables. The improvement is more pronounced for larger n. For instance, the size of the 5-gram table with stop words is reduced from 3614.9 MiB to 999.6 MiB, that is, to 27.65% of the former size.

        With stop words                                                       Without stop words
N       Rows          Original, MiB   Optimized, MiB   Opt./Orig., %          Rows          Original, MiB   Optimized, MiB   Opt./Orig., %
1       524,351       42.0            38.0             90.48                  523,757       42.0            38.0             90.48
2       8,240,587     222.3           187.2            84.21                  11,449,151    295.7           248.4            84.00
3       25,403,862    972.1           512.0            52.67                  23,256,283    1,063.5         470.3            44.22
4       38,589,792    2,413.9         909.1            37.66                  22,993,832    1,941.9         550.5            28.35
5       42,597,132    3,614.9         999.6            27.65                  19,821,557    1,955.6         475.6            24.32
Total:                7,265.2         2,645.9          36.42                                5,298.7         1,782.8          33.65

Another opportunity for reducing the required memory is the column Prob, which contains the maximum likelihood estimate of the conditional probability of the Suffix given the Prefix. Instead of storing it as a numeric column (8 bytes per value), we may squeeze it into an integer (4 bytes) using a scaled-log approach. The easiest way to explain this approach is with an example:

\[P_{ML}(\text{the} | \text{at}) = \frac{c(\text{at the})} {\sum c(\text{at } \bullet)} = \frac{2035}{330145} \approx 0.006163958\]
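
Using this probability, the scaled-log encoding can be sketched as follows; the scale factor below is an assumption chosen for illustration, not necessarily the value used by the application.

# Store log(Prob), multiplied by a large constant, as a 4-byte integer.
log.scale <- 1e6                       # assumed scale factor (illustration only)

prob <- 2035 / 330145                  # P_ML("the" | "at") from the example above
encoded <- as.integer(round(log(prob) * log.scale))
encoded                                # roughly -5.09 million

# Decoding restores the probability with a negligible rounding error.
decoded <- exp(encoded / log.scale)
all.equal(decoded, prob, tolerance = 1e-5)
## [1] TRUE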

for (n in 1:5) {
  ngram.optimize.prob.cache(n, removeStopwords = FALSE)
  ngram.optimize.prob.cache(n, removeStopwords = TRUE)
}

On our hardware (Intel i5-7600K) the commands above run for approximately 5 minutes.

The following table shows the size of the n-gram tables with probabilities optimized for low memory consumption, compared to the previous version with encoded prefixes. The size reduction is not as significant as in the previous step. For instance, the total size of n-gram tables with stop words is reduced from 2645.9 MiB to 2205.9 MiB, that is, to 83.37% of the former size.

        With stop words                                                       Without stop words
N       Rows          Original, MiB   Optimized, MiB   Opt./Orig., %          Rows          Original, MiB   Optimized, MiB   Opt./Orig., %
1       524,351       38.0            36.0             94.74                  523,757       38.0            36.0             94.74
2       8,240,587     187.2           155.8            83.23                  11,449,151    248.4           204.7            82.41
3       25,403,862    512.0           415.1            81.07                  23,256,283    470.3           381.6            81.14
4       38,589,792    909.1           761.9            83.81                  22,993,832    550.5           462.7            84.05
5       42,597,132    999.6           837.1            83.74                  19,821,557    475.6           400.0            84.10
Total:                2,645.9         2,205.9          83.37                                1,782.8         1,485.0          83.30

The total memory usage is still above our maximum allowed size of 1 GB. To achieve it, we have to find other ways to reduce the n-gram tables. But before we do that, let us discuss our prediction algorithm and run some tests.

Stupid Backoff algorithm

The prediction algorithm used by our application consists of 2 steps: choosing candidates, and scoring candidates to select the best matches.

We decided to use a very simple selection algorithm: we choose all available candidates, proceeding from 5-grams down to 1-grams and skipping suffixes which we have already collected. More precisely, given the input text \(w_1, w_2,\ldots,w_{n-1}\), we proceed as follows:

Note that we do not stop once we have chosen the required number of candidates, but continue to n-grams of lower order. Suppose that we were asked to predict the 3 most probable words, and there are exactly 3 5-grams with the appropriate prefix. If we stopped choosing candidates at this moment, we might miss a 4-gram which produces a prediction with a higher score than any of the 5-grams.

The scoring algorithm selects the best N matches from all candidates by calculating a numerical score for each candidate and choosing the N candidates with the top scores. The score may be a probability, but it may also be another kind of numerical measure, as is the case for the Stupid Backoff algorithm.

The Stupid Backoff algorithm is a highly efficient scoring algorithm proposed in 2007 by Thorsten Brants et al. (2). On large data sets the algorithm gives scores close to the Kneser-Ney algorithm, but is significantly faster. The Stupid Backoff algorithm returns not probabilities but relative scores of words (they do not sum to 1), which is sufficient for our purposes.

The Stupid Backoff algorithm is described by the following formula:

\[SB(w_n|w_1, w_2,\ldots,w_{n-1}) = \begin{cases} P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1}) \text{ if } P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1}) > 0 \\ \lambda \; SB(w_n|w_2,\ldots,w_{n-1}) \text{ otherwise} \end{cases}\]

where \(P_{ML}(w_n|w_1, w_2,\ldots,w_{n-1})\) is the maximum likelihood estimate of the conditional probability of the word \(w_n\) given the prefix \(w_1, w_2,\ldots,w_{n-1}\). The authors of the Stupid Backoff algorithm recommend using \(\lambda = 0.4\).

In other words, first we attempt to look up the n-gram in the table for the largest available n. If the n-gram is found, then the score of the last word is the maximum likelihood estimate of the conditional probability of the last word given the prefix. Otherwise, we back off (hence the algorithm name) to the table with (n-1)-grams, do the same calculation and multiply the result by \(\lambda = 0.4\). If the shortened prefix is not found either, the recursion goes deeper, terminating at 1-grams.

Note that our n-gram tables already contain the required maximum likelihood estimate (MLE), so the calculation of the score is trivial:
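
A minimal sketch of the score calculation is shown below; the deployed code works on the encoded tables, and \(\lambda\) is applied once per backoff step, which matches the worked example that follows.

# Score of a candidate whose prefix was matched in the table of order n,
# when the highest order attempted was n.max (5 in our application).
sb.score <- function(prob, n, n.max = 5, lambda = 0.4) {
    prob * lambda^(n.max - n)
}

sb.score(0.143, n = 4)    # one backoff step:  0.143 * 0.4   = 0.0572
sb.score(0.0858, n = 3)   # two backoff steps: 0.0858 * 0.16 = 0.013728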

To better understand the algorithm, we will walk through an example. We want to predict the next word after the sequence “you pay to get”. First, we look up all 5-grams whose prefix is “you pay to get”:

Prefix           Suffix      Probability   Score
you pay to get   in          0.50          0.50
you pay to get   published   0.25          0.25
you pay to get   there       0.25          0.25

We add all these matches to the list of candidates. Next, we back off to 4-grams with the prefix “pay to get”; their scores are the probabilities multiplied by \(\lambda = 0.4\):

Prefix       Suffix   Probability   Score
pay to get   in       0.1430        0.05720
pay to get   the      0.1140        0.04560
pay to get   a        0.0857        0.03428
pay to get   my       0.0857        0.03428
pay to get   their    0.0571        0.02284

The first suffix, “in”, is already in our list, since we have already found it in the table with 5-grams. The next suffix, “the”, is not in our list yet, so we add it. We continue the same way for all 4-grams, adding suffixes which are not in our list yet. We then back off to 3-grams with the prefix “to get”; their scores are the probabilities multiplied by \(\lambda^2 = 0.16\):

Prefix   Suffix   Probability   Score
to get   a        0.0858        0.013728
to get   the      0.0700        0.011200
to get   to       0.0508        0.008128
to get   out      0.0338        0.005408
to get   back     0.0334        0.005344

Again, we add to our list the candidates which are not yet present in it. Proceeding the same way down to 1-grams, we obtain the list of top candidates with their scores:

Prefix           Suffix      Probability   Score
you pay to get   in          0.50000       0.5000
you pay to get   published   0.25000       0.2500
you pay to get   there       0.25000       0.2500
pay to get       the         0.11420       0.0457
pay to get       a           0.08571       0.0342
pay to get       my          0.08570       0.0342
pay to get       their       0.05710       0.0228
pay to get       out         0.05710       0.0228
pay to get       inside      0.02850       0.0114
pay to get       past        0.02857       0.0114

Extension of the algorithm

We added to the base algorithm a simple optional extension which allows predicting the word the user is currently typing. As long as the last typed character is not a space, the algorithm assumes that the user is continuing to type the current word, and predicts it. Only after a space character is typed does the algorithm predict the next word.

The extension integrates easily into the step where we choose candidate words. Given the prefix \(w_1,w_2,\ldots,w_{n-1}\) and a partially typed word \(w^\prime_n\), we choose all 5-gram candidates which start with the prefix \(w_1,w_2,\ldots,w_{n-1}\) and whose 5th word starts with the word prefix \(w^\prime_n\), and similarly for 4- to 1-grams.
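
A sketch of this additional filter, assuming the candidate suffixes are available as plain character strings (the deployed application operates on the encoded tables, so this is a simplification):

# Keep only candidate suffixes which start with the partially typed word.
candidates <- c("machine", "marine", "market", "mission")
typed <- "ma"
candidates[startsWith(candidates, typed)]
## [1] "machine" "marine"  "market"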

Some words may be missing from our dictionary, and the user may misspell some words. If we cannot find any candidates using our n-grams, we attempt to predict the next word from the partially entered word by applying a spelling-correction algorithm provided by the R package hunspell. For example, if the user enters the misspelled word “mashine”, we propose the spell-corrected word “machine”.
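
The corresponding hunspell calls look roughly like this (the suggestions and their order depend on the installed dictionary):

library(hunspell)

hunspell_check("mashine")
## [1] FALSE
hunspell_suggest("mashine")
# A list with suggested corrections such as "machine".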

Testing algorithm

Now we are ready to test our algorithm for predicting the next word, using the 20% of the data we set aside for testing earlier. We run tests for each data source, as well as for the aggregated data.

source("include/predict.test.R")

for (source in c("blogs", "news", "twitter", "all")) {
    set.seed(12345)
    predict.test.sb.cache(source, "testing", n.samples = 100000,
                          threshold = 1, removeStopwords = TRUE)
    
    set.seed(12345)
    predict.test.sb.cache(source, "testing", n.samples = 100000,
                          threshold = 1, removeStopwords = FALSE)
}

The meaning of the parameter “threshold” will be explained later, when we optimize the meta-parameters of the prediction algorithm. For now it is enough to say that “threshold = 1” uses the complete n-gram tables we have built before.

On our hardware (Intel i5-7600K) the commands above run for a bit more than 7 hours.

The chart below shows the precision of our prediction algorithm for the aggregated data sources, as well as separately for blogs, news and Twitter, with and without stop words. The first chart column shows the percentage of correct predictions (our top candidate is exactly the word present in the original text). The second and third chart columns show the percentage of cases when the right word was among our Top-3 or Top-5 predicted candidates.

Prediction precision, with stop words

             Word correctly predicted       Word in Top 3                  Word in Top 5
Source       Mean, %   Conf. Int. 95%       Mean, %   Conf. Int. 95%       Mean, %   Conf. Int. 95%
Aggregated   16.45     (16.20, 16.70)       26.34     (26.06, 26.62)       31.41     (31.11, 31.70)
Blogs        15.40     (15.19, 15.62)       25.58     (25.30, 25.86)       30.80     (30.53, 31.08)
News         17.51     (17.28, 17.74)       27.50     (27.22, 27.78)       32.56     (32.27, 32.85)
Twitter      15.69     (15.42, 15.95)       25.35     (25.06, 25.63)       30.19     (29.89, 30.49)

Prediction precision, without stop words

             Word correctly predicted       Word in Top 3                  Word in Top 5
Source       Mean, %   Conf. Int. 95%       Mean, %   Conf. Int. 95%       Mean, %   Conf. Int. 95%
Aggregated   4.00      (3.88, 4.13)         6.13      (5.98, 6.29)         7.29      (7.12, 7.46)
Blogs        2.73      (2.61, 2.85)         4.45      (4.31, 4.59)         5.44      (5.28, 5.60)
News         5.16      (5.02, 5.31)         7.60      (7.43, 7.77)         8.87      (8.69, 9.04)
Twitter      3.92      (3.78, 4.06)         6.10      (5.93, 6.27)         7.31      (7.13, 7.49)

The first important insight from the chart and tables above is that the prediction based on n-grams with stop words outperforms the same algorithm based on n-grams without stop words by more than 10 percentage points. From now on we continue development and tuning only with the n-grams that include stop words.

It is interesting to note that our algorithm predicts the next word best for sentences from the news source. Blogs and Twitter are very close, with Twitter slightly better for the exact match and slightly worse for the Top-3 and Top-5 ratings. The likely explanation is that news texts tend to reuse the same expressions more often than the less formal blogs and tweets.

Now that we know that our algorithm generally works, it is time to optimize it.

Optimizing meta-parameters

Until now we have used all collected n-grams to predict the next word using the Stupid Backoff algorithm. As we have calculated, even after all previous optimizations our n-gram tables are too large to be used in a Shiny application. Could we remove some n-grams without losing much prediction power? We are going to answer the question in this section.

Our n-gram tables contain all collected n-grams, but are all of them equally useful for our prediction algorithm? For example, the 5-gram “at the end of the” appears in our corpora 2181 times, whereas the 5-gram “can’t go anywhere with them” appears only once. Intuitively the first n-gram is much more useful, because the probability of encountering it again in other texts is much higher than the probability of finding a rare n-gram in another text.

In the Milestone Report we also found that rare n-grams comprise a very large part of the tables for large n. For example, there are 524,351 distinct 1-grams, of which 264,033 (50.35%) appear just once. For 5-grams it looks even more dramatic: our corpora contain 42,597,132 distinct 5-grams, of which 40,730,276 (95.62%) appear just once. Removing from our tables the entries which appear just once could significantly reduce the memory usage.

We say that an n-gram table has threshold \(m\) if we keep in the table only entries which appear in the corpus at least \(m\) times and remove the rest. For example, an n-gram table with threshold 3 contains only n-grams which appear at least 3 times. Obviously, the complete table with all available n-grams is a table with threshold 1.
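
Applying a threshold is then a simple filter on the frequency column; a sketch assuming the tables are stored as data.table objects (an assumption, the project scripts may use a different representation):

library(data.table)

# Keep only n-grams which appear in the corpus at least `threshold` times.
apply.threshold <- function(ngram.table, threshold) {
    ngram.table[Freq >= threshold]
}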

To analyze how removing low-frequency n-grams affects the precision of our prediction algorithm, we build n-gram tables with thresholds 2 to 6 and run our prediction algorithm with these tables. Our approach is the same as in the previous section: for each source and threshold we process 100 batches of 1,000 samples each.

source("include/predict.test.R")

for (source in c("blogs", "news", "twitter", "all")) {
    for (thres in 2:6) {
        set.seed(12345)
        predict.test.sb.cache(source, "testing", n.samples = 100000,
                              threshold = thres, removeStopwords = TRUE)
    
        set.seed(12345)
        predict.test.sb.cache(source, "testing", n.samples = 100000,
                              threshold = thres, removeStopwords = FALSE)
    }
}

On our hardware (Intel i5-7600K) the commands above run for a bit more than 32 hours.

The chart and table below show how the prediction quality changes as we increase the threshold.

Prediction precision (%) vs. threshold

                    Word correctly predicted           Word in Top 3                      Word in Top 5
Threshold           Aggr.   Blogs   News    Twitter    Aggr.   Blogs   News    Twitter    Aggr.   Blogs   News    Twitter
1                   16.45   15.40   17.51   15.69      26.34   25.58   27.50   30.19      31.41   30.80   32.56   30.19
2                   16.95   16.42   17.66   16.13      26.98   26.74   27.73   30.69      31.89   31.95   32.49   30.69
3                   16.63   16.25   17.31   15.68      26.52   26.45   27.13   30.05      31.35   31.59   31.79   30.05
4                   16.24   16.06   16.85   15.27      25.99   26.07   26.57   29.54      30.81   31.17   31.20   29.54
5                   15.97   15.82   16.51   14.99      25.70   25.80   26.19   29.14      30.48   30.83   30.78   29.14
6                   15.73   15.63   16.22   14.73      25.30   25.51   25.82   28.82      30.09   30.53   30.38   28.82
Thres. 6 vs. best   -1.22   -0.79   -1.43   -1.41      -1.67   -1.22   -1.91   -1.87      -1.80   -1.42   -2.17   -1.87

The table above shows how the precision of our prediction algorithm changes depending on the threshold. In each column the highest value, that is the best prediction precision, is achieved with threshold 2. The last line compares the precision with threshold 6 against the best precision in the same column.

From the chart and the table we may conclude the following:

Now let us check the size of the n-gram tables in memory for various thresholds. The following chart and table show the impact on table sizes of gradually increasing the threshold from 1 (include all n-grams) to 6 (include only n-grams which appear in the corpora at least 6 times).

Size of n-gram tables in memory

            1-grams                2-grams                3-grams                4-grams                5-grams                Total
Threshold   Size, MiB   % of Max   Size, MiB   % of Max   Size, MiB   % of Max   Size, MiB   % of Max   Size, MiB   % of Max   Size, MiB   % of Max
1           36.0        100.00     155.8       100.00     415.1       100.00     761.9       100.00     837.1       100.00     2205.9      100.00
2           17.7        49.17      49.2        31.58      74.8        18.02      70.0        9.19       38.1        4.55       249.8       11.32
3           13.1        36.39      30.7        19.70      39.0        9.40       30.2        3.96       13.6        1.62       126.6       5.74
4           10.9        30.28      22.9        14.70      26.3        6.34       18.5        2.43       7.7         0.92       86.3        3.91
5           9.5         26.39      18.5        11.87      19.8        4.77       13.1        1.72       5.1         0.61       66.0        2.99
6           8.5         23.61      15.7        10.08      15.8        3.81       10.0        1.31       3.8         0.45       53.8        2.44

As the chart and the table demonstrate, increasing the threshold dramatically reduces the required memory. As we increase the threshold from 1 to 6, the size reduction is smaller for 1-grams (from 36 MiB to 8.5 MiB, that is, to 23.61% of the original), but quite spectacular for 5-grams (from 837.1 MiB to 3.8 MiB, that is, to 0.45% of the original). The total required memory goes down from 2205.9 MiB to 53.8 MiB, that is, to 2.44% of the original.

If we use threshold 6, that is, include only n-grams which appear in the corpora at least 6 times, the precision of our prediction is reduced by approximately 1.5 percentage points, but the required memory falls to just 2.44% of the original, which is acceptable for a Shiny application. This looks like a good deal, so we decided to use the n-gram tables with threshold 6.

Algorithm speed

A note on the speed of the algorithm. For each value of the threshold we processed 100,000 samples for each of the sources (blogs, news and Twitter), as well as an additional 100,000 samples aggregated from all sources, that is 400,000 samples in total.

Average duration of prediction vs. threshold

Threshold   Duration, hh:mm:ss   No. of samples   Duration per sample, ms
1           03:49:52             400,000          34.5
2           02:42:17             400,000          24.3
3           02:35:16             400,000          23.3
4           02:22:02             400,000          21.3
5           02:13:27             400,000          20.0
6           02:04:35             400,000          18.7

As the chart and table show, increasing the threshold from 1 to 6 reduces the average prediction duration from 34.5 ms to 18.7 ms, or by 45.80%.

Validation

As the last step, we validate our prediction algorithm on out-of-sample data, that is, on the data set we set aside for validation earlier. Our testing approach is the same as before: 100 batches of 1,000 samples each for each corpus, as well as for the aggregated corpora.

source("include/predict.test.R")

for (source in c("blogs", "news", "twitter", "all")) {
    set.seed(12345)
    predict.test.sb.cache(source, "validation", n.samples = 100000,
                          threshold = 6, removeStopwords = TRUE)
    
    set.seed(12345)
    predict.test.sb.cache(source, "validation", n.samples = 100000,
                          threshold = 6, removeStopwords = FALSE)
}

On our hardware (Intel i5-7600K) the commands above run for approximately 2.5 hours.

Prediction precision (validation)

             Word correctly predicted       Word in Top 3                  Word in Top 5
Source       Mean, %   Conf. Int. 95%       Mean, %   Conf. Int. 95%       Mean, %   Conf. Int. 95%
Aggregated   15.40     (15.15, 15.65)       25.00     (24.73, 25.26)       29.86     (29.58, 30.14)
Blogs        15.42     (15.19, 15.66)       25.43     (25.13, 25.73)       30.50     (30.20, 30.81)
News         16.16     (15.91, 16.41)       25.67     (25.41, 25.94)       30.28     (30.00, 30.55)
Twitter      14.77     (14.55, 14.98)       24.18     (23.95, 24.41)       29.06     (28.81, 29.31)

The out-of-sample validation results are very close to those observed before, indicating that the model is not overfitted.

Application

A Shiny application using our prediction algorithm is available online: https://serdioa.shinyapps.io/sb-predict.

The application GUI provides the following elements:

References

(1) HC Corpora provided by corpora.epizy.com. About the corpora. Download the corpora.

(2) Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, Jeffrey Dean. 2007. Large Language Models in Machine Translation. https://www.aclweb.org/anthology/D07-1090.