This is a milestone report for Week 2 of the capstone project of the Data Science Specialization offered on Coursera by Johns Hopkins University.
The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, in mobile devices to provide suggestions as the user types in some text.
In this report we provide an initial analysis of the data and discuss our approach to building the application.
An important question is which library to use for processing and analyzing the corpora, as R provides several alternatives. Initially we attempted to use the library tm, but quickly found that it is very memory-hungry, and building bi- or trigrams for a large corpus with it is not practical. After some googling we decided to use the library quanteda instead.
We start by loading required libraries.
library(data.table) # For fast access in data tables.
library(ggplot2) # For plotting charts.
library(ggforce) # For plotting charts.
library(grid) # For arranging charts in a grid.
library(gridExtra) # For arranging charts in a grid.
library(kableExtra) # For pretty-printing tables.
library(parallel) # For parallel processing.
library(quanteda) # For handling the corpora.
library(readr) # For fast reading/writing.
library(R.utils) # For counting lines in files.
library(stringr) # For operations on strings.
library(tidyverse) # For cleaning up and faster modifications of data tables.
To speed up processing of large data sets, we will use the parallel version of the lapply function from the library parallel. To use all the available resources, we detect the number of CPU cores and configure the library to use them all.
cpu.cores <- detectCores()
options(mc.cores = cpu.cores)
Here and in several places below we use caching to speed up rendering of this document. Results of long-running operations are stored and reused during the next run. If you wish to re-run all operations, just remove the cache directory.
if (!dir.exists("cache")) {
dir.create("cache")
}
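As an illustration of this idea, here is a minimal sketch of the caching pattern (the helper name cached and the file names are hypothetical; the actual chunks below may rely on knitr's chunk caching or explicit intermediate files instead):
# Hypothetical helper illustrating the caching pattern used in this report.
cached <- function(path, expr) {
  if (file.exists(path)) {
    # Reuse the result stored during a previous run.
    readr::read_rds(path)
  } else {
    # Compute the result, store it for the next run, and return it.
    # `expr` is evaluated lazily, so the computation is skipped entirely
    # when a cached file already exists.
    result <- expr
    readr::write_rds(result, path)
    result
  }
}
# Example: freq <- cached("cache/some_result.rds", some_long_computation())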
We download the data from the URL provided in the course description, and unzip it.
if (!file.exists("cache/Coursera-SwiftKey.zip")) {
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "cache/Coursera-SwiftKey.zip", method = "curl")
unzip("cache/Coursera-SwiftKey.zip", exdir = "cache")
}
The downloaded zip file contains corpora in several languages: English, German, Russian and Finnish. In our project we will use only English corpora.
The corpora in each language, including English, contain 3 files with content obtained from different sources: news, blogs and Twitter.
As the first step, we will split each relevant file into 3 parts: a training set (60% of the lines), a testing set (20%) and a validation set (20%).
We define a function which splits the specified file into the parts described above:
# Arguments:
# name - the file to split
# out.dir - output directory
splitFile <- function(name, out.dir) {
# Reading dataset from the input file.
data <- read_lines(name)
# Prepare list with indexes of all data items.
data.index <- 1:length(data)
# Sample indices for the training data set, and create a set with remaining
# indices.
training.index <- sample(data.index, 0.6 * length(data.index))
remaining.index <- data.index[! data.index %in% training.index]
# Sample indices for the testing data set, and use remaining indices
# for a validation data set.
testing.index <- sample(remaining.index, 0.5 * length(remaining.index))
validation.index <- remaining.index[! remaining.index %in% testing.index]
# Split the data.
data.training <- data[training.index]
data.testing <- data[testing.index]
data.validation <- data[validation.index]
# Create an output directory, if it does not exist.
if (!dir.exists(out.dir)) {
dir.create(out.dir)
}
# Prepare names for output files. We append suffixes "training", "testing"
# and "validation" to the input file name before the extension.
base <- basename(name)
outTraining <- file.path(out.dir, sub("(.)\\.[^.]+$", "\\1.training.txt", base))
outTesting <- file.path(out.dir, sub("(.)\\.[^.]+$", "\\1.testing.txt", base))
outValidation <- file.path(out.dir, sub("(.)\\.[^.]+$", "\\1.validation.txt", base))
# Writing datasets to output files.
write_lines(data.training, outTraining)
write_lines(data.testing, outTesting)
write_lines(data.validation, outValidation)
}
To make the results reproducible, we set the seed of the random number generator.
set.seed(20190530)
Finally, we split each of the data files.
splitFile("cache/final/en_US/en_US.blogs.txt", "cache")
splitFile("cache/final/en_US/en_US.news.txt", "cache")
splitFile("cache/final/en_US/en_US.twitter.txt", "cache")
As a sanity check, we count the number of lines in each source file, as well as in the partial files produced by the split.
count.blogs <- R.utils::countLines("cache/final/en_US/en_US.blogs.txt")
count.blogs.training <- R.utils::countLines("cache/en_US.blogs.training.txt")
count.blogs.testing <- R.utils::countLines("cache/en_US.blogs.testing.txt")
count.blogs.validation <- R.utils::countLines("cache/en_US.blogs.validation.txt")
count.news <- R.utils::countLines("cache/final/en_US/en_US.news.txt")
count.news.training <- R.utils::countLines("cache/en_US.news.training.txt")
count.news.testing <- R.utils::countLines("cache/en_US.news.testing.txt")
count.news.validation <- R.utils::countLines("cache/en_US.news.validation.txt")
count.twitter <- R.utils::countLines("cache/final/en_US/en_US.twitter.txt")
count.twitter.training <- R.utils::countLines("cache/en_US.twitter.training.txt")
count.twitter.testing <- R.utils::countLines("cache/en_US.twitter.testing.txt")
count.twitter.validation <- R.utils::countLines("cache/en_US.twitter.validation.txt")
 | Blogs: Rows | Blogs: % | News: Rows | News: % | Twitter: Rows | Twitter: % |
---|---|---|---|---|---|---|
Training | 539572 | 59.99991 | 606145 | 59.99998 | 1416088 | 59.99997 |
Testing | 179858 | 20.00004 | 202048 | 19.99996 | 472030 | 20.00002 |
Validation | 179858 | 20.00004 | 202049 | 20.00006 | 472030 | 20.00002 |
Total | 899288 | 100.00000 | 1010242 | 100.00000 | 2360148 | 100.00000 |
Control (expected to be 0) | 0 | NA | 0 | NA | 0 | NA |
As the table shows, we have split the data into subsets as intended.
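For the blogs source, for example, the percentages and the control value in the table can be recomputed from the line counts above (a minimal sketch; the table aggregates the same checks for all three sources):
# Shares of the blogs parts in percent, and the control value (expected to be 0).
blogs.parts <- c(training = count.blogs.training,
                 testing = count.blogs.testing,
                 validation = count.blogs.validation)
round(100 * blogs.parts / count.blogs, 5)
count.blogs - sum(blogs.parts)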
In the section above we have already counted the number of lines. Let us load the training data sets and take a look at the first 3 lines of each.
blogs.text <- read_lines("cache/en_US.blogs.training.txt")
news.text <- read_lines("cache/en_US.news.training.txt")
twitter.text <- read_lines("cache/en_US.twitter.training.txt")
head(blogs.text, 3)
## [1] "a. By “your local” I mean whatever’s local to you – you can decide whether that means Eugene, the UO, your hometown, where you want to work … whatever.)"
## [2] "And told me who you are"
## [3] "Between April 5, 2006 and December 31, 2006, Murphy made no fewer than 18 factually inaccurate statements in her TV commentary about the lacrosse case. She made at least eight more factually inaccurate statements about the case in December 21, 2006 and January 9, 2007 “talking points” forwarded by “victims’ rights” groups, plus at least one factual error in a late 2006 USA Today op-ed. Twenty-seven outright errors of fact on a single case is quite a tally. And that list, of course, doesn’t include Murphy’s misleading statements that were phrased in the form of questions or speculation, or her use of unsubstantiated rumors."
head(news.text, 3)
## [1] "4. earthquakes"
## [2] "He had a toy gun and holster, a cowboy shirt and a 10-gallon, er, a 10-pint hat."
## [3] "We'll only know for sure if Cutler himself stays upright."
head(twitter.text, 3)
## [1] "no more mike? Fox dropped mike for him? I think im gonna start watching 9news now."
## [2] "Here . I Lovee You !"
## [3] "About to get an exam on my shoulder and then some A.R.T."
We can see that the data contain not only words, but also numbers and punctuation. The punctuation may be non-ASCII (Unicode), as the first example in the blogs sample shows: it contains the character “…”, which is different from three ASCII period characters “...”. Some lines may contain multiple sentences, and we probably have to take this into account.
Here is our plan:
We decided to split the text into sentences and not to attempt to predict words across sentence borders. We may still use information about sentence boundaries to improve prediction of the first word, because the frequency of a word at the beginning of a sentence may be very different from its average frequency.
blogs.text <- unlist(tokenizers::tokenize_sentences(blogs.text))
news.text <- unlist(tokenizers::tokenize_sentences(news.text))
twitter.text <- unlist(tokenizers::tokenize_sentences(twitter.text))
The libraries contain some functions for cleaning up and pre-processing, but for some steps we have to write our own functions.
# Remove URLs. The regular expression detects http(s) and ftp(s) protocols.
removeUrl <- function(x) gsub("(ht|f)tp(s?)://\\S+", "", x)
# Remove e-mail addresses.
# The regular expression from Stack Overflow:
# https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression
removeEmail <- function(x) gsub("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])", "", x, perl = TRUE)
# Remove hash tags (the character # and the following word) and twitter handles
# (the character @ and the following word).
removeTagsAndHandles <- function(x) gsub("[@#]\\S+", "", x)
# Replace punctuation marks which do not appear inside a word with space
# characters. Without this step, fragments with a missing space are transformed
# into a single non-existing word when punctuation is removed.
# Example: the corpus contains
# "I had the best day yesterday,it was like two kids in a candy store"
# Without this step, "yesterday,it" is transformed into the non-existing word
# "yesterdayit" when removing punctuation. This step transforms it into
# "yesterday it".
addMissingSpace <- function(x) gsub("[,()\":;”…]", " ", x)
# Replace words in a sentence with replacements available from the table.
# Keep words which are not in the replacement table "as is".
# As a side effect, removes punctuation and transforms to a lower case.
#
# This step is required for several purposes:
# * Replace common short forms with full forms, for example "he'll" = "he will"
replacements.text <- readr::read_csv("replacements.txt",
col_names = c("token", "replacement"),
col_types = list(col_character(), col_character()))
replaceWords <- function(text, replacements) {
# Split text on words.
tokens.orig <- tokenizers::tokenize_words(text, simplify = TRUE,
strip_numeric = TRUE)
# Attempt to replace each word.
tokens.replaced <- sapply(tokens.orig, function(x) {
# Check whether a replacement exists.
replacement.index <- match(x, replacements$token)
if (is.na(replacement.index)) {
# Can't find a replacement, fall back on the token itself.
return (x)
} else {
# Replace the token.
return (replacements$replacement[replacement.index])
}
}, USE.NAMES = FALSE)
paste(tokens.replaced, collapse = " ")
}
# Add tokens representing start and end of a sentence.
# SOS = Start Of Sentence
# EOS = End Of Sentence
# When we add these tokens, the text has already been transformed to lower
# case, so we can easily distinguish upper-case special tokens from the
# lower-case text.
addSentenceTokens <- function(x) paste("SOS", x, "EOS")
# Collapse space characters: if there are more than 1 space character in a row,
# replace with a single one.
collapseWhitespace <- function(x) gsub("\\s+", " ", x)
# ... And now combine all functions in a pre-processing chain.
preProcessText <- function(x) {
text <- removeUrl(x)
text <- removeEmail(text)
text <- removeTagsAndHandles(text)
text <- addMissingSpace(text)
text <- replaceWords(text, replacements.text)
text <- addSentenceTokens(text)
collapseWhitespace(text)
}
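To see the whole chain in action, we can feed it a made-up sentence. The output shown in the comment is only an expectation, since the actual result depends on the contents of replacements.txt:
# Illustrative run of the pre-processing chain on an invented sentence.
preProcessText("Check out https://example.com, you'll love it!")
# Expected to look roughly like: "SOS check out you will love it EOS"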
Now we pre-process the data.
blogs.text.preprocessed <- unlist(mclapply(blogs.text, preProcessText))
news.text.preprocessed <- unlist(mclapply(news.text, preProcessText))
twitter.text.preprocessed <- unlist(mclapply(twitter.text, preProcessText))
In this section we will study the distribution of words in the corpora, ignoring for the moment interactions between words (n-grams).
We define two helper functions. The first one creates a Document Feature Matrix (DFM) for n-grams in documents and aggregates it over all documents into a Feature Vector. The second enriches the Feature Vector with additional values useful for our analysis, such as the cumulative coverage of the text.
# Calculate Document Feature Matrix (DFM) for n-grams in documents,
# and aggregate it over all documents to a Feature Vector.
build.ngram <- function(text, n, min_freq = 1, stop_words = NULL) {
# Split text on 1-grams.
text.tokens <- tokens(text)
# Remove stop-words, if required.
if (!is.null(stop_words)) {
text.tokens <- tokens_remove(text.tokens, stop_words)
}
# Stem words. We use explicit stemming to make sure that it is fully
# compatible with other places later in the code where we have to apply
# the stemming manually.
text.tokens <- as.tokens(
mclapply(text.tokens, function(x) SnowballC::wordStem(x, language = "en")))
# Create n-grams, if n > 1
if (n > 1) {
text.tokens <- tokens_ngrams(text.tokens, n = n, concatenator = " ")
}
# Special case: if our corpus contains empty sentences, then the 2-grams
# contain "SOS EOS", that is, sequences of "Start-Of-Sentence" +
# "End-Of-Sentence" tokens. This may happen, for example, if a sentence
# contains only stop words, or in some weird cases like a tweet that contains
# only a time such as "8:12". We are not interested in empty sequences, so we
# remove such tokens.
if (n == 2) {
text.tokens <- tokens_remove(text.tokens, c("SOS EOS"))
}
# Calculate the Document Feature Matrix
text.dfm <- dfm(text.tokens, tolower = FALSE)
# Remove from DFM least frequent features, if requested.
if (min_freq > 1) {
text.dfm <- dfm_trim(text.dfm, min_termfreq = min_freq)
}
# Sum over all documents.
colSums(text.dfm)
}
# Sorts a Feature Vector in descending order of frequency and enriches it with
# additional columns:
# * Cumulative frequency of terms (words or n-grams)
# * Cumulative frequency as a percentage of the total.
enrich.ngram <- function(fv) {
# Transform Feature Vector to a table and sort by frequency descending.
tbl <- data.table(Terms = names(fv), Freq = fv)
tbl <- tbl[order(-Freq)]
# Add columns with cumulative frequency as a number of words, and as percentage.
tbl$Freq.Cum <- cumsum(tbl$Freq)
tbl$Freq.Cum.Pct <- tbl$Freq.Cum / sum(tbl$Freq)
return (tbl)
}
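Before applying the helpers to the real corpora, a tiny made-up example may help to see what the enriched table looks like:
# Toy example (invented sentences). With n = 1 and no stop-word removal,
# the most frequent term is "the", and Freq.Cum.Pct shows which share of all
# tokens is covered by the terms seen so far.
toy.text <- c("SOS the cat sat on the mat EOS", "SOS the dog sat EOS")
head(enrich.ngram(build.ngram(toy.text, 1)))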
Now we can calculate the frequency of words in each source, as well as in all sources together (aggregated).
# Define stop-words for 1-grams: standard stop words, as well as our special
# tokens "Start-Of-Sentence" and "End-Of-Sentence".
stopwords.tokens <- c(stopwords(), "SOS", "EOS")
# Calculate the frequency of words in each source, as well as aggregated over all sources.
blogs.1gram.freq <- enrich.ngram(build.ngram(blogs.text.preprocessed, 1,
stop_words = stopwords.tokens))
news.1gram.freq <- enrich.ngram(build.ngram(news.text.preprocessed, 1,
stop_words = stopwords.tokens))
twitter.1gram.freq <- enrich.ngram(build.ngram(twitter.text.preprocessed, 1,
stop_words = stopwords.tokens))
all.text.preprocessed <- c(blogs.text.preprocessed, news.text.preprocessed,
twitter.text.preprocessed)
all.1gram.freq <- enrich.ngram(build.ngram(all.text.preprocessed, 1,
stop_words = stopwords.tokens))
The following chart displays the 20 most frequent words in each source, as well as in the aggregated corpora.
As we see from the chart, the top-20 most frequent words differ between sources. For example, the most frequent word in news is “said”, but this word is not included in the top-20 lists for blogs and Twitter at all. At the same time, some words are shared between the lists: the word “can” is the 2nd most frequent in blogs, the 3rd most frequent in Twitter, and the 5th in news.
Our next step is to analyze the intersection, that is, to find how many words are common to all sources and how many are unique to a particular source. Not only the number of words is important, but also the source coverage, that is, what percentage of the whole text of a particular source is covered by a particular subset of words.
The following Venn diagram shows the number of unique words (stems) used in each source, as well as the percentage of the aggregated corpora covered by those words.
As we can see, 46686 words are shared by all 3 corpora, but those words cover 97.46% of the aggregated corpora. On the other hand, there are 83185 words unique to blogs, but these words appear very infrequently, covering just 0.43% of the aggregated corpora.
The Venn diagram indicates that we may get a high coverage of all corpora by choosing common words. Coverage by words specific to a particular corpus is negligible.
The next step in our analysis is to find out how many common words we should choose to achieve a decent coverage of the text. From the Venn diagram we already know that by choosing 46686 words we cover 97.46% of the aggregated corpora, but perhaps we can reduce the number of words without significantly reducing the coverage.
The following chart shows the number of unique words in each source which cover a particular percentage of the text. For example, the 1000 most frequent words cover 68.09% of the Twitter corpus. An interesting observation is that Twitter requires fewer words to cover a particular percentage of the text, whereas news requires more.
Corpora Coverage | Blogs | News | Twitter | Aggregated |
---|---|---|---|---|
75% | 2,004 | 2,171 | 1,539 | 2,136 |
90% | 6,395 | 6,718 | 5,325 | 6,941 |
95% | 13,369 | 13,689 | 11,922 | 15,002 |
99% | 63,110 | 53,294 | 71,575 | 88,267 |
99.9% | 149,650 | 126,585 | 161,873 | 302,693 |
The table shows that in order to cover 95% of blogs, we require 13,369 words. The same coverage of news requires 13,689 words, and of Twitter 11,922 words. To cover 95% of the aggregated corpora, we require 15,002 unique words. We may use this fact later to reduce the number of n-grams required for prediction.
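The counts in the table can be recovered directly from the enriched frequency tables. For instance, the number of unique stems needed to cover 95% of the aggregated corpus (a sketch, which may differ by one from the table depending on boundary handling):
# Number of most frequent word stems whose cumulative share stays within 95%.
sum(all.1gram.freq$Freq.Cum.Pct <= 0.95)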
In this section we will study the distribution of bigrams, that is, combinations of two words.
Using the previously defined functions, we can calculate the frequency of bigrams in each source, as well as in all sources together (aggregated).
# Calculate the frequency of bigrams in each source, as well as aggregated over all sources.
blogs.2gram.freq <- enrich.ngram(build.ngram(blogs.text.preprocessed, 2,
stop_words = stopwords.tokens))
news.2gram.freq <- enrich.ngram(build.ngram(news.text.preprocessed, 2,
stop_words = stopwords.tokens))
twitter.2gram.freq <- enrich.ngram(build.ngram(twitter.text.preprocessed, 2,
stop_words = stopwords.tokens))
all.text.preprocessed <- c(blogs.text.preprocessed, news.text.preprocessed,
twitter.text.preprocessed)
all.2gram.freq <- enrich.ngram(build.ngram(all.text.preprocessed, 2,
stop_words = stopwords.tokens))
The following chart displays the 20 most frequent bigrams in each source, as well as in the aggregated corpora.
We immediately see a difference from the lists of top-20 words: there were many more words shared between sources than there are shared bigrams. There are still some common bigrams, but the intersection is smaller.
Similar to how we proceeded with words, we now analyze the intersections, that is, we find how many bigrams are common to all sources and how many are unique to a particular source. We also calculate the percentage of each source covered by a particular subset of bigrams.
The following Venn diagram shows the number of unique bigrams used in each source, as well as the percentage of the aggregated corpora covered by those bigrams.
The difference between words and bigrams is even more pronounced here. Bigrams common to all sources cover just 46.23% of the text, compared to more than 95% covered by words common to all sources.
The next step in our analysis is to find out how many common bigrams we should choose to achieve a decent coverage of the text.
The following chart shows a number of unique bigrams in each source which cover particular percentage of the text. For example, 1000 most-frequent bigrams cover 8.66% of the Twitter corpus.
Corpora Coverage | Blogs | News | Twitter | Aggregated |
---|---|---|---|---|
75% | 1,945,493 | 1,810,320 | 1,154,697 | 2,697,841 |
90% | 3,449,146 | 3,393,854 | 2,329,516 | 6,772,302 |
95% | 3,950,364 | 3,921,699 | 2,721,122 | 8,192,971 |
99% | 4,351,338 | 4,343,975 | 3,034,407 | 9,329,506 |
99.9% | 4,441,557 | 4,438,987 | 3,104,896 | 9,585,226 |
The table shows that in order to cover 95% of blogs, we require 3,950,364 bigrams. The same coverage of news requires 3,921,699 bigrams, and of Twitter 2,721,122 bigrams. To cover 95% of the aggregated corpora, we require 8,192,971 bigrams.
The chart is also very different from the corresponding chart for words. The curve for words had an “S”-shape, that is, its growth slowed down after some number of words, so that adding more words results in diminishing returns. For bigrams, there is no point of diminishing returns: the curves just keep rising.
As we found in the section Analyzing words (1-grams), our corpora contain \(N_1=\) 335,906 unique word stems. Potentially there could be \(N_1^2=\) 112,832,840,836 bigrams, but we have observed only \(N_2=\) 9,613,640, that is, 0.0085% of all possible bigrams. Still, the number of observed bigrams is pretty large. In the section Analyzing words (1-grams) we also found that we can cover a large part of the corpus with a relatively small number of unique word stems. In the next section we will see whether we can reduce the number of unique 2-grams by utilizing that knowledge.
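The quoted numbers follow directly from the sizes of the frequency tables, assuming that each row corresponds to one unique stem or bigram (a sketch):
n1 <- nrow(all.1gram.freq)  # Unique word stems.
n2 <- nrow(all.2gram.freq)  # Observed unique bigrams.
c(possible = n1^2, observed = n2, observed.pct = 100 * n2 / n1^2)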
We have found in the section Analyzing words (1-grams) that our corpora contain \(N_1=\) 335,906 unique word stems, but just 15,002 of them cover 95% of the corpus. In this section we will analyze whether we can reduce the number of bigrams by utilizing that knowledge.
We will replace rare words with a special token UNK. This will reduce the number of bigrams, because different word sequences may now produce the same bigram if those sequences contain rare words. For example, our word list contains the names “Andrei”, “Charley” and “Fabio”, but these words do not belong to the subset of most common words required to achieve 95% coverage of the corpus. If our corpus contains the bigrams “Andrei told”, “Charley told” and “Fabio told”, we will replace them all with the bigram “UNK told”.
Since we will apply the same approach to 3-grams, 4-grams and so on, to save time we prune the corpora once and save the results to files which we can load later.
We start by defining a function that accepts a sequence of words, a white-list and a replacement token. All words in the sentence which are not included in the white-list are replaced by the token.
# Replace words not in the provided white-list with a replacement token.
replaceWordsNotIn <- function(text, whitelist, replacement) {
# Split text on words.
tokens.orig <- unlist(tokens(text))
# Stem words.
tokens.stem <- SnowballC::wordStem(tokens.orig, language = "en")
# Check each stem against a whitelist.
tokens.replaced <- mapply(function(word.orig, word.stem) {
# Check whether the stem is in the whitelist.
replacement.index <- match(word.stem, whitelist)
if (is.na(replacement.index)) {
# The word is not in the whitelist: replace it.
return (replacement)
} else {
# The word in the whitelist: use the original.
return (word.orig)
}
}, tokens.orig, tokens.stem, SIMPLIFY = TRUE, USE.NAMES = FALSE)
paste(tokens.replaced, collapse = " ")
}
Now we create a white-list that contains the most common words covering 95% of the aggregated corpus, all stop words, and the special tokens SOS and EOS:
# Calculate 95% of the most common words.
words.95 <- all.1gram.freq[Freq.Cum.Pct <= 0.95]$Terms
# Keep 95% of the most common words, all stopwords, as well as special tokens
# SOS/EOS.
words.whitelist <- c(words.95, stopwords(), "SOS", "EOS")
And now we apply the function defined above to replace all words not included in the white-list with the token UNK.
blogs.text.95 <- unlist(mclapply(blogs.text.preprocessed,
replaceWordsNotIn,
words.whitelist, "UNK"))
news.text.95 <- unlist(mclapply(news.text.preprocessed,
replaceWordsNotIn,
words.whitelist, "UNK"))
twitter.text.95 <- unlist(mclapply(twitter.text.preprocessed,
replaceWordsNotIn,
words.whitelist, "UNK"))
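As an illustration of the effect (a hypothetical example: we assume that the name "fabio" is not among the words covering 95% of the corpus, while all other words are):
# A pre-processed sentence containing a rare name.
replaceWordsNotIn("SOS fabio told me a story EOS", words.whitelist, "UNK")
# Expected: "SOS UNK told me a story EOS"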
After pruning rare words, we re-calculate the bigrams. From now on, we will analyze only the aggregated corpus.
# Calculate frequency of bigrams in pruned source, or load from cache.
all.text.95 <- c(blogs.text.95, news.text.95, twitter.text.95)
all.2gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 2,
stop_words = stopwords.tokens))
The chart shows the coverage of the corpus by pruned bigrams, where different types of bigrams are indicated by different colors. The chart also marks, for several counts, the points where bigrams were encountered a particular number of times. For example, there are 104,625 unique bigrams encountered more than 30 times.
By pruning we have reduced the number of unique bigrams from 9,613,640 to 7,341,432, that is, by 23.64%. At this stage it is hard to tell whether pruning makes sense: on the one hand, it reduces the number of unique 2-grams and thus the memory requirements of our application; on the other hand, it removes information which may be required to achieve a good prediction rate.
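The quoted reduction can be verified from the table sizes (a sketch):
# Relative reduction in the number of unique bigrams after pruning, in percent.
100 * (1 - nrow(all.2gram.95.freq) / nrow(all.2gram.freq))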
After analyzing bigrams, it is time to take a look at longer n-grams. We decided to analyze 3-grams to 6-grams.
# Calculate frequency of 3- to 6-grams in pruned source.
all.3gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 3,
stop_words = stopwords.tokens))
all.4gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 4,
stop_words = stopwords.tokens))
all.5gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 5,
stop_words = stopwords.tokens))
all.6gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 6,
stop_words = stopwords.tokens))
The charts below show the coverage of the corpus by pruned 2- to 6-grams, where different colors indicate n-grams with a different number of pruned words (UNK tokens). As for 2-grams, the charts also mark, for several counts, the points where n-grams were encountered a particular number of times.
As \(n\) grows, the number of repeated n-grams decreases. This property is quite obvious: for example, there are many more common 2-grams (like “last year” or “good luck”) than common 6-grams. A consequence of this property is less obvious, but clearly visible in the charts: as \(n\) grows, one requires more and more unique n-grams to cover the same percentage of the text. For single words we could choose a small subset that covers 95% of the corpora, but for longer n-grams achieving a high corpus coverage with a small subset is impossible.
Corpora Coverage | % of 2-grams | % of 3-grams | % of 4-grams | % of 5-grams | % of 6-grams |
---|---|---|---|---|---|
25% | 0.27 | 8.40 | 22.18 | 23.75 | 24.07 |
50% | 3.06 | 38.82 | 48.12 | 49.17 | 49.38 |
75% | 20.12 | 69.41 | 74.06 | 74.58 | 74.69 |
95% | 80.34 | 93.88 | 94.81 | 94.92 | 94.94 |
The table above shows the percentage of n-grams required to cover a particular percentage of the aggregated corpus for various n. For example, one requires 3.06% of 2-grams to cover 50% of the corpus, but the same coverage requires 38.82% of 3-grams. As we can see, even for 2-grams we cannot significantly reduce the number of unique n-grams without significantly reducing the coverage as well.
Conclusions from the data analysis:
Open questions:
Should we replace rare words with the token UNK, or should we keep such words?
To answer most of the questions above, we have to create several models and run them against a test data set.
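As a rough preview of that modelling step, the n-gram tables we already have could drive a naive predictor that looks up the most frequent continuations of the last observed word stem. The following is only a sketch under the current table layout (Terms stored as space-separated stems, rows sorted by descending frequency), not the model we will actually build and evaluate; in particular, it lacks smoothing and back-off:
# Naive bigram-based prediction: return the k most frequent words that follow
# the given (stemmed) word in a bigram frequency table.
predictNext <- function(word, bigram.freq, k = 3) {
  # Keep bigrams whose first word matches; the table is already sorted by
  # descending frequency, so the first matches are the most frequent ones.
  candidates <- bigram.freq[startsWith(Terms, paste0(word, " "))]
  head(sapply(strsplit(candidates$Terms, " "), "[", 2), k)
}
# Hypothetical example: predictNext("good", all.2gram.freq) might return
# something like c("luck", "morning", "time").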
Next steps: