This is a milestone report for Week 2 of the capstone project of the Data Science Specialization offered on Coursera by Johns Hopkins University.
The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application may be used, for example, in mobile devices to provide suggestions as the user types in some text.
In this report we provide an initial analysis of the data and discuss our approach to building the application.
An important question is which library to use for processing and analyzing the corpora, as R provides several alternatives. Initially we attempted to use the library tm, but quickly found that it is very memory-hungry, and building bi- or trigrams for a large corpus with it is not practical. After some googling we decided to use the library quanteda instead.
We start by loading required libraries.
library(data.table) # For fast access in data tables.
library(ggplot2) # For plotting charts.
library(ggforce) # For plotting charts.
library(grid) # For arranging charts in a grid.
library(gridExtra) # For arranging charts in a grid.
library(kableExtra) # For pretty-printing tables.
library(parallel) # For parallel processing.
library(quanteda) # For handling the corpora.
library(readr) # For fast reading/writing.
library(R.utils) # For counting lines in files.
library(stringr) # For operations on strings.
library(tidyverse) # For cleaning up and faster modifications of data tables.
To speed up processing of large data sets, we will use the parallel version of the lapply function from the library parallel. To use all the available resources, we detect the number of CPU cores and configure the library to use them all.
cpu.cores <- detectCores()
options(mc.cores = cpu.cores)
Here and in several places below we use caching to speed up rendering of this document. Results of long-running operations are stored and reused during the next run. If you wish to re-run all operations, just remove the cache directory.
if (!dir.exists("cache")) {
dir.create("cache")
}
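As an illustration of this idea, here is a minimal sketch of the caching pattern (the helper name cached and the file names are hypothetical; the actual chunks below may rely on knitr's chunk caching or explicit intermediate files instead):
# Hypothetical helper illustrating the caching pattern used in this report.
cached <- function(path, expr) {
  if (file.exists(path)) {
    # Reuse the result stored during a previous run.
    readr::read_rds(path)
  } else {
    # Compute the result, store it for the next run, and return it.
    # `expr` is evaluated lazily, so the computation is skipped entirely
    # when a cached file already exists.
    result <- expr
    readr::write_rds(result, path)
    result
  }
}
# Example: freq <- cached("cache/some_result.rds", some_long_computation())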
We download the data from the URL provided in the course description, and unzip it.
if (!file.exists("cache/Coursera-SwiftKey.zip")) {
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "cache/Coursera-SwiftKey.zip", method = "curl")
unzip("cache/Coursera-SwiftKey.zip", exdir = "cache")
}
The downloaded zip file contains corpora in several languages: English, German, Russian and Finnish. In our project we will use only English corpora.
The corpora in each language, including English, contain 3 files with content obtained from different sources: news, blogs and Twitter.
As the first step, we will split each relevant file into 3 parts: a training set (60% of the lines), a testing set (20%) and a validation set (20%).
We define a function which splits the specified file into the parts described above:
# Arguments:
# name - the file to split
# out.dir - output directory
splitFile <- function(name, out.dir) {
# Reading dataset from the input file.
data <- read_lines(name)
# Prepare list with indexes of all data items.
data.index <- 1:length(data)
# Sample indices for the training data set, and create a set with remaining
# indices.
training.index <- sample(data.index, 0.6 * length(data.index))
remaining.index <- data.index[! data.index %in% training.index]
# Sample indices for the testing data set, and use remaining indices
# for a validation data set.
testing.index <- sample(remaining.index, 0.5 * length(remaining.index))
validation.index <- remaining.index[! remaining.index %in% testing.index]
# Split the data.
data.training <- data[training.index]
data.testing <- data[testing.index]
data.validation <- data[validation.index]
# Create an output directory, if it does not exist.
if (!dir.exists(out.dir)) {
dir.create(out.dir)
}
# Prepare names for output files. We append suffixes "training", "testing"
# and "validation" to the input file name before the extension.
base <- basename(name)
outTraining <- file.path(out.dir, sub("(.)\\.[^.]+$", "\\1.training.txt", base))
outTesting <- file.path(out.dir, sub("(.)\\.[^.]+$", "\\1.testing.txt", base))
outValidation <- file.path(out.dir, sub("(.)\\.[^.]+$", "\\1.validation.txt", base))
# Writing datasets to output files.
write_lines(data.training, outTraining)
write_lines(data.testing, outTesting)
write_lines(data.validation, outValidation)
}
To make the results reproducible, we set the seed of the random number generator.
set.seed(20190530)
Finally, we split each of the data files.
splitFile("cache/final/en_US/en_US.blogs.txt", "cache")
splitFile("cache/final/en_US/en_US.news.txt", "cache")
splitFile("cache/final/en_US/en_US.twitter.txt", "cache")
As a sanity check, we count the number of lines in each source file, as well as in the partial files produced by the split.
count.blogs <- R.utils::countLines("cache/final/en_US/en_US.blogs.txt")
count.blogs.training <- R.utils::countLines("cache/en_US.blogs.training.txt")
count.blogs.testing <- R.utils::countLines("cache/en_US.blogs.testing.txt")
count.blogs.validation <- R.utils::countLines("cache/en_US.blogs.validation.txt")
count.news <- R.utils::countLines("cache/final/en_US/en_US.news.txt")
count.news.training <- R.utils::countLines("cache/en_US.news.training.txt")
count.news.testing <- R.utils::countLines("cache/en_US.news.testing.txt")
count.news.validation <- R.utils::countLines("cache/en_US.news.validation.txt")
count.twitter <- R.utils::countLines("cache/final/en_US/en_US.twitter.txt")
count.twitter.training <- R.utils::countLines("cache/en_US.twitter.training.txt")
count.twitter.testing <- R.utils::countLines("cache/en_US.twitter.testing.txt")
count.twitter.validation <- R.utils::countLines("cache/en_US.twitter.validation.txt")
 | Blogs: Rows | Blogs: % | News: Rows | News: % | Twitter: Rows | Twitter: % |
---|---|---|---|---|---|---|
Training | 539572 | 59.99991 | 606145 | 59.99998 | 1416088 | 59.99997 |
Testing | 179858 | 20.00004 | 202048 | 19.99996 | 472030 | 20.00002 |
Validation | 179858 | 20.00004 | 202049 | 20.00006 | 472030 | 20.00002 |
Total | 899288 | 100.00000 | 1010242 | 100.00000 | 2360148 | 100.00000 |
Control (expected to be 0) | 0 | NA | 0 | NA | 0 | NA |
As the table shows, we have split the data into subsets as intended.
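For the blogs source, for example, the percentages and the control value in the table can be recomputed from the line counts above (a minimal sketch; the table aggregates the same checks for all three sources):
# Shares of the blogs parts in percent, and the control value (expected to be 0).
blogs.parts <- c(training = count.blogs.training,
                 testing = count.blogs.testing,
                 validation = count.blogs.validation)
round(100 * blogs.parts / count.blogs, 5)
count.blogs - sum(blogs.parts)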
In the section above we have already counted the number of lines. Let us load the training data sets and take a look at the first 3 lines of each.
blogs.text <- read_lines("cache/en_US.blogs.training.txt")
news.text <- read_lines("cache/en_US.news.training.txt")
twitter.text <- read_lines("cache/en_US.twitter.training.txt")
head(blogs.text, 3)
## [1] "a. By “your local” I mean whatever’s local to you – you can decide whether that means Eugene, the UO, your hometown, where you want to work … whatever.)"
## [2] "And told me who you are"
## [3] "Between April 5, 2006 and December 31, 2006, Murphy made no fewer than 18 factually inaccurate statements in her TV commentary about the lacrosse case. She made at least eight more factually inaccurate statements about the case in December 21, 2006 and January 9, 2007 “talking points” forwarded by “victims’ rights” groups, plus at least one factual error in a late 2006 USA Today op-ed. Twenty-seven outright errors of fact on a single case is quite a tally. And that list, of course, doesn’t include Murphy’s misleading statements that were phrased in the form of questions or speculation, or her use of unsubstantiated rumors."
head(news.text, 3)
## [1] "4. earthquakes"
## [2] "He had a toy gun and holster, a cowboy shirt and a 10-gallon, er, a 10-pint hat."
## [3] "We'll only know for sure if Cutler himself stays upright."
head(twitter.text, 3)
## [1] "no more mike? Fox dropped mike for him? I think im gonna start watching 9news now."
## [2] "Here . I Lovee You !"
## [3] "About to get an exam on my shoulder and then some A.R.T."
We can see that the data contain not only words, but also numbers and punctuation. The punctuation may be non-ASCII (Unicode), as the first example in the blogs sample shows: it contains the character “…”, which is different from three ASCII period characters “...”. Some lines may contain multiple sentences, and we probably have to take this into account.
Here is our plan:
We decided to split the text into sentences and not to attempt to predict words across sentence borders. We may still use information about sentence boundaries to improve prediction of the first word, because the frequency of a word at the beginning of a sentence may be very different from its average frequency.
blogs.text <- unlist(tokenizers::tokenize_sentences(blogs.text))
news.text <- unlist(tokenizers::tokenize_sentences(news.text))
twitter.text <- unlist(tokenizers::tokenize_sentences(twitter.text))
The libraries contain some functions for cleaning up and pre-processing, but for some steps we have to write our own functions.
# Remove URLs. The regular expression detects http(s) and ftp(s) protocols.
removeUrl <- function(x) gsub("(ht|f)tp(s?)://\\S+", "", x)
# Remove e-mail addresses.
# The regular expression from Stack Overflow:
# https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression
removeEmail <- function(x) gsub("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21-\\x5a\\x53-\\x7f]|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])+)\\])", "", x, perl = TRUE)
# Remove hash tags (the character # and the following word) and twitter handles
# (the character @ and the following word).
removeTagsAndHandles <- function(x) gsub("[@#]\\S+", "", x)
# Replace punctuation marks which do not appear inside a word with space
# characters. Without this step, fragments with a missing space are transformed
# into a single non-existing word when punctuation is removed.
# Example: the corpus contains
# "I had the best day yesterday,it was like two kids in a candy store"
# Without this step, "yesterday,it" is transformed into the non-existing word
# "yesterdayit" when removing punctuation. This step transforms it into
# "yesterday it".
addMissingSpace <- function(x) gsub("[,()\":;”…]", " ", x)
# Replace words in a sentence with replacements available from the table.
# Keep words which are not in the replacement table "as is".
# As a side effect, removes punctuation and transforms to a lower case.
#
# This step is required for several purposes:
# * Replace common short forms with full forms, for example "he'll" = "he will"
replacements.text <- readr::read_csv("replacements.txt",
col_names = c("token", "replacement"),
col_types = list(col_character(), col_character()))
replaceWords <- function(text, replacements) {
# Split text on words.
tokens.orig <- tokenizers::tokenize_words(text, simplify = TRUE,
strip_numeric = TRUE)
# Attempt to replace each word.
tokens.replaced <- sapply(tokens.orig, function(x) {
# Check whether a replacement exists.
replacement.index <- match(x, replacements$token)
if (is.na(replacement.index)) {
# Can't find a replacement, fall back on the token itself.
return (x)
} else {
# Replace the token.
return (replacements$replacement[replacement.index])
}
}, USE.NAMES = FALSE)
paste(tokens.replaced, collapse = " ")
}
# Add tokens representing start and end of a sentence.
# SOS = Start Of Sentence
# EOS = End Of Sentence
# When we add these tokens, the text has already been transformed to lower
# case, so we can easily distinguish upper-case special tokens from the
# lower-case text.
addSentenceTokens <- function(x) paste("SOS", x, "EOS")
# Collapse space characters: if there are more than 1 space character in a row,
# replace with a single one.
collapseWhitespace <- function(x) gsub("\\s+", " ", x)
# ... And now combine all functions in a pre-processing chain.
preProcessText <- function(x) {
text <- removeUrl(x)
text <- removeEmail(text)
text <- removeTagsAndHandles(text)
text <- addMissingSpace(text)
text <- replaceWords(text, replacements.text)
text <- addSentenceTokens(text)
collapseWhitespace(text)
}
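To see the whole chain in action, we can feed it a made-up sentence. The output shown in the comment is only an expectation, since the actual result depends on the contents of replacements.txt:
# Illustrative run of the pre-processing chain on an invented sentence.
preProcessText("Check out https://example.com, you'll love it!")
# Expected to look roughly like: "SOS check out you will love it EOS"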
Now we pre-process the data.
blogs.text.preprocessed <- unlist(mclapply(blogs.text, preProcessText))
news.text.preprocessed <- unlist(mclapply(news.text, preProcessText))
twitter.text.preprocessed <- unlist(mclapply(twitter.text, preProcessText))
In this section we will study the distribution of words in the corpora, ignoring for the moment interactions between words (n-grams).
We define two helper functions. The first one creates a Document Feature Matrix (DFM) for n-grams in documents and aggregates it over all documents into a Feature Vector. The second enriches the Feature Vector with additional values useful for our analysis, such as the cumulative coverage of the text.
# Calculate Document Feature Matrix (DFM) for n-grams in documents,
# and aggregate it over all documents to a Feature Vector.
build.ngram <- function(text, n, min_freq = 1, stop_words = NULL) {
# Split text on 1-grams.
text.tokens <- tokens(text)
# Remove stop-words, if required.
if (!is.null(stop_words)) {
text.tokens <- tokens_remove(text.tokens, stop_words)
}
# Stem words. We use explicit stemming to make sure that it is fully
# compatible with other places later in the code where we have to apply
# the stemming manually.
text.tokens <- as.tokens(
mclapply(text.tokens, function(x) SnowballC::wordStem(x, language = "en")))
# Create n-grams, if n > 1
if (n > 1) {
text.tokens <- tokens_ngrams(text.tokens, n = n, concatenator = " ")
}
# Special case: if our corpus contains empty sentences, then the 2-grams
# contain "SOS EOS", that is, sequences of "Start-Of-Sentence" +
# "End-Of-Sentence" tokens. This may happen, for example, if a sentence
# contains only stop words, or in some weird cases like a tweet that contains
# only a time such as "8:12". We are not interested in empty sequences, so we
# remove such tokens.
if (n == 2) {
text.tokens <- tokens_remove(text.tokens, c("SOS EOS"))
}
# Calculate the Document Feature Matrix
text.dfm <- dfm(text.tokens, tolower = FALSE)
# Remove from DFM least frequent features, if requested.
if (min_freq > 1) {
text.dfm <- dfm_trim(text.dfm, min_termfreq = min_freq)
}
# Sum over all documents.
colSums(text.dfm)
}
# Sorts a Feature Vector in descending order of frequency and enriches it with
# additional columns:
# * Cumulative frequency of terms (words or n-grams)
# * Cumulative frequency as a percentage of the total.
enrich.ngram <- function(fv) {
# Transform Feature Vector to a table and sort by frequency descending.
tbl <- data.table(Terms = names(fv), Freq = fv)
tbl <- tbl[order(-Freq)]
# Add columns with cumulative frequency as a number of words, and as percentage.
tbl$Freq.Cum <- cumsum(tbl$Freq)
tbl$Freq.Cum.Pct <- tbl$Freq.Cum / sum(tbl$Freq)
return (tbl)
}
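Before applying the helpers to the real corpora, a tiny made-up example may help to see what the enriched table looks like:
# Toy example (invented sentences). With n = 1 and no stop-word removal,
# the most frequent term is "the", and Freq.Cum.Pct shows which share of all
# tokens is covered by the terms seen so far.
toy.text <- c("SOS the cat sat on the mat EOS", "SOS the dog sat EOS")
head(enrich.ngram(build.ngram(toy.text, 1)))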
Now we can calculate the frequency of words in each source, as well as in all sources together (aggregated).
# Define stop-words for 1-grams: standard stop words, as well as our special
# tokens "Start-Of-Sentence" and "End-Of-Sentence".
stopwords.tokens <- c(stopwords(), "SOS", "EOS")
# Calculate the frequency of words in each source, as well as aggregated over all sources.
blogs.1gram.freq <- enrich.ngram(build.ngram(blogs.text.preprocessed, 1,
stop_words = stopwords.tokens))
news.1gram.freq <- enrich.ngram(build.ngram(news.text.preprocessed, 1,
stop_words = stopwords.tokens))
twitter.1gram.freq <- enrich.ngram(build.ngram(twitter.text.preprocessed, 1,
stop_words = stopwords.tokens))
all.text.preprocessed <- c(blogs.text.preprocessed, news.text.preprocessed,
twitter.text.preprocessed)
all.1gram.freq <- enrich.ngram(build.ngram(all.text.preprocessed, 1,
stop_words = stopwords.tokens))
The following chart displays the 20 most frequent words in each source, as well as in the aggregated corpora.
As we see from the chart, the top-20 most frequent words differ between sources. For example, the most frequent word in news is “said”, but this word is not included in the top-20 lists for blogs and Twitter at all. At the same time, some words are shared between the lists: the word “can” is the 2nd most frequent in blogs, the 3rd most frequent in Twitter, and the 5th in news.
Our next step is to analyze the intersection, that is, to find how many words are common to all sources and how many are unique to a particular source. Not only the number of words is important, but also the source coverage, that is, what percentage of the whole text of a particular source is covered by a particular subset of words.
The following Venn diagram shows the number of unique words (stems) used in each source, as well as the percentage of the aggregated corpora covered by those words.
As we can see, 46686 words are shared by all 3 corpora, but those words cover 97.46% of the aggregated corpora. On the other hand, there are 83185 words unique to blogs, but these words appear very infrequently, covering just 0.43% of the aggregated corpora.
The Venn diagram indicates that we may get a high coverage of all corpora by choosing common words. Coverage by words specific to a particular corpus is negligible.
The next step in our analysis is to find out how many common words we should choose to achieve a decent coverage of the text. From the Venn diagram we already know that by choosing 46686 words we cover 97.46% of the aggregated corpora, but perhaps we can reduce the number of words without significantly reducing the coverage.
The following chart shows the number of unique words in each source which cover a particular percentage of the text. For example, the 1000 most frequent words cover 68.09% of the Twitter corpus. An interesting observation is that Twitter requires fewer words to cover a particular percentage of the text, whereas news requires more.
Corpora Coverage | Blogs | News | Twitter | Aggregated |
---|---|---|---|---|
75% | 2,004 | 2,171 | 1,539 | 2,136 |
90% | 6,395 | 6,718 | 5,325 | 6,941 |
95% | 13,369 | 13,689 | 11,922 | 15,002 |
99% | 63,110 | 53,294 | 71,575 | 88,267 |
99.9% | 149,650 | 126,585 | 161,873 | 302,693 |
The table shows that in order to cover 95% of blogs, we require 13,369 words. The same coverage of news requires 13,689 words, and of Twitter 11,922 words. To cover 95% of the aggregated corpora, we require 15,002 unique words. We may use this fact later to reduce the number of n-grams required for prediction.
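The counts in the table can be recovered directly from the enriched frequency tables. For instance, the number of unique stems needed to cover 95% of the aggregated corpus (a sketch, which may differ by one from the table depending on boundary handling):
# Number of most frequent word stems whose cumulative share stays within 95%.
sum(all.1gram.freq$Freq.Cum.Pct <= 0.95)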
In this section we will study the distribution of bigrams, that is, combinations of two words.
Using the previously defined functions, we can calculate the frequency of bigrams in each source, as well as in all sources together (aggregated).
# Calculate the frequency of bigrams in each source, as well as aggregated over all sources.
blogs.2gram.freq <- enrich.ngram(build.ngram(blogs.text.preprocessed, 2,
stop_words = stopwords.tokens))
news.2gram.freq <- enrich.ngram(build.ngram(news.text.preprocessed, 2,
stop_words = stopwords.tokens))
twitter.2gram.freq <- enrich.ngram(build.ngram(twitter.text.preprocessed, 2,
stop_words = stopwords.tokens))
all.text.preprocessed <- c(blogs.text.preprocessed, news.text.preprocessed,
twitter.text.preprocessed)
all.2gram.freq <- enrich.ngram(build.ngram(all.text.preprocessed, 2,
stop_words = stopwords.tokens))
The following chart displays the 20 most frequent bigrams in each source, as well as in the aggregated corpora.
We immediately see a difference from the lists of top-20 words: there were many more words shared between sources than there are shared bigrams. There are still some common bigrams, but the intersection is smaller.
Similar to how we proceeded with words, we now analyze the intersections, that is, we find how many bigrams are common to all sources and how many are unique to a particular source. We also calculate the percentage of each source covered by a particular subset of bigrams.
The following Venn diagram shows the number of unique bigrams used in each source, as well as the percentage of the aggregated corpora covered by those bigrams.
The difference between words and bigrams is even more pronounced here. Bigrams common to all sources cover just 46.23% of the text, compared to more than 95% covered by words common to all sources.
The next step in our analysis is to find out how many common bigrams we should choose to achieve a decent coverage of the text.
The following chart shows a number of unique bigrams in each source which cover particular percentage of the text. For example, 1000 most-frequent bigrams cover 8.66% of the Twitter corpus.
Corpora Coverage | Blogs | News | Twitter | Aggregated |
---|---|---|---|---|
75% | 1,945,493 | 1,810,320 | 1,154,697 | 2,697,841 |
90% | 3,449,146 | 3,393,854 | 2,329,516 | 6,772,302 |
95% | 3,950,364 | 3,921,699 | 2,721,122 | 8,192,971 |
99% | 4,351,338 | 4,343,975 | 3,034,407 | 9,329,506 |
99.9% | 4,441,557 | 4,438,987 | 3,104,896 | 9,585,226 |
The table shows that in order to cover 95% of blogs, we require 3,950,364 bigrams. The same coverage of news requires 3,921,699 bigrams, and of Twitter 2,721,122 bigrams. To cover 95% of the aggregated corpora, we require 8,192,971 bigrams.
The chart is also very different from the corresponding chart for words. The curve for words had an “S”-shape, that is, its growth slowed down after some number of words, so that adding more words results in diminishing returns. For bigrams, there is no point of diminishing returns: the curves just keep rising.
As we found in the section Analyzing words (1-grams), our corpora contain \(N_1=\) 335,906 unique word stems. Potentially there could be \(N_1^2=\) 112,832,840,836 bigrams, but we have observed only \(N_2=\) 9,613,640, that is, 0.0085% of all possible bigrams. Still, the number of observed bigrams is pretty large. In the section Analyzing words (1-grams) we also found that we can cover a large part of the corpus with a relatively small number of unique word stems. In the next section we will see whether we can reduce the number of unique 2-grams by utilizing that knowledge.
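The quoted numbers follow directly from the sizes of the frequency tables, assuming that each row corresponds to one unique stem or bigram (a sketch):
n1 <- nrow(all.1gram.freq)  # Unique word stems.
n2 <- nrow(all.2gram.freq)  # Observed unique bigrams.
c(possible = n1^2, observed = n2, observed.pct = 100 * n2 / n1^2)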
We have found in the section Analyzing words (1-grams) that our corpora contain \(N_1=\) 335,906 unique word stems, but just 15,002 of them cover 95% of the corpus. In this section we will analyze whether we can reduce the number of bigrams by utilizing that knowledge.
We will replace rare words with a special token UNK. This will reduce the number of bigrams, because different word sequences may now produce the same bigram if those sequences contain rare words. For example, our word list contains the names “Andrei”, “Charley” and “Fabio”, but these words do not belong to the subset of most common words required to achieve 95% coverage of the corpus. If our corpus contains the bigrams “Andrei told”, “Charley told” and “Fabio told”, we will replace them all with the bigram “UNK told”.
Since we will apply the same approach to 3-grams, 4-grams and so on, to save time we prune the corpora once and save the results to files which we can load later.
We start by defining a function that accepts a sequence of words, a white-list and a replacement token. All words in the sentence which are not included in the white-list are replaced by the token.
# Replace words not in the provided white-list with a replacement token.
replaceWordsNotIn <- function(text, whitelist, replacement) {
# Split text on words.
tokens.orig <- unlist(tokens(text))
# Stem words.
tokens.stem <- SnowballC::wordStem(tokens.orig, language = "en")
# Check each stem against a whitelist.
tokens.replaced <- mapply(function(word.orig, word.stem) {
# Check whether the stem is in the whitelist.
replacement.index <- match(word.stem, whitelist)
if (is.na(replacement.index)) {
# The word is not in the whitelist: replace it.
return (replacement)
} else {
# The word in the whitelist: use the original.
return (word.orig)
}
}, tokens.orig, tokens.stem, SIMPLIFY = TRUE, USE.NAMES = FALSE)
paste(tokens.replaced, collapse = " ")
}
Now we create a white-list that contains the most common words covering 95% of the aggregated corpus, all stop words, and the special tokens SOS and EOS:
# Calculate 95% of the most common words.
words.95 <- all.1gram.freq[Freq.Cum.Pct <= 0.95]$Terms
# Keep 95% of the most common words, all stopwords, as well as special tokens
# SOS/EOS.
words.whitelist <- c(words.95, stopwords(), "SOS", "EOS")
And now we apply the function defined above to replace all words not included in the white-list with the token UNK.
blogs.text.95 <- unlist(mclapply(blogs.text.preprocessed,
replaceWordsNotIn,
words.whitelist, "UNK"))
news.text.95 <- unlist(mclapply(news.text.preprocessed,
replaceWordsNotIn,
words.whitelist, "UNK"))
twitter.text.95 <- unlist(mclapply(twitter.text.preprocessed,
replaceWordsNotIn,
words.whitelist, "UNK"))
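As an illustration of the effect (a hypothetical example: we assume that the name "fabio" is not among the words covering 95% of the corpus, while all other words are):
# A pre-processed sentence containing a rare name.
replaceWordsNotIn("SOS fabio told me a story EOS", words.whitelist, "UNK")
# Expected: "SOS UNK told me a story EOS"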
After pruning rare words, we re-calculate the bigrams. From now on, we will analyze only the aggregated corpus.
# Calculate frequency of bigrams in pruned source, or load from cache.
all.text.95 <- c(blogs.text.95, news.text.95, twitter.text.95)
all.2gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 2,
stop_words = stopwords.tokens))
The chart shows the coverage of the corpus by pruned bigrams, where different types of bigrams are indicated by different colors. The chart also marks, for several counts, the points where bigrams were encountered a particular number of times. For example, there are 104,625 unique bigrams encountered more than 30 times.
By pruning we have reduced the number of unique bigrams from 9,613,640 to 7,341,432, that is, by 23.64%. At this stage it is hard to tell whether pruning makes sense: on the one hand, it reduces the number of unique 2-grams and thus the memory requirements of our application; on the other hand, it removes information which may be required to achieve a good prediction rate.
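The quoted reduction can be verified from the table sizes (a sketch):
# Relative reduction in the number of unique bigrams after pruning, in percent.
100 * (1 - nrow(all.2gram.95.freq) / nrow(all.2gram.freq))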
After analyzing bigrams, it is time to take a look at longer n-grams. We decided to analyze 3-grams to 6-grams.
# Calculate frequency of 3- to 6-grams in pruned source.
all.3gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 3,
stop_words = stopwords.tokens))
all.4gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 4,
stop_words = stopwords.tokens))
all.5gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 5,
stop_words = stopwords.tokens))
all.6gram.95.freq <- enrich.ngram(build.ngram(all.text.95, 6,
stop_words = stopwords.tokens))
The charts below show the coverage of the corpus by pruned 2- to 6-grams, where different colors indicate n-grams with a different number of pruned words (UNK tokens). As for 2-grams, the charts also mark, for several counts, the points where n-grams were encountered a particular number of times.
As \(n\) grows, the number of repeated n-grams decreases. This property is quite obvious: for example, there are many more common 2-grams (like “last year” or “good luck”) than common 6-grams. A consequence of this property is less obvious, but clearly visible in the charts: as \(n\) grows, one requires more and more unique n-grams to cover the same percentage of the text. For single words we could choose a small subset that covers 95% of the corpora, but for longer n-grams achieving a high corpus coverage with a small subset is impossible.
Corpora Coverage | % of 2-grams | % of 3-grams | % of 4-grams | % of 5-grams | % of 6-grams |
---|---|---|---|---|---|
25% | 0.27 | 8.40 | 22.18 | 23.75 | 24.07 |
50% | 3.06 | 38.82 | 48.12 | 49.17 | 49.38 |
75% | 20.12 | 69.41 | 74.06 | 74.58 | 74.69 |
95% | 80.34 | 93.88 | 94.81 | 94.92 | 94.94 |
The table above shows the percentage of n-grams required to cover a particular percentage of the aggregated corpus for various n. For example, one requires 3.06% of 2-grams to cover 50% of the corpus, but the same coverage requires 38.82% of 3-grams. As we can see, even for 2-grams we cannot significantly reduce the number of unique n-grams without significantly reducing the coverage as well.
Conclusions from the data analysis:
Open questions:
Should we replace rare words with the token UNK, or should we keep such words?
To answer most of the questions above, we have to create several models and run them against a test data set.
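As a rough preview of that modelling step, the n-gram tables we already have could drive a naive predictor that looks up the most frequent continuations of the last observed word stem. The following is only a sketch under the current table layout (Terms stored as space-separated stems, rows sorted by descending frequency), not the model we will actually build and evaluate; in particular, it lacks smoothing and back-off:
# Naive bigram-based prediction: return the k most frequent words that follow
# the given (stemmed) word in a bigram frequency table.
predictNext <- function(word, bigram.freq, k = 3) {
  # Keep bigrams whose first word matches; the table is already sorted by
  # descending frequency, so the first matches are the most frequent ones.
  candidates <- bigram.freq[startsWith(Terms, paste0(word, " "))]
  head(sapply(strsplit(candidates$Terms, " "), "[", 2), k)
}
# Hypothetical example: predictNext("good", all.2gram.freq) might return
# something like c("luck", "morning", "time").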
Next steps: