Next Word Prediction: Stupid Backoff

Alexey Serdyuk
July 2019

Next Word Prediction

This is the capstone project for the cycle of courses Data Science Specialization offered on Coursera by Johns Hopkins University.

The purpose of this project is to develop a Shiny application which, given a text, predicts the next word. The prediction algorithm may be re-used as a part of a spelling correction module, or as a typing assistance for mobile devices.

Application: https://serdioa.shinyapps.io/predict-sb

Documentation: https://serdioa.github.io/DataScienceCapstone/ImplementationReport.html

Source code and more on GitHube site: https://github.com/serdioa/DataScienceCapstone

Algorithm

The prediction algorithm is based on n-grams model. Training text corpus is split on 1- to 5-grams (sequences of 1 to 5 words). Each n-gram is divided on a prefix (first n-1 words) and a suffix (the last word). We predict the suffix based on the prefix and frequency of n-grams.

Algorithm consists of two steps: choosing and ranking candidates. We choose as candidates all n-grams which starts with the prefix observed in the input text. We rank candidates using the Stupid Backoff algorithm. A simple extension allows to predict partially entered words as the user types them.

Off-sample tests show that our algorithm correctly predicts the next word in 15.40 % of cases, 95% conf. interval (15.15%, 15.65%). The correct word is amongst top 3 predictions in 25.00% of cases, 95% conf. interval (24.73, 25.26).

Application

Application Screenshot

  • Start typing text into the input box, the prediction will appear below as you type. Alternatively, click “Random Sample” to choose one of more than 1000 examples.

  • You may select the number of predicted candidates (1 to 10).

  • Choose “Predict partially entered words” to predict word endings as you type. In this mode you have to press the space character to predict the next word.

Prediction

Prediction Screenshot

  • The application shows predictions as you type, displaying a chart with ranks of predicted words.

  • The prediction itself takes less than 20 ms, most of the observed delay is due to the Shiny framework and network latency.

  • For more information, click on the “About” menu tab in the application.