With smartphones so prevalent and almost everyone using SMS and WhatsApp, you have probably seen a next word prediction application in use. It works as follows: as you type, the application predicts and suggests the next word in your sentence.
Several newer technologies, such as Word2Vec, can be used to build such applications. You can also explore Gensim for this purpose.
Here I present a simple solution for building this application, written in R.
Demonstration of the Tool
A Look at the Application at Work
The look and feel of the display is very basic, as it is only meant to demonstrate the logic. The logic has been built into a Shiny application and can be extended to other applications.
The Source Data
The first step in creating this solution is to gather a sizeable amount of text. The predictor works on English, so I needed a lot of English text.
I got this data from newspaper websites and used it to form the corpus. (If this is unfamiliar territory for you, please read up on it in appropriate sources.)
Create the n-grams
After obtaining the data, the next step is to create the n-grams. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application. The n-grams are typically collected from a text or speech corpus.
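To make the definition concrete, here is a minimal base R sketch that builds word-level n-grams from one sentence. The helper name make_ngrams is hypothetical and is not part of the original solution, which uses RWeka.

```r
# Hypothetical helper: build word-level n-grams from a single string (base R only).
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- words[words != ""]                      # drop empty tokens
  if (length(words) < n) return(character(0))      # too short for this order
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

make_ngrams("the quick brown fox", 2)
# "the quick" "quick brown" "brown fox"
make_ngrams("the quick brown fox", 3)
# "the quick brown" "quick brown fox"
```

A sentence of m words yields m - n + 1 n-grams, so higher orders need longer text to produce any matches.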
n-grams can be created using the RWeka package in R. (If this is unfamiliar territory for you, please read up on it in appropriate sources.)
I created 2-grams, 3-grams, 4-grams, 5-grams, 6-grams, 7-grams and 8-grams, and stored each set in its own RDS file. The purpose is to match the input text against these tables to see whether the same word sequence exists in our knowledge base and, if so, which words followed it previously and with what probabilities.
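The sketch below shows one way to turn raw n-grams into a lookup table of contexts, next words and probabilities. The column names context, word and p are the ones used by the prediction code later in this post. How the probabilities were actually computed is not stated, so the relative frequency used here, and the helper name build_ngram_table, are assumptions.

```r
# Sketch: turn raw n-grams into a lookup table with columns context, word, p.
# Assumption: p is the relative frequency of `word` among all words seen after `context`.
build_ngram_table <- function(ngrams) {
  parts   <- strsplit(ngrams, " ", fixed = TRUE)
  context <- vapply(parts, function(w) paste(head(w, -1), collapse = " "), character(1))
  word    <- vapply(parts, function(w) tail(w, 1), character(1))
  counts  <- as.data.frame(table(context = context, word = word),
                           stringsAsFactors = FALSE)
  counts  <- counts[counts$Freq > 0, ]             # keep observed pairs only
  # p = count(context, word) / count(context)
  counts$p <- counts$Freq / ave(counts$Freq, counts$context, FUN = sum)
  counts[order(counts$context, -counts$p), c("context", "word", "p")]
}

trigrams <- c("i love you", "i love you", "i love cake", "we love you")
tab <- build_ngram_table(trigrams)
tab
# For the context "i love", "you" gets p = 2/3 and "cake" gets p = 1/3.
```

A table like this could then be written to disk with saveRDS() and read back with readRDS().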
The Prediction Logic
Step 1: Load the saved n-grams
library(data.table)  # the lookups in Step 4 use data.table syntax

gram2 <- readRDS("2gram.rds")
gram3 <- readRDS("3gram.rds")
gram4 <- readRDS("4gram.rds")
gram5 <- readRDS("5gram.rds")
gram6 <- readRDS("6gram.rds")
gram7 <- readRDS("7gram.rds")
gram8 <- readRDS("8gram.rds")
Step 2: Load the text, form the corpus and clean the text
The text is the existing set of words from which the next word is to be predicted.
library(tm)
library(stringi)

# input$inputText is the text typed into the Shiny application
text <- input$inputText
mydata.corpus <- Corpus(VectorSource(text))
# Replace non-ASCII characters with spaces, then lower-case the text
mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) iconv(x, to = "ASCII", sub = " ")))
mydata.corpus <- tm_map(mydata.corpus, content_transformer(tolower))
# Remove numbers and punctuation, and collapse repeated whitespace
mydata.corpus <- tm_map(mydata.corpus, content_transformer(removeNumbers))
mydata.corpus <- tm_map(mydata.corpus, content_transformer(removePunctuation))
mydata.corpus <- tm_map(mydata.corpus, content_transformer(stripWhitespace))
mydata.corpus <- tm_map(mydata.corpus, PlainTextDocument)
mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) stri_trans_tolower(x)))
mydata.corpus <- tm_map(mydata.corpus, content_transformer(function(x) stri_trans_general(x, "en_US")))
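To see what the cleaning does to a piece of input, here is a dependency-free approximation in base R. It is not the tm pipeline itself: the helper name clean_text is hypothetical, and the final trimws() call is an extra step that the code above does not perform.

```r
# Base R approximation of the cleaning steps (not the tm pipeline itself).
clean_text <- function(x) {
  x <- iconv(x, to = "ASCII", sub = " ")   # replace non-ASCII bytes with spaces
  x <- tolower(x)                          # lower-case
  x <- gsub("[[:digit:]]+", "", x)         # drop numbers
  x <- gsub("[[:punct:]]+", "", x)         # drop punctuation, including apostrophes
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  trimws(x)                                # extra step: trim both ends
}

clean_text("  Hello, World! It's 2 o'clock.  ")
# "hello world its oclock"
```

The input text has to be cleaned the same way as the corpus the n-grams were built from, otherwise the contexts will not match.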
Step 3: Collate the fragments of the text from the corpus
frase <- unlist(mydata.corpus[[1]]$content)
prev <- unlist(strsplit(frase, split = " ", fixed = TRUE))
len <- length(prev)
fra2 <- paste(tail(prev, 1), collapse = " ")
fra3 <- paste(tail(prev, 2), collapse = " ")
fra4 <- paste(tail(prev, 3), collapse = " ")
fra5 <- paste(tail(prev, 4), collapse = " ")
fra6 <- paste(tail(prev, 5), collapse = " ")
fra7 <- paste(tail(prev, 6), collapse = " ")
fra8 <- paste(tail(prev, 7), collapse = " ")
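As an illustration of what these fragments contain, the sketch below builds them for a sample five-word input. The sample sentence and the fragments vector are made up for this example. Each fragment is the context for one n-gram order, that is, the last n - 1 words of the input.

```r
prev <- strsplit("thank you very much for", split = " ", fixed = TRUE)[[1]]

# Context for each n-gram order: the last (n - 1) words of the input.
fragments <- vapply(1:7,
                    function(k) paste(tail(prev, k), collapse = " "),
                    character(1))
names(fragments) <- paste0("fra", 2:8)

fragments[["fra2"]]  # "for"
fragments[["fra4"]]  # "very much for"
fragments[["fra8"]]  # "thank you very much for" (only 5 words are available)
```

When the input has fewer than n - 1 words, tail() simply returns all of them, so the higher-order fragments are identical and will not match any context of the full length.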
Step 4: Make the prediction
The predicted word is stored in the variable predict.
predict <- NULL

# For each n-gram order, look up the most probable word that followed the context
try(pred8 <- gram8[context == fra8, .SD[which.max(p), word]])
try(pred7 <- gram7[context == fra7, .SD[which.max(p), word]])
try(pred6 <- gram6[context == fra6, .SD[which.max(p), word]])
try(pred5 <- gram5[context == fra5, .SD[which.max(p), word]])
try(pred4 <- gram4[context == fra4, .SD[which.max(p), word]])
try(pred3 <- gram3[context == fra3, .SD[which.max(p), word]])
try(pred2 <- gram2[context == fra2, .SD[which.max(p), word]])

# Prefer the longest matching context and fall back to shorter ones
if (length(pred8) != 0) {
  predict <- pred8
} else if (length(pred7) != 0) {
  predict <- pred7
} else if (length(pred6) != 0) {
  predict <- pred6
} else if (length(pred5) != 0) {
  predict <- pred5
} else if (length(pred4) != 0) {
  predict <- pred4
} else if (length(pred3) != 0) {
  predict <- pred3
} else if (length(pred2) != 0) {
  predict <- pred2
} else {
  predict <- "Next word cannot be predicted."
}
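The same back-off idea can be written as a loop over the n-gram orders. The following is a minimal sketch in base R using plain data frames instead of data.table. The function predict_next, the tables list and the toy tables are hypothetical and only illustrate the logic.

```r
# Sketch: back off from the longest context to the shortest.
# `tables` is a hypothetical named list keyed by n-gram order, where each
# table is a data frame with the columns context, word and p.
predict_next <- function(words, tables) {
  orders <- sort(as.integer(names(tables)), decreasing = TRUE)
  for (n in orders) {
    if (length(words) < n - 1) next              # not enough words for this order
    ctx  <- paste(tail(words, n - 1), collapse = " ")
    tab  <- tables[[as.character(n)]]
    hits <- tab[tab$context == ctx, , drop = FALSE]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$p)])
  }
  "Next word cannot be predicted."
}

# Toy tables, made up for this example
toy_gram2 <- data.frame(context = c("love", "love"), word = c("you", "cake"),
                        p = c(0.7, 0.3), stringsAsFactors = FALSE)
toy_gram3 <- data.frame(context = "i love", word = "cake", p = 1,
                        stringsAsFactors = FALSE)
tables <- list(`3` = toy_gram3, `2` = toy_gram2)

predict_next(c("i", "love"), tables)    # "cake" (the 3-gram match wins)
predict_next(c("we", "love"), tables)   # "you"  (falls back to the 2-gram table)
predict_next(c("hello"), tables)        # "Next word cannot be predicted."
```

Keeping the tables in a list means that adding or removing an n-gram order does not require touching the prediction logic.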