The capstone project applies data science in the area of natural language
processing. The data, a collection of text documents also known as a corpus, have been collected from web pages of several different types. In particular, for this project we analyze corpora in American English from three distinct sources: Twitter, blogs, and news media sites. The files have been language filtered but may still contain some foreign text.
Below, we report the major features identified in the data and briefly summarize our plans for building a prediction algorithm.
Data Exploration
The data consist of three different collections of texts, or corpora: en_US.blogs.txt taken from blogs, en_US.news.txt from news sites, and en_US.twitter.txt from Twitter. In the following we refer to these data files, or corpora, simply by their source, i.e. blogs, news, and twitter. The following table presents the word and line counts for each of the three corpora.
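The counts can be obtained along the lines of the sketch below; the variable names and the readLines options are our own choices, and the whitespace-based word count is only approximate.

    # Read the three corpora and tabulate line and (approximate) word counts.
    blogs_lines   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    news_lines    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    twitter_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

    count_summary <- sapply(
      list(blogs = blogs_lines, news = news_lines, twitter = twitter_lines),
      function(x) c(lines = length(x),
                    words = sum(lengths(strsplit(x, "\\s+")))))
    t(count_summary)   # one row per corpus: line count and approximate word count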
We use the tm framework to clean the documents and the RWeka package for tokenization. Given the sheer amount of data contained in the three corpora, the analysis is performed on a random sample of approximately 1% of the total data.
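One possible way to draw the roughly 1% sample is shown below; the exact sampling scheme and the seed are assumptions for illustration, not details taken from the analysis.

    # Draw an illustrative 1% random sample of lines from each corpus.
    set.seed(1234)                                    # arbitrary seed
    sample_lines <- function(x, fraction = 0.01) {
      sample(x, size = round(fraction * length(x)))
    }
    blogs_sample   <- sample_lines(blogs_lines)
    news_sample    <- sample_lines(news_lines)
    twitter_sample <- sample_lines(twitter_lines)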
In cleaning and preprocessing the data, the following transformations were performed (a sketch of the pipeline follows the list):
conversion of each word to lower case, e.g. Tree to tree
removal of URLs, e.g. www.someaddress.com
removal of punctuation: commas, colons, semicolons, dashes, parentheses
removal of numbers
removal of English stop words, such as a, the, and to
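The sketch below implements these transformations with the tm package. The URL-removal regular expression and the choice to process the three samples as a single corpus are assumptions made for illustration.

    library(tm)

    corpus <- VCorpus(VectorSource(c(blogs_sample, news_sample, twitter_sample)))

    # Illustrative URL remover; the regex is our own choice.
    removeURLs <- content_transformer(function(x) gsub("(https?://|www\\.)\\S+", " ", x))

    corpus <- tm_map(corpus, content_transformer(tolower))        # Tree -> tree
    corpus <- tm_map(corpus, removeURLs)                          # e.g. www.someaddress.com
    corpus <- tm_map(corpus, removePunctuation)                   # commas, colons, dashes, ...
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))   # a, the, to, ...
    corpus <- tm_map(corpus, stripWhitespace)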
After cleaning and preprocessing the corpora, the number of unique words in each is as follows:
Below we show the top ten most frequent words for each of the corpora.
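One way to obtain the unique-word counts and the most frequent words is through a term-document matrix, as sketched below for the combined sample; the same steps can be repeated separately per corpus. The exact procedure used in the analysis may differ.

    # Unigram frequencies from the cleaned sample.
    tdm  <- TermDocumentMatrix(corpus)
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # feasible for a 1% sample

    length(freq)    # number of unique words
    head(freq, 10)  # ten most frequent words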
Below we also show the number of unique words needed to cover all word occurrences in the language sample.
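Coverage can be estimated from the cumulative word frequencies; the 50% and 90% thresholds below are illustrative choices, not figures from the analysis.

    # Number of most-frequent words needed to cover a given share of all
    # word occurrences in the sample.
    coverage <- function(freq, threshold) {
      which(cumsum(freq) / sum(freq) >= threshold)[1]
    }
    coverage(freq, 0.50)   # words needed for 50% coverage
    coverage(freq, 0.90)   # words needed for 90% coverage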
To achieve acceptable accuracy in the prediction algorithm, it is important to obtain the frequencies of word pairs, or bigrams. Accuracy is improved further by incorporating into the model the frequencies of longer word sequences: three-, four-, or in general n-grams.
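Bigram frequencies can be extracted by plugging an RWeka tokenizer into the term-document matrix, as in the sketch below; changing min and max generalizes it to higher-order n-grams.

    library(RWeka)

    # Bigram tokenizer used by tm when building the term-document matrix.
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    tdm2  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
    freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
    head(freq2, 10)   # most frequent word pairs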
Discussion
There are subtle differences among the three corpora. It therefore seems that the optimal prediction algorithm would combine, as a weighted average, the n-grams obtained from the three sources at our disposal. Accuracy can also be improved with more detailed cleaning of the data and by adjusting the processing to the type of source. Stop words will be added back so that they too can be predicted, without adding too much complexity to the prediction algorithm.
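As a rough illustration of the weighted-average idea, one could linearly interpolate per-source n-gram probabilities as sketched below; the function and the weights are hypothetical placeholders to be tuned later, not values derived from this analysis.

    # Hypothetical linear interpolation of per-source n-gram probabilities.
    weights <- c(blogs = 0.4, news = 0.4, twitter = 0.2)   # placeholder weights
    score <- function(ngram, freqs) {
      # freqs: named list of per-source frequency vectors (e.g. from per-source TDMs)
      sum(sapply(names(weights), function(src) {
        f <- freqs[[src]]
        p <- if (ngram %in% names(f)) f[[ngram]] / sum(f) else 0
        weights[[src]] * p
      }))
    }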