In a previous exercise set, we practiced retrieving data from Twitter. In this exercise, we start getting comfortable with manipulating text data.
We will start by refreshing our memory on how to use some base-R functions, then move on to functions from the tm package.
Answers to the exercises are available here.
Exercise 1
Use readLines to download the short story “Hansel and Gretel” from textfiles.com. Save it to an object called hs.
Exercise 2
Use strsplit() to convert hs to a vector, where each element is a single word.
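One possible sketch of these first two steps (the exact file path on textfiles.com is an assumption; substitute the actual URL of the story):

```r
# Read the story line by line (URL is an assumption -- check textfiles.com)
hs <- readLines("http://www.textfiles.com/stories/hansel.txt")

# Split every line on spaces and flatten into a single vector of words
hs <- unlist(strsplit(hs, " "))
```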
Exercise 3
Make all the words lowercase.
Exercise 4
Using regular expressions, get rid of punctuation (or, more generally, of non-letter characters). Then notice that some elements are now empty; get rid of those too. Can you see why it is important to take these steps in this order?
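The order matters: the letters-only pattern below assumes the words are already lowercase, and the empty strings only appear after the non-letters have been stripped out. A minimal sketch:

```r
hs <- tolower(hs)             # make everything lowercase first
hs <- gsub("[^a-z]", "", hs)  # then drop anything that is not a letter
hs <- hs[hs != ""]            # finally remove the now-empty elements
```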
Exercise 5
Create a table counting the number of times each word appears. Then make a frequency plot of the 15 most common words.
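A base-R sketch of the counting and plotting step:

```r
freq <- sort(table(hs), decreasing = TRUE)  # word counts, most frequent first
barplot(freq[1:15], las = 2)                # bar plot of the 15 most common words
```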
Exercise 6
Notice that the most common words (“the”, “a”, “and”, “to”, etc.) do not shed much light on the content of the story, since they appear in any English text. Install and load the tm package and remove all the “stop words” given by stopwords("en"). Since we have already removed punctuation from our vector, we need to do the same to the stop words vector stopwords("en"). Now repeat the steps from Exercise 5.
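One way to do this (a sketch, assuming hs is the cleaned word vector from the previous exercises):

```r
library(tm)

# Clean the stop words the same way we cleaned hs, so e.g. "don't" becomes "dont"
stops <- gsub("[^a-z]", "", stopwords("en"))

hs <- hs[!hs %in% stops]  # drop all stop words, then redo the table and the plot
```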
Exercise 7
Sort hs in alphabetical order and take a look at the vector. First, you will see that, due to a typo in the original text file, one element is a single meaningless letter. Simply get rid of all one-letter words. Second, you will see that some words are, in essence, almost the same, such as open and opened or woodcutter and woodcutters. Use a tm function to stem the remaining words; note that you first need to load the SnowballC package. Then repeat the plot.
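A sketch of the cleanup and stemming; stemDocument is the tm function meant here, and it relies on SnowballC under the hood:

```r
hs <- hs[nchar(hs) > 1]  # remove all one-letter words

library(SnowballC)
hs <- stemDocument(hs)   # stem words, e.g. mapping "opened" to "open"
```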
Exercise 8
Reload the original text file as in Exercise 1. Drop the first two lines and split the file into a character vector of four elements: the first element contains the text from lines 1 to 26, the next from lines 27 to 52, and so on.
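A possible sketch of this step (the URL is an assumption, and the code assumes the remaining file splits evenly into four 26-line blocks, as the exercise states):

```r
hs <- readLines("http://www.textfiles.com/stories/hansel.txt")  # URL assumed
hs <- hs[-(1:2)]                     # drop the first two lines

grp  <- ceiling(seq_along(hs) / 26)  # group labels: 1, 1, ..., 2, 2, ..., 4
docs <- sapply(split(hs, grp), paste, collapse = " ")
```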
Exercise 9
Convert the vector into a tm Corpus – a special object for a collection of documents. Repeat some of the earlier cleaning steps using tm_map, specifically: convert to lowercase, remove stop words, remove punctuation, and stem the four documents. Finally, print the cleaned content of the second document in the corpus.
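A hedged sketch, assuming docs is the four-element character vector built in Exercise 8:

```r
library(tm)

corpus <- Corpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)

writeLines(content(corpus[[2]]))  # cleaned text of the second document
```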
Exercise 10
Create a term-document matrix based on the corpus, and compute a simple correlation matrix between the four sections.
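For instance, assuming corpus is the cleaned corpus from Exercise 9:

```r
tdm <- TermDocumentMatrix(corpus)  # rows: terms, columns: the four sections
m   <- as.matrix(tdm)
cor(m)                             # 4 x 4 correlation matrix between sections
```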