As data becomes more ubiquitous, both the number of standards and the ways data can reach you in a messy state increase. To my surprise, many projects I’ve worked on turned out to need substantial text processing skills.
Base R has some powerful tools for manipulating strings, but specialized packages are also gaining momentum. stringr’s simple and consistent syntax makes it a strong alternative to the base R functions. It doesn’t hurt that it is written by R rockstar Hadley Wickham, who is also the author of other widely used packages, such as dplyr and ggplot2.
Solutions are available here.
Load (and install) the gapminder package. For a warm up, make a new data.frame based on the gapminder data with one row per country and two columns: the country and the continent it is classified under. Give the data frame a short, simple name.
Use a stringr function to find out what the average length of the country names is, as they appear in the data set.
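As a hint rather than a full solution, here is a minimal sketch of measuring string lengths with str_length(). The vector country_names is a hypothetical stand-in for the country column you built from gapminder.

```r
library(stringr)

# Hypothetical sample of names; the exercise uses the gapminder countries instead.
country_names <- c("Chad", "Mexico", "United States")

# str_length() returns the number of characters in each string.
name_lengths <- str_length(country_names)

# Averaging the per-name lengths gives the mean name length.
mean(name_lengths)
```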
Extract the first and last letter of each country’s name. Make a frequency plot for both. Here you can use base R functions.
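One possible way to approach this, sketched on a hypothetical sample vector: str_sub() accepts negative positions that count from the end of the string, and base R's table() and barplot() are one option for the frequency plot (my choice here, not necessarily the intended one).

```r
library(stringr)

# Hypothetical sample; swap in the gapminder country names.
country_names <- c("Chad", "Mexico", "Canada")

# Positive positions count from the left, negative ones from the right.
first_letters <- str_sub(country_names, 1, 1)
last_letters  <- str_sub(country_names, -1, -1)

# table() tabulates the frequencies; barplot() draws them with base graphics.
first_counts <- table(first_letters)
barplot(first_counts, main = "First letters")
barplot(table(last_letters), main = "Last letters")
```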
What countries have the word “and” as part of their name?
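A plain search for "and" would also catch names like Iceland, so it helps to anchor the pattern on word boundaries. A minimal sketch, using a hypothetical sample vector in place of the gapminder names:

```r
library(stringr)

# Hypothetical sample; "Iceland" contains the letters "and" but not the word.
country_names <- c("Trinidad and Tobago", "Iceland",
                   "Sao Tome and Principe", "Chad")

# \\b marks a word boundary, so only the standalone word "and" matches.
has_and <- str_detect(country_names, "\\band\\b")

country_names[has_and]
```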
Delete all instances of "." from the country names.
Use str_c to generate the vector c("mouse likes cat very much", "mouse likes cat very very much", "mouse likes cat very very very much").
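One way this could be done (there are others) relies on str_c() being vectorised: if one of its arguments is a vector of length three, the result has three elements. Pairing it with str_dup() to repeat "very " is my own choice here, not something the exercise prescribes.

```r
library(stringr)

# str_dup() repeats a string; with times = 1:3 it returns three variants.
repeats <- str_dup("very ", 1:3)

# str_c() pastes the pieces element-wise, recycling the length-one strings.
mouse <- str_c("mouse likes cat ", repeats, "much")
mouse
```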
Imagine you are creating an app to explore the Gapminder data; the tool you are using can only accommodate country names of 12 characters. Therefore, you decide to shorten the names from the right, such that if the country name is longer than 12 characters, you trim it to 11 and add a full stop. Example: “United States” becomes “United Stat.”. Use str_trunc(), then find a way to reach the same result without it.
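A minimal sketch of both routes, on a hypothetical sample vector. Note that the width argument of str_trunc() counts the ellipsis, so width = 12 with a one-character ellipsis keeps 11 characters of the name. The manual version with str_sub() and ifelse() is just one of several ways to get there.

```r
library(stringr)

# Hypothetical sample; "Netherlands" has 11 characters and should stay intact.
country_names <- c("United States", "Chad", "Netherlands")

# Route 1: str_trunc(); the total width of 12 includes the "." ellipsis.
with_trunc <- str_trunc(country_names, width = 12, side = "right", ellipsis = ".")

# Route 2: by hand; keep the first 11 characters of long names and append ".".
too_long <- str_length(country_names) > 12
manual <- ifelse(too_long,
                 str_c(str_sub(country_names, 1, 11), "."),
                 country_names)

with_trunc
manual
```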
sentences is a character vector of 720 sentences that is loaded into your environment when you load the stringr package. Extract all two-character words from it and plot their frequency.
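A rough sketch of one approach, assuming that "word" means a run of exactly two letters between word boundaries and that upper and lower case should be counted together. Both are my assumptions; note that this pattern would also pick up the "It" in a contraction such as "It's".

```r
library(stringr)

# sentences ships with stringr, so it is available after library(stringr).
# \\b[A-Za-z]{2}\\b matches exactly two letters surrounded by word boundaries.
two_letter <- str_extract_all(sentences, "\\b[A-Za-z]{2}\\b")

# str_extract_all() returns a list (one element per sentence); flatten it
# and lower-case the words so that "It" and "it" are counted together.
words <- str_to_lower(unlist(two_letter))

# Tabulate, sort from most to least frequent, and plot with base graphics.
freq <- sort(table(words), decreasing = TRUE)
barplot(freq, las = 2)
```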
Convert the names to lower case and count what characters are the most common in the country names overall.
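As a hint, splitting on the empty pattern breaks a string into its individual characters, which can then be tabulated. This is a small sketch on a hypothetical sample vector; with multi-word names the space would be counted as a character too, so you may want to drop it.

```r
library(stringr)

# Hypothetical sample; swap in the gapminder country names.
country_names <- c("Chad", "Cuba")

# Lower-case first, then split every name into single characters.
chars <- unlist(str_split(str_to_lower(country_names), ""))

# Count each character and sort from most to least common.
char_freq <- sort(table(chars), decreasing = TRUE)
char_freq
```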
Only one country has “x” in its name, congrats Mexico! “A” is the most used character. What is the country that takes this the furthest and has the most “a”s in its name?
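To count a character within each name, str_count() is one option. The sketch below uses a hypothetical sample vector so as not to give the answer away; keep in mind that which.max() returns only the first maximum, so ties need a separate check.

```r
library(stringr)

# Hypothetical sample; swap in the gapminder country names.
country_names <- c("Panama", "Ghana", "Peru")

# Lower-case first so that "A" and "a" are counted together.
a_counts <- str_count(str_to_lower(country_names), "a")

# which.max() gives the position of the first largest count.
country_names[which.max(a_counts)]
```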
(Image by Jan)