As data becomes more ubiquitous, both the number of standards and the number of ways data can reach you in a messy state increase. To my surprise, many of the projects I’ve worked on turned out to need substantial text-processing skills.
Base R has some powerful tools for manipulating strings, but specialized packages are also gaining momentum. stringr’s simple and consistent syntax makes it a strong alternative to the base R functions. It doesn’t hurt that it is written by R rockstar Hadley Wickham, who is also the author of other widely used packages such as dplyr and ggplot2.
Solutions are available here.
Exercise 1
Load (and install) the stringr and gapminder packages. As a warm-up, make a new data.frame based on the gapminder data with one row per country and two columns: the country’s name and the continent it is classified under. Name it simply df.
Exercise 2
Use a stringr function to find out what the average length of the country names is, as they appear in the data set.
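As a hint rather than a solution, here is a minimal sketch of how stringr counts characters, using a made-up toy vector (fruits is hypothetical and not part of gapminder). It only shows that str_length() works element by element and that the result can be summarised with ordinary base R functions.

```r
library(stringr)

# Hypothetical toy vector, not the gapminder data
fruits <- c("fig", "apple", "kiwi")

# str_length() is vectorised: it returns one character count per element
lengths <- str_length(fruits)
lengths        # 3 5 4

# The counts are plain numbers, so any base R summary works on them
mean(lengths)  # 4
```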
Exercise 3
Extract the first and last letter of each country’s name. Make a frequency plot for both. Here you can use the base R function table.
Exercise 4
What countries have the word “and” as part of their name?
Exercise 5
Delete all instances of "," and "." from the country names.
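One thing worth keeping in mind here: stringr patterns are regular expressions by default, and "." is a special character in a regex. The sketch below uses made-up strings (titles and dotted are hypothetical) and different characters from the exercise, just to illustrate the behaviour.

```r
library(stringr)

# Hypothetical example strings
titles <- c("Dr. Who?", "What?! No way!")

# "[?!]" is a regex character class: it matches either "?" or "!"
cleaned <- str_remove_all(titles, "[?!]")
cleaned  # "Dr. Who" "What No way"

dotted <- "a.b.c"

# An unescaped "." matches ANY character, so everything is removed
all_gone <- str_remove_all(dotted, ".")
all_gone   # ""

# fixed() treats the pattern as a literal string instead of a regex
dots_only <- str_remove_all(dotted, fixed("."))
dots_only  # "abc"
```

Inside a character class such as "[?!]" most special characters lose their regex meaning, which is often the shortest way to remove several literal characters at once.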
Exercise 6
Use str_dup and str_c to generate the vector c("mouse likes cat very much", "mouse likes cat very very much", "mouse likes cat very very very much").
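If the two functions are new to you, this small sketch shows how they behave on a different, made-up example (the laughs vector is hypothetical). Note that str_dup() repeats the string exactly as given, so any separator you need has to be part of the string you duplicate.

```r
library(stringr)

# str_dup() repeats a string; `times` is vectorised
laughs <- str_dup("ha", 1:3)
laughs  # "ha" "haha" "hahaha"

# str_c() joins its arguments element-wise, recycling length-one inputs
quotes <- str_c("She said: ", laughs, "!")
quotes  # "She said: ha!" "She said: haha!" "She said: hahaha!"
```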
Exercise 7
Imagine you are creating an app to explore the Gapminder data; the tool you are using can only accommodate country names of 12 characters. Therefore, you decide to shorten the names from the right, such that if the country name is longer than 12 characters, you trim it to 11 and add a full stop. Example: “United States” becomes “United Stat.”. Use str_trunc(), then find a way to reach the same result without it.
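For orientation, here is a minimal sketch of how str_trunc() treats its width and ellipsis arguments, on a made-up word rather than the country names. The key point is that the ellipsis counts towards the total width, and that strings already short enough are returned unchanged.

```r
library(stringr)

word <- "extraordinary"  # 13 characters

# Defaults: side = "right", ellipsis = "..." (the ellipsis counts towards width)
default_trunc <- str_trunc(word, width = 10)
default_trunc  # "extraor..."

# A custom one-character ellipsis leaves room for 9 characters of the original
custom_trunc <- str_trunc(word, width = 10, ellipsis = "~")
custom_trunc   # "extraordi~"

# Strings that already fit are left untouched
short_trunc <- str_trunc("short", width = 10)
short_trunc    # "short"
```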
Exercise 8
sentences is a character vector of 720 sentences that is loaded into your environment when you load the stringr package. Extract all two-character words from it and plot their frequency.
Exercise 9
Convert the names to lower case and count what characters are the most common in the country names overall.
Exercise 10
Only one country has “x” in its name, congrats Mexico! “A” is the most used character. What is the country that takes this the furthest and has the most “a”s in its name?
(Image by Jan)