Pull the Right Strings with stringr: Exercises

By providing a set of wrappers to existing functions, the stringr package allows for simple, consistent and efficient manipulation of strings in R. Even though there are some more basic packages that offer string-related functions, you might find yourself in need of a more complete and straightforward solution for handling strings in R.

With a simple and consistent syntax, stringr provides some very convenient functions for pattern matching, character manipulation, whitespace handling and more. The full reference of the package can be found here.
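As a rough illustration of that consistent syntax, here is a minimal sketch on made-up data (the fruits vector is hypothetical and not part of the exercises). It assumes stringr is installed and only shows that the functions share the str_ prefix and take the string as their first argument.

```r
library(stringr)

# Hypothetical sample data, not part of the exercise set
fruits <- c("apple pie", "banana split", "cherry")

# Every function takes the character vector first and is vectorised over it
lengths_chr <- str_length(fruits)        # number of characters per string
has_an      <- str_detect(fruits, "an")  # does each string contain "an"?
first_three <- str_sub(fruits, 1, 3)     # characters 1 to 3 of each string
```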

Please find below a set of exercises that will help you practice a variety of stringr functions. The focus is on practical operations that data analysts are required to perform on a daily basis. Answers to the exercises are available here. And, don’t forget to check out our other exercise sets on the stringr package by following the stringr tag.

For the following exercises we will use this data:

addresses <- c("14 Pine Street, Los Angeles", "152 Redwood Street, Seattle", "8 Washington Boulevard, New York")

products <- c("TV ", " laptop", "portable charger", "Wireless Keybord", " HeadPhones ")

long_sentences <- stringr::sentences[1:10]

field_names <- c("order_number", "order_date", "customer_email", "product_title", "amount")

employee_skills <- c("John Bale (Beginner)", "Rita Murphy (Pro)", "Chris White (Pro)", "Sarah Reid (Medium)")

Exercise 1
Normalize the addresses vector by replacing capitalized letters with lower-case ones.

Exercise 2
Pull only the numeric part of the addresses vector.

Exercise 3
Split the addresses vector into two parts: address and city. The result should be a matrix.

Exercise 4
Now try to split the addresses vector into three parts: house number, street and city. The result should be a matrix.
Hint: use a regex lookbehind assertion

Exercise 5
In the long_sentences vector, for sentences that start with the letter “T” or end with the letter “s”, show the first or last word respectively. If the sentence both starts with a “T” and ends with an “s”, show both the first and the last words. Remember that the actual last character of a sentence is usually a period.

Learn more about string manipulation with stringr in the online course Learn R by Intensive Practice.

Exercise 6
Show only the first 20 characters of all sentences in the long_sentences vector. To indicate that you removed some characters, use two consecutive periods at the end of each sentence.

Exercise 7
Normalize the products vector by removing all unnecessary whitespace (from the start, the end and the middle), and by capitalizing all letters.

Exercise 8
Prepare the field_names for display by replacing all of the underscore symbols with spaces and by converting the names to title case.

Exercise 9
Align all of the field_names so that they have equal length, by adding whitespace to the beginning of the relevant strings.

Exercise 10
In the employee_skills vector, look for employees that are defined as “Pro” or “Medium”. Your output should be a matrix that has the employee name in the first column, and the skill level (without parentheses) in the second column. Employees that are not qualified should get missing values in both columns.




Regular Expressions Fundamentals – Exercises

Regular expressions are one of those skills you need to drill and drill until they become second nature. You never know when you will need them, just that you WILL need them.

In this exercise set, we will go through some of the fundamentals relying on base R only. If you are already an expert, less than five minutes should suffice. If you have never done this before, allow a couple of hours. All exercises are inspired by real issues.
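For readers who have never done this before, the following is a minimal sketch of the base R functions such exercises typically rely on. The words vector is made-up sample data, and the choice of functions is an assumption about what is useful here, not the official solutions.

```r
# Hypothetical sample data, not taken from the exercises
words <- c("cat", "Dog", "bird42", "fish")

# grepl() returns TRUE/FALSE for each element that matches the pattern
has_digit <- grepl("[0-9]", words)

# grep(value = TRUE) returns the matching elements themselves
starts_upper <- grep("^[[:upper:]]", words, value = TRUE)

# gsub() replaces every match; sub() would replace only the first one
no_digits <- gsub("[0-9]+", "", words)

# regexpr() locates the first match and regmatches() extracts its text;
# elements without a match are dropped from the result
digits <- regmatches(words, regexpr("[0-9]+", words))
```

Note that regmatches() silently drops non-matching elements, so its result can be shorter than the input vector.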

 

Solutions are available here. If you came up with a different (correct) answer than those listed, please feel free to share in the comments.

Exercise 1

In R's built-in data set islands, extract all area names containing the letter c.

Exercise 2

Extract any sentence mentioning “boy” in this string:

Sam goes to school. Sam comes home and studies. Sam is a good boy.

Exercise 3

Extract the sentence about the “quick brown fox” from the following clip:

A favorite copy set by English teachers, for their pupils, is the following sentence. This is because it contains every letter of the alphabet: ‘A quick brown fox jumps over the lazy dog.’

Exercise 4

Load the stringr package and analyze the sentences data set that comes with the package. Ignoring the first word, which sentences have a word that starts with an upper-case letter?

Learn more about Text Analysis in the online course Text Analytics/Text Mining Using R. In this course, you will learn how to create, analyze and finally visualize your text-based data source. Having all the steps easily outlined will be a great reference source for your future work.

Exercise 5

Extract all elements starting with either http or https from this vector.

c("www.dogman.com", "http://rotterdam.com", "https://facebook.com", 
  "httpx://sims.com")

Exercise 6

Extract the street address from:

“Gilroy Plant Place 777 Morello Ave.”

Exercise 7

Get rid of any word containing a number:

“a2c if3 clean 001mn10 string asw21”

Exercise 8
Transform the data below, extracting the two years appearing in each string element. So, as an example, the first element becomes “from 1993 to 2003”.

stri = c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
         "AT0ILL10000700500dymax.1-1-1990.31-12-2011", 
         "AT0PIL10000700500dymax.1-1-1992.31-12-2011",
         "AT0SON10000700100dymax.1-1-1990.31-12-2011",
         "AT0STO10000700100dymax.1-1-1992.31-12-2006",  
         "AT0VOR10000700500dymax.1-1-1991.31-12-2011",
         "AT110020000700100dymax.1-1-1993.31-12-2008",
         "AT2HE190000700100dymax.1-1-1993.31-12-2000", 
         "AT2KA110000700500dymax.1-1-1991.31-12-2010", 
         "AT2KA410000700500dymax.1-1-1991.31-12-2011")  

Exercise 9

Read the following into R (be careful with the parentheses and quotation marks) and remove all non-alphabetic characters from it.

I think, sometimes, that my use of commas, and, occasionally, exclamation marks, can be excessive. Whenever I add a word or expression, not necessary, to the sentence, just like I did with the “not necessary” and like I am doing right now, I always include these words, well maybe not always, usually include these inserts between commas; so, basically, I enjoy writing long sentences, joined with lots of commas and, frequently, semi-colons and, often, colons (and have been rather prone to using brackets, as well).

Exercise 10

Extract all two-character words from the sentences data and plot their frequency.

 

(Cartoon by xkcd.com)




Stringr Basic Functions: Exercises


As data becomes more ubiquitous, both the number of standards and the number of ways data can reach you in a messy state increase. I’ve found that many projects I’ve worked on, to my surprise, turned out to need a substantial amount of text processing skills.

Base R has some powerful tools to manipulate strings, but specialized packages are also gaining momentum. stringr’s simple and consistent syntax makes it a strong alternative to base R functions. It doesn’t hurt that it is written by R rockstar Hadley Wickham, who is also the author of other packages in wide use, such as dplyr and ggplot2.

Solutions are available here.

Exercise 1
Install (if needed) and load the stringr and gapminder packages. As a warm-up, make a new data.frame based on the gapminder data with one row per country and two columns: the name of the country and the continent it is classified under. Name it simply df.

Exercise 2
Use a stringr function to find out what the average length of the country names is, as they appear in the data set.

Exercise 3
Extract the first and last letter of each country’s name. Make a frequency plot for both. Here you can use the base R function table.

Exercise 4
Which countries have the word “and” as part of their name?

Exercise 5
Delete all instances of "," and "." from the country names.

Exercise 6
Use str_dup and str_c to generate the vector c("mouse likes cat very much", "mouse likes cat very very much", "mouse likes cat very very very much").

Exercise 7
Imagine you are creating an app to explore the Gapminder data; the tool you are using can only accommodate country names of 12 characters. Therefore, you decide to shorten the names from the right, such that if the country name is longer than 12 characters, you trim it to 11 and add a full stop. Example: “United States” becomes “United Stat.”. Use str_trunc(), then find a way to reach the same result without it.

Exercise 8
sentences is a character vector of 720 sentences that loads to your environment when you load the stringr package. Extract all two-character words from it and plot their frequency.

Exercise 9
Convert the names to lower case and count what characters are the most common in the country names overall.

Exercise 10
Only one country has “x” in its name, congrats Mexico! “A” is the most used character. What is the country that takes this the furthest and has the most “a”s in its name?

(Image by Jan)




Text Data Wrangling: Exercises

In a previous exercise set, we practiced retrieving data from Twitter. In this exercise, we start getting comfortable with manipulating text data.

We will start by refreshing our memory on how to use some base-R functions, then we start using the tm package.
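As a refresher, here is a minimal sketch of that base R workflow on a made-up sentence. The txt string is hypothetical and the steps are one possible ordering, not the official solutions.

```r
# Hypothetical toy text, standing in for a downloaded story
txt <- "The cat saw the dog. The dog ran!"

# Split on spaces; strsplit() returns a list, so take its first element
words <- strsplit(txt, " ")[[1]]

# Lowercase first, so that the letter pattern below only needs a-z
words <- tolower(words)

# Drop every character that is not a lowercase letter, then drop empties
words <- gsub("[^a-z]", "", words)
words <- words[words != ""]

# Count occurrences and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
```

Lowercasing before stripping non-letters matters here: with the order reversed, the pattern [^a-z] would delete the capital letters as well.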

Answers to the exercises are available here.

Exercise 1
Use readLines to download the short story “Hansel and Gretel” from textfiles.com. Save it to an object called hs.

Exercise 2
Use strsplit() to convert hs to a vector, where each element is a single word.

Exercise 3
Make all the words lowercase.

Exercise 4
Using regular expressions, get rid of punctuation (or non-letters). Then, notice that some elements are empty. Get rid of those. Can you see why it can be important to take these steps in the right order?

Exercise 5
Create a table counting the number of times each word appears. Then, make a frequency plot for the 15 most common words.

Exercise 6
Notice that the most common words “the”, “a”, “and”, “to”, etc. do not shed much light on the content of the story, since these are going to appear in any English text. Install and load the tm package and remove all the “stop words” provided by stopwords("en"). Since we’ve already removed punctuation from our vector, we need to do the same for the stop words vector stopwords("en"). Now repeat the steps from exercise 5.

Exercise 7
Sort hs in alphabetical order and take a look at the vector. First, you can see that, due to a typo in the original text file, there is one element which is a meaningless letter. Simply get rid of all one-letter words. Second, you’ll see that some words are, in essence, almost the same, such as open and opened or woodcutter and woodcutters. Use a tm function to stem the remaining words. Note that you first need to load the SnowballC package. Repeat the plot.

Exercise 8
Reload the original text file as in exercise 1. Drop the first two lines and separate the file into a character vector of four elements. The first element will contain the text from line 1 to 26, the next 27 to 52, etc.

Exercise 9
Convert the vector into a tm Corpus – a special object for a collection of documents. Repeat some of the earlier cleaning steps by relying on tm_map, specifically: remove capital letters, remove stop words, remove punctuation, and stem the four documents. Finally, print the cleaned content of document two ([[2]][1]) in the corpus.

Exercise 10
Create a term-document matrix based on the corpus, and get a simple correlation matrix between the four sections.




More String Hacking with Regex and Rebus

For a beginner in R or any other language, regular expressions might seem daunting. The rebus package in R lowers the barrier for common regular expression tasks. It is useful for beginners, and even for advanced users, because it expresses most common regex skills in a more intuitive, if more verbose, way. Check out the package and try these exercises to test your knowledge.
Load stringr/stringi as well for this set of exercises. I encourage you to do this and this before working on this set.
Answers are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Suppose you have a vector
x <- c("stringer","stringi","rebus","redbus")

Use rebus to find the strings starting with st. Hint: use START from rebus.

Exercise 2

Use the same string vector and find the strings that end with bus.

Exercise 3
You have a vector like
m <- c("aba","aca","abba","accdda")

Find the strings that start and end with a and have a single character in between.
Hint: use ANY_CHAR

Exercise 4
y <- c("brain","brawn","rain","train")

Find all the strings that start with br and end with n.
Hint: use any_char with hi=Inf to build the regex

Exercise 5
Use the same vector as in the previous exercise and find the strings starting with br or tr.
Hint: use or

Exercise 6
Now we turn our attention to character classes. If you are familiar with character classes in regex, you will find them pretty easy with rebus, and if you are just starting with regex, you might find them easier to remember with rebus.
Suppose you have a vector
l <- c("Canada","america","france")

Find the strings with C or m in them. Your answer should be Canada and america.

Exercise 7
From the string 123abc, find the digits using rebus.

Exercise 8
Create a character class for vowels and find all the vowels in the vector
vow <- c("blue","sue","CLUE","TRUE")

Exercise 9
Find the characters other than vowels in the above vector.

Exercise 10
Now create a new vector
vow1 <- c("blue","sue","CLUE","TRUE","aue")

Find the string that is made up of only vowels.




Hacking Strings with stringi

In the last set of exercises, we worked on the basic concepts of string manipulation with stringr. In this one, we will go further into the string-hacking universe and learn how to use the stringi package. Note that stringi acts as the backend of stringr but has many more useful string manipulation functions, and one should really know stringi for text manipulation.

Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Create two strings:
c1 <- "a quick brown fox jumps over a lazy dog"
c2 <- "a quick brown fox jump over a lazy dog"
stringi comes with many functions, and wrappers around functions, to check whether two strings are equivalent. Check whether c1 and c2 are equivalent with stri_compare and %s<=%, and try to reason about the answers.

Exercise 2

How would you find the number of words in c1 and c2? It's pretty easy with stringi. Find out how.

Exercise 3

Similarly, how would you find all the words in c1 and c2? Again, it's pretty straightforward with stringi. Find out how.

Exercise 4
Let's say you have a vector that contains famous mathematicians
genius <- c("Godel","Hilbert","Cantor","Gauss", "Godel", "Fermet","Gauss")
Find the duplicates.

Exercise 5

Find the number of characters in each element of the genius vector using a stri function.

Exercise 6
It's important to keep the characters of a set of strings in the same encoding. Suppose you have a vector
Genius1 <- c("Godel","Hilbert","Cantor","Gauss", "Gödel", "Fermet","Gauss")
Godel and Gödel are the same person, but the characters used to write the name differ, so if you compare the two strings in a naive way they will be treated as different. For the sake of consistency, we should really translate them to a common form. Find out how.

Hint: use the “Latin-ASCII” transliterator in a stri_trans* function.

Exercise 7
How do we collapse the LETTERS vector in R so that it looks like this?
“A-B_C-D_E-F_G-H_I-J_K-L_M-N_O-P_Q-R_S-T_U-V_W-X_Y-Z_”

Exercise 8
Suppose you have a string of words like the c1 we created earlier. You might want to know the start and end index of the first word and of the last word. The start index of the first word and the end index of the last word are obvious, but the end index of the first word and the start index of the last word are not. How would you find them?

Exercise 9
Suppose I have a string
pun <- "A statistician can have his head in an oven and his feet in ice, and he will say that on the average he feels fine"
Suppose I want to replace statistician and average with mathematician and median in the string pun. How can I achieve that?
Hint: use a stri_replace* function.

Exercise 10
My string x is like
x <- "I AM SAM. I AM SAM. SAM I AM"
Replace the last SAM with ADAM.




Hacking strings with stringr


This is the first in a set of exercises on string manipulation with stringr.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Use a stringr function to merge these 3 strings.
x <- "I AM SAM. I AM SAM. SAM I AM"
y <- "THAT SAM-I-AM! THAT SAM-I-AM! I DO NOT LIKE THAT SAM-I-AM!"
z <- "DO WOULD YOU LIKE GREEN EGGS AND HAM?"

Exercise 2

Now use a vector that contains x, y, z and NA, and make it a single sentence using paste. Do the same with the function you used for Exercise 1. Can you spot the difference?

Exercise 3

Install the babynames dataset and find the vector of lengths of the baby names using a stringr function. You may wonder why not use nchar, which can do the same; try to find out the difference and let me know in the comments.

Exercise 4

We often use substr to get part of a string; in the stringr world there is a much more powerful function that does almost the same thing. Create a string called name containing your name.
Use str_sub to get the last character and the last 5 characters.

Exercise 5

In the row names of the mtcars dataset, find all cars of the brand Merc.

Exercise 6

Using the same mtcars row names, find the total number of times “e” appears in them.

Exercise 7

Suppose you have a string like this
j <- "The_quick_brown_fox_jumps_over_the_lazy_dog"
Split it into words using a stringr function.

Exercise 8

On the same string, I need the first word split off but the rest left intact. Help me achieve that.

Exercise 9

Now, on the same string j:
a> I want the first “_” replaced by “–”
b> I want all the “_” replaced by “–”

Exercise 10

Often you don’t want NA to appear when you do some string manipulation, but sometimes it is necessary to replace NA with a character string (rather than remove it). stringr provides a useful tool for that.
Suppose I have a vector like this:
na_string_vec <- c("The_quick_brown_fox_jumps_over_the_lazy_dog", NA)

How can you turn the NA into a character string?




Character Functions (Advanced)

This set of exercises will help you improve your skills with character functions in R. Most of the exercises are related to text mining, a technique that analyses text using statistics. If you find them interesting, I would suggest checking the library tm, which includes functions designed for this task. There are many applications of text mining; a pretty popular one is the ability to associate a text with its author, which is how J.K. Rowling (the Harry Potter author) was caught publishing a new novel series under an alias. Before proceeding, it might be helpful to look over the help pages for nchar, tolower, toupper, grep, sub and strsplit. Take a look at the library stringr and the functions it includes, such as str_sub.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Before starting the set of exercises, run the following lines of code:

if (!'tm' %in% installed.packages()) install.packages('tm')
library(tm)
txt = system.file("texts", "txt", package = "tm")
ovid = VCorpus(DirSource(txt, encoding = "UTF-8"),
               readerControl = list(language = "lat"))
# TEXT must be created before OVID, which is built from it
TEXT = lapply(ovid[1:5], as.character)
OVID = c(data.frame(text = unlist(TEXT), stringsAsFactors = FALSE))
TEXT1 = TEXT[[4]]

Exercise 1

Delete all the punctuation marks from TEXT1.

Exercise 2

How many letters does TEXT1 contain?

Exercise 3

How many words does TEXT1 contain?

Exercise 4

What is the most common word in TEXT1?

Exercise 5

Get an object that contains all the words with at least one capital letter. (Make sure the object contains each word only once.)

Exercise 6

Which are the 5 most common letters in the object OVID?

Exercise 7

Which letters of the alphabet are not in the object OVID?

Exercise 8

In the OVID object, there is a character from the popular sitcom ‘FRIENDS’. Who is he/she? There were six main characters (Chandler, Phoebe, Ross, Monica, Joey, Rachel).

Exercise 9

Find the line where this character is mentioned.

Exercise 10

How many words finish with a vowel, how many with a consonant?




Introduction to Text Mining Exercises

For those in the consulting industry, it is very common to be assigned to documentation analysis, especially at the beginning of new projects. Because documentation is often very extensive, text mining is usually a great tool. In this exercise set we will use some R packages dedicated to this kind of task.

Answers to the exercises are available here.

Exercise 1
Create a vector of statements. To do that, use the statements from the introduction of the r-exercises set “Data Science for Operational Excellence (Part-5)”.

Exercise 2
Transform this vector into a data frame. Create a first column called Item that numbers the statements in sequence.

Exercise 3
How many times does the string predict appear?

Exercise 4
In which sentences does predict appear?

Exercise 5
In which Items does predict appear?

Exercise 6
Count the number of times the word predict appears. Count the number of times the word forecast appears. Calculate the proportion of predict to forecast in this text.

Exercise 7
Break those sentences into words and create a new data frame in which the most frequently repeated word appears in the first row.

Exercise 8
As you can see, there are many words that are irrelevant, like “to” and “the”. Remove these words using data(stop_words).

Exercise 9
Download book 768 using library(gutenbergr).

Exercise 10
Ignoring the stop words, list the words that appear, in descending order of frequency.




Character Functions (Intermediate)

This set of exercises will help you learn and test your skills with character functions in R. Before proceeding, it might be helpful to look over the help pages for nchar, tolower, toupper, grep, sub and strsplit. Take a look at the library stringr and the functions it includes, such as str_split and str_split_fixed. Finally, check some basic examples of regular expressions in R.
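Before opening the help pages, a quick look at what a few of those functions return may be useful. This is a minimal sketch on a made-up cities vector (hypothetical, not the data used in the exercises below).

```r
# Hypothetical sample data, not the vector used in the exercises
cities <- c("new york", "los angeles", "boston")

n_chars <- nchar(cities)           # characters per element, spaces included
shouted <- toupper(cities)         # upper-case copy of each element
joined  <- sub(" ", "_", cities)   # sub() replaces only the first match
pieces  <- strsplit(cities, " ")   # list with one vector of words per element

# sapply() over the list gives the number of words in each element
n_words <- sapply(pieces, length)
```

strsplit() always returns a list, even for a single string, which is why the word counts are computed with sapply().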

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Hello World! Using a single line of code, print a string that says ‘Hello world! I’m ready to start.’ with each phrase on a different line. Note: \n is how you tell R that you want to start a new line for the text that follows.

Exercise 2
Run the following line of code:
states = rownames(USArrests)

The variable states holds the names of all the US states. In a variable called States, save all the state names in lower case.

We will use States in the following exercises.

Exercise 3

  • Get all the elements in states that start with the letter m.
  • Get all the elements in states that start with the letter m including elements like New Mexico where the second word starts with m.
  • Get all the elements in states

Exercise 4

Use the function grep to get all the states that have only one word.

Exercise 5

Using nchar, find how many characters the longest element in states has.

Which elements in states

Exercise 6

White spaces shouldn’t count this time. Save in an object called Count_no_spaces the number of characters of every element in states, excluding the white spaces.

Exercise 7
Excluding the white spaces, find the longest and shortest elements in states.

Exercise 8

Using the function grep, get all the states that have at least two words and save them in States_2words.

Exercise 9

Load the library stringr and use the function str_split_fixed to save the first word of every element of States_2words in one vector and the second word in another.

Exercise 10

Use the function paste to create a vector with all the possible combinations of a first word with a second word, and use the function unique to make sure that you get every possible combination without repeating an element. Note: You should have 40 elements in total.