Regular expressions are one of those skills you need to drill and drill until they become second nature. You never know when you will need them, just that you WILL need them.
In this exercise set, we will go through some of the fundamentals, relying on base R only. If you are already an expert, less than five minutes should suffice. If you have never done this before, allow a couple of hours. All exercises are inspired by real issues.
Solutions are available here. If you came up with a different (correct) answer than those listed, please feel free to share in the comments.
Using R's inbuilt data-set islands, extract all area names containing the letter
Extract any sentence mentioning “boy” in this string:
Sam goes to school. Sam comes home and studies. Sam is a good boy.
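One way to approach this in base R is sketched below. It assumes that a sentence is any run of characters ending in a period, and the names txt, hits and boy_sentences are illustrative only. Note that the pattern would also pick up words that merely contain "boy", such as "boyhood".

```r
txt <- "Sam goes to school. Sam comes home and studies. Sam is a good boy."

# Match a run of non-period characters containing "boy", up to the closing period
hits <- regmatches(txt, gregexpr("[^.]*boy[^.]*\\.", txt))[[1]]

# Each match keeps the space that followed the previous sentence, so trim it
boy_sentences <- trimws(hits)
print(boy_sentences)
```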
Extract the sentence about the “quick brown fox” from the following clip:
A favorite copy set by English teachers, for their pupils, is the following sentence. This is because it contains every letter of the alphabet: ‘A quick brown fox jumps over the lazy dog.’
Load the stringr package and analyze the sentences data-set that comes with the package. Ignoring the first word, which sentences have a word that starts with an upper-case letter?
Extract all clips starting with either http or https from this vector.
c("www.dogman.com", "http://rotterdam.com", "https://facebook.com", "httpx://sims.com")
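A possible base-R sketch follows. It assumes the intent is to keep only elements whose scheme is exactly http or https, so the pattern requires "://" right after the optional "s"; the names urls and web_urls are illustrative.

```r
urls <- c("www.dogman.com", "http://rotterdam.com",
          "https://facebook.com", "httpx://sims.com")

# "^" anchors at the start, "s?" makes the s optional,
# and "://" excludes look-alikes such as "httpx"
web_urls <- grep("^https?://", urls, value = TRUE)
print(web_urls)
```

Without the "://" part, the pattern "^https?" would also match "httpx://sims.com", since that element begins with "http" too.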
Extract the street address from:
“Gilroy Plant Place 777 Morello Ave.”
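One hedged reading of this task is that the street address starts at the house number and runs to the end of the string. Under that assumption, a minimal base-R sketch (with the illustrative names address and street) could look like this:

```r
address <- "Gilroy Plant Place 777 Morello Ave."

# Assumption: the address begins at the first digit and runs to the end of the string
street <- regmatches(address, regexpr("[0-9]+.*$", address))
print(street)
```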
Get rid of any word containing a number:
“a2c if3 clean 001mn10 string asw21”
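A simple sketch of one approach is to split the string on spaces and drop every word in which grepl() finds a digit. It assumes words are separated by single spaces; the names x, words and cleaned are illustrative.

```r
x <- "a2c if3 clean 001mn10 string asw21"

# Split into words on single spaces
words <- strsplit(x, " ", fixed = TRUE)[[1]]

# Keep only the words that contain no digit, then glue them back together
cleaned <- paste(words[!grepl("[0-9]", words)], collapse = " ")
print(cleaned)
```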
Transform the data below, extracting the two years appearing in each string element. So, as an example, the first element becomes “from 1993 to 2003”.
stri = c("AT0ACH10000700100dymax.1-1-1993.31-12-2003", "AT0ILL10000700500dymax.1-1-1990.31-12-2011", "AT0PIL10000700500dymax.1-1-1992.31-12-2011", "AT0SON10000700100dymax.1-1-1990.31-12-2011", "AT0STO10000700100dymax.1-1-1992.31-12-2006", "AT0VOR10000700500dymax.1-1-1991.31-12-2011", "AT110020000700100dymax.1-1-1993.31-12-2008", "AT2HE190000700100dymax.1-1-1993.31-12-2000", "AT2KA110000700500dymax.1-1-1991.31-12-2010", "AT2KA410000700500dymax.1-1-1991.31-12-2011")
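The transformation can be sketched with capture groups and back-references in sub(). The example below uses only the first two elements of the vector to stay short, and assumes every element ends in two dates of the form d-m-yyyy separated by a period; the names stri_sample and years are illustrative.

```r
stri_sample <- c("AT0ACH10000700100dymax.1-1-1993.31-12-2003",
                 "AT0ILL10000700500dymax.1-1-1990.31-12-2011")

# Group 1: the four digits before the period that separates the two dates
# Group 2: the four digits at the very end of the string
years <- sub("^.*-([0-9]{4})\\..*-([0-9]{4})$", "from \\1 to \\2", stri_sample)
print(years)
```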
Read the following into R (be careful with the parentheses and quotation marks) and remove all non-alphabetic characters from it.
I think, sometimes, that my use of commas, and, occasionally, exclamation marks, can be excessive. Whenever I add a word or expression, not necessary, to the sentence, just like I did with the “not necessary” and like I am doing right now, I always include these words, well maybe not always, usually include these inserts between commas; so, basically, I enjoy writing long sentences, joined with lots of commas and, frequently, semi-colons and, often, colons (and have been rather prone to using brackets, as well).
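The removal step can be sketched on a short excerpt of the passage using a negated POSIX character class. Whether spaces count as characters to remove is open to interpretation, so both variants are shown; the names excerpt, letters_only and letters_and_spaces are illustrative.

```r
excerpt <- "colons (and have been rather prone to using brackets, as well)."

# Strict reading: drop everything that is not a letter, spaces included
letters_only <- gsub("[^[:alpha:]]", "", excerpt)

# Looser reading: keep spaces so the words stay separated
letters_and_spaces <- gsub("[^[:alpha:] ]", "", excerpt)

print(letters_only)
print(letters_and_spaces)
```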
Extract all two-character words from the sentences data and plot their frequency.
(Cartoon by xkcd.com)