The Internet is full of interesting data; there’s no doubt about it. Some sites, such as Twitter, provide users with systematic access through an API, around which some neat R packages have been built. In this exercise set, we practice much more general techniques for extracting (scraping) data from the web directly, using the rvest package.
Note that it is useful to have some basic understanding of the elements of XML, such as tags and their attributes, in order to become an effective web scraper. A useful tool for identifying relevant tags quickly is SelectorGadget, which is available as an extension to the Chrome browser. Regular expression skills will always come in handy.
Solutions are available here.
Install and load the rvest package. Use read_html() to read in this webpage, which lists and links to lecture notes for the MIT course Introduction to Algorithms, as an R object. Name the object ln_page.
Using html_nodes(), extract all links from ln_page and save them as ln_links. It might be helpful to first read up on html links on w3schools.com.
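As a rough illustration of reading a page and selecting its links, here is a minimal sketch. It parses a small, made-up HTML string instead of the real lecture-notes page, so the markup and file names are assumptions; in the exercise you would pass the page URL to read_html() instead.

```r
library(rvest)

# Hypothetical stand-in for the lecture-notes page (not the real markup)
ln_page <- read_html('
  <html><body>
    <a href="lecture-notes/lec1.pdf">Lecture 1</a>
    <a href="lecture-notes/lec2.pdf">Lecture 2</a>
    <a href="/about">About</a>
  </body></html>')

# "a" is the CSS selector for anchor (link) tags
ln_links <- html_nodes(ln_page, "a")

# Number of links found
length(ln_links)
```

Recent versions of rvest also offer html_elements(), which can be used in the same way.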
Now extract the text of the links into ln_links_text, and the href attribute, which defines where they lead, into ln_links_path. What is the structure of the objects you extracted?
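One possible way to pull out the link text and the href attribute is sketched below, again on a made-up HTML snippet (the links shown are assumptions, not the real page content). html_text() returns the text inside each node, and html_attr() returns the value of a named attribute.

```r
library(rvest)

# Hypothetical page with three links (illustration only)
ln_page <- read_html('
  <html><body>
    <a href="lecture-notes/lec1.pdf">Lecture 1</a>
    <a href="lecture-notes/lec2.pdf">Lecture 2</a>
    <a href="/about">About</a>
  </body></html>')
ln_links <- html_nodes(ln_page, "a")

# Visible text of each link
ln_links_text <- html_text(ln_links)

# Target of each link, taken from the href attribute
ln_links_path <- html_attr(ln_links, "href")

# Inspect the structure of both objects
str(ln_links_text)
str(ln_links_path)
```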
Turns out both are simply character vectors. Knowing that the lecture notes are all in PDF format, keep only the paths that end in .pdf and save them as ln_links_path_pdf. Print some of the paths to the console.
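Filtering a character vector of paths by file extension can be done with a regular expression. The sketch below uses base R only; the paths are invented for illustration.

```r
# Hypothetical paths, as they might come out of html_attr(..., "href")
ln_links_path <- c("lecture-notes/lec1.pdf",
                   "/about",
                   "lecture-notes/lec2.pdf",
                   "pdf-guide.html")

# "\\.pdf$" matches a literal ".pdf" at the very end of the string,
# so "pdf-guide.html" is not picked up by accident
ln_links_path_pdf <- ln_links_path[grepl("\\.pdf$", ln_links_path)]

# Print the first few paths
head(ln_links_path_pdf)
```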
Notice that the paths are relative (not absolute). If we want to download the PDFs, we first need to turn them into absolute URLs.
Use a loop and the download.file() function to download at least two of the PDFs. Notice that you first need to decide what the files will be called on your hard drive (the destfile argument) and, of course, define your working directory.
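The following sketch shows one way to build absolute URLs from relative paths and download them in a loop. The base URL, the paths, and the helper download_pdfs() are all hypothetical; replace them with the real course address and your own extracted paths. Files are written to tempdir() here rather than the working directory, purely to keep the example tidy.

```r
# Hypothetical base URL and relative paths (illustration only)
base_url <- "https://example.com/course/"
ln_links_path_pdf <- c("lecture-notes/lec1.pdf", "lecture-notes/lec2.pdf")

# Prefix each relative path with the base URL to get absolute URLs
pdf_urls <- paste0(base_url, ln_links_path_pdf)

# Name each local file after the last component of its path
dest_files <- file.path(tempdir(), basename(ln_links_path_pdf))

# Hypothetical helper: download each URL to its matching destination file
download_pdfs <- function(urls, destfiles) {
  for (i in seq_along(urls)) {
    # mode = "wb" keeps binary files such as PDFs intact on Windows
    download.file(urls[i], destfile = destfiles[i], mode = "wb")
  }
  invisible(destfiles)
}

# Not run here, because example.com does not host these files:
# download_pdfs(pdf_urls[1:2], dest_files[1:2])
```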
Even though you will now be busy studying algorithms, you still don’t want to miss out on new exercise sets on R-exercises.com. So why not write a script that checks the date of the last post? Using rvest, extract the .entry-time html nodes.
Dissect the object from the last exercise. Find how many days it’s been since the last post on R-exercises.com.
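A minimal sketch of the date check follows. It assumes that each .entry-time node is a time tag carrying an ISO 8601 datetime attribute; this markup is a guess made for illustration, so inspect the real page source before relying on it. The helper days_since_last_post() is likewise hypothetical.

```r
library(rvest)

# Hypothetical markup for two post timestamps (assumed, not the real page)
page <- read_html('
  <html><body>
    <time class="entry-time" datetime="2017-06-07T18:00:29+00:00">7 June 2017</time>
    <time class="entry-time" datetime="2017-06-05T16:00:10+00:00">5 June 2017</time>
  </body></html>')

# ".entry-time" selects every node with class "entry-time"
entry_times <- html_nodes(page, ".entry-time")

# Keep the first 10 characters (YYYY-MM-DD) of each datetime attribute
post_dates <- as.Date(substr(html_attr(entry_times, "datetime"), 1, 10))

# Days between a reference date (today by default) and the newest post
days_since_last_post <- function(dates, today = Sys.Date()) {
  as.numeric(today - max(dates))
}

# Fixed reference date so the result is reproducible
days_since_last_post(post_dates, today = as.Date("2017-06-10"))
```

Reading the date from an attribute rather than the visible text avoids locale-dependent month names, at the cost of depending on that attribute being present.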
Now, check who the authors of the newest entries are.
Check the rating of your favorite movie on imdb.com. You will probably need to inspect the page source and/or rely on SelectorGadget to identify the correct node.