The rvest package allows for simple and convenient extraction of data from the web into R, a practice often called “web scraping.” Web scraping is a basic and important skill that every data analyst should master, and it often appears as a job requirement.
In the following exercises, you will practice your scraping skills on the “Money” section of the CNN website. All of the main functions of the rvest package will be used. Answers to these exercises are available here.
Since websites are constantly changing, some of the solutions may become outdated over time. If this is the case, you are welcome to inform the author, and the relevant sections will be updated.
Read the HTML content of the following URL with a variable called
At this point, it will also be useful to open this web page in your browser.
Get the session details (status, type, size) of the above-mentioned URL.
Extract all of the sector names from the “Stock Sectors” table (bottom left of the web page).
Extract all of the “3 Month % Change” values from the “Stock Sectors” table.
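The general pattern behind the two table exercises above can be sketched on a small, made-up HTML snippet rather than the live page. The table id `sectors`, the link paths, and the cell values below are all assumptions for illustration; the real page will need its own selectors, and the code assumes rvest 1.0.0 or later for `html_elements()`.

```r
library(rvest)

# Toy stand-in for a web page; the id, paths, and values are invented.
page <- read_html('
  <table id="sectors">
    <tr><th>Sector</th><th>3 Month % Change</th></tr>
    <tr><td><a href="/sectors/energy">Energy</a></td><td>+1.20%</td></tr>
    <tr><td><a href="/sectors/utilities">Utilities</a></td><td>-0.45%</td></tr>
  </table>')

# Text of the first cell in every data row (header cells are <th>, so they are skipped).
sector_names <- page %>%
  html_elements("#sectors td:nth-child(1)") %>%
  html_text()

# Text of the second cell in every data row.
changes <- page %>%
  html_elements("#sectors td:nth-child(2)") %>%
  html_text()
```

Selecting by column position with `td:nth-child()` is only one option; a class or attribute selector may be more robust if the real table provides one.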
Extract the table “What’s Moving” (top middle of the web page) into a data frame.
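As a hedged illustration of turning an HTML table into a data frame, here is a minimal sketch using `html_table()` on an invented snippet. The class name `movers` and the table contents are assumptions, not taken from the CNN page, and `html_element()` assumes rvest 1.0.0 or later.

```r
library(rvest)

# Toy table; the class name and values are invented for this sketch.
page <- read_html('
  <table class="movers">
    <tr><th>Company</th><th>Price</th><th>Change</th></tr>
    <tr><td>AAA</td><td>10.50</td><td>+0.25</td></tr>
    <tr><td>BBB</td><td>7.10</td><td>-0.10</td></tr>
  </table>')

# Select the single table node and parse it; the <th> row becomes the header.
movers <- page %>%
  html_element("table.movers") %>%
  html_table()

# Recent rvest versions return a tibble; coerce to a plain data frame.
movers <- as.data.frame(movers)
```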
Re-construct all of the links from the first column of the “What’s Moving” table.
Hint: the base URL is “https://money.cnn.com”
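Rebuilding absolute links usually means reading the `href` attribute and prepending the base URL. The sketch below does this on a made-up table; the id `movers` and the relative paths are hypothetical, and only the base URL comes from the hint above.

```r
library(rvest)

# Toy table with relative links; the id and paths are invented.
page <- read_html('
  <table id="movers">
    <tr><td><a href="/quote/quote.html?symb=AAA">AAA</a></td><td>10.50</td></tr>
    <tr><td><a href="/quote/quote.html?symb=BBB">BBB</a></td><td>7.10</td></tr>
  </table>')

# Pull the href attribute of each link in the first column.
relative_links <- page %>%
  html_elements("#movers td:nth-child(1) a") %>%
  html_attr("href")

# Prepend the base URL to get absolute links.
base_url <- "https://money.cnn.com"
full_links <- paste0(base_url, relative_links)
```

Simple pasting works here because every toy path starts with a slash. For pages that mix relative and absolute links, `xml2::url_absolute()` may be a safer choice.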
Extract the titles under the “Latest News” section (bottom middle of the web page).
To understand the structure of the data in a web page, it is often useful to know the underlying attributes of the text you see.
Extract the attributes (and their values) of the HTML element that holds the timestamp underneath the “What’s Moving” table.
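To show what inspecting an element's attributes looks like, the following sketch calls `html_attrs()` on a single invented element. The id, class, and `data-updated` attribute are assumptions for illustration and will differ from whatever the real timestamp element carries.

```r
library(rvest)

# Toy element; its id, class, and data attribute are invented.
page <- read_html(
  '<div id="stamp" class="timestamp" data-updated="2020-01-01">Data as of 4:00pm ET</div>'
)

# Select one node, then list all of its attributes.
stamp <- page %>% html_element("#stamp")
attrs <- html_attrs(stamp)

# For a single node the result is a named character vector.
attrs[["class"]]
```

When called on several nodes at once, `html_attrs()` returns a list with one named vector per node.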
Extract the values of the blue percentage bars from the “Trending Tickers” table (bottom right of the web page).
Hint: in this case, the values are stored under the “class” attribute.
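When a value is encoded in a class name, one possible approach is to read the `class` attribute and pull the number out with a regular expression. The class naming scheme below (`pct-72` and so on) is purely hypothetical; the real page may encode its values differently.

```r
library(rvest)

# Toy bars; the "pct-NN" class naming scheme is an assumption.
page <- read_html('
  <div id="tickers">
    <div class="bar pct-72"></div>
    <div class="bar pct-35"></div>
  </div>')

# Read the full class string of each bar.
classes <- page %>%
  html_elements("#tickers .bar") %>%
  html_attr("class")

# Keep only the digits that follow "pct-" and convert them to numbers.
pct <- as.numeric(sub(".*pct-([0-9]+).*", "\\1", classes))
```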
Get the links of all of the “svg” images on the web page.
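Filtering image links by file extension can be sketched as follows, again on invented markup. This version assumes the images are plain `img` elements with a `src` attribute ending in `.svg`; the real page might embed SVGs in other ways.

```r
library(rvest)

# Toy page with mixed image types; all paths are invented.
page <- read_html('
  <body>
    <img src="/img/logo.svg">
    <img src="/img/photo.png">
    <img src="/img/arrow.svg">
  </body>')

# Collect every image source, then keep those ending in ".svg".
srcs <- page %>%
  html_elements("img") %>%
  html_attr("src")

svg_links <- srcs[grepl("\\.svg$", srcs)]
```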