In general, data analysis has four parts: data collection, data manipulation, data visualization and data insights. The tidyr package is one of the most useful packages for the second part, data manipulation, because tidy data is a key ingredient of a successful analysis.
Tidy data means that every column holds one variable and every row holds one observation. In this tutorial we are going to explore four key functions that tidyr provides and then visualize the result with the help of ggplot2.
PACKAGE INSTALLATION & DATA FRAME
The first thing you have to do is install and load the tidyr package with:
install.packages("tidyr")
library(tidyr)
We also need some data to work with, so we will create a small example data frame to use in the examples that follow.
nba <- data.frame(
  player = c("James", "Durant", "Curry", "Harden", "Paul", "Wade"),
  team = c("CLEOH", "GSWOAK", "GSWOAK", "HOUTX", "HOUTX", "CLEOH"),
  # Points are stored as numbers so they can be plotted on a continuous axis
  day1points = c(25, 23, 30, 41, 26, 20),
  day2points = c(24, 25, 33, 45, 26, 23)
)
print(nba)
gather()
We use gather() to take multiple columns and gather them into key-value pairs. Look at the example below:
nba %>%
  gather(day, points, c(day1points, day2points))
Specifically, we took the columns “day1points” and “day2points” and gathered them into two new columns: “day” (the key, which holds the old column names) and “points” (the value, which holds the scores). Each variable now has its own column and each observation its own row.
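To check what gather() did, it can help to compare the shape of the data before and after. The sketch below rebuilds the example data frame (with the points stored as numbers, an assumption made here for simplicity) and also shows pivot_longer(), which tidyr 1.0.0 and later offer as the newer interface for the same reshaping. The names long and long_new are only illustrative.

```r
library(tidyr)

# Example data (points stored as numbers: an assumption of this sketch)
nba <- data.frame(
  player = c("James", "Durant", "Curry", "Harden", "Paul", "Wade"),
  team = c("CLEOH", "GSWOAK", "GSWOAK", "HOUTX", "HOUTX", "CLEOH"),
  day1points = c(25, 23, 30, 41, 26, 20),
  day2points = c(24, 25, 33, 45, 26, 23),
  stringsAsFactors = FALSE
)

# Gather the two points columns into a key column (day) and a value column (points)
long <- gather(nba, day, points, c(day1points, day2points))

dim(nba)     # 6 rows, 4 columns
dim(long)    # 12 rows, 4 columns
names(long)  # "player" "team" "day" "points"

# The same reshaping with the newer interface (requires tidyr >= 1.0.0)
long_new <- pivot_longer(
  nba,
  cols = c(day1points, day2points),
  names_to = "day",
  values_to = "points"
)
nrow(long_new)  # 12
```

The row order differs between the two results, but both contain one row per player and day.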
spread()
The spread() function takes a key-value pair of columns and spreads it across multiple columns, one for each distinct value of the key. Think of it as the reverse of gather().
nba %>%
  gather(day, points, c(day1points, day2points)) %>%
  spread(day, points)
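A quick way to convince yourself that spread() undoes gather() is a round trip. This is a minimal sketch on a rebuilt copy of the example data (points stored as numbers, an assumption of this sketch); the names long and wide are illustrative.

```r
library(tidyr)

# Example data (points stored as numbers: an assumption of this sketch)
nba <- data.frame(
  player = c("James", "Durant", "Curry", "Harden", "Paul", "Wade"),
  team = c("CLEOH", "GSWOAK", "GSWOAK", "HOUTX", "HOUTX", "CLEOH"),
  day1points = c(25, 23, 30, 41, 26, 20),
  day2points = c(24, 25, 33, 45, 26, 23),
  stringsAsFactors = FALSE
)

# Long format: one row per player and day
long <- gather(nba, day, points, c(day1points, day2points))

# Back to wide format: one column per distinct value of day
wide <- spread(long, day, points)

names(wide)  # "player" "team" "day1points" "day2points"
nrow(wide)   # 6
```

The rows of wide may come back in a different order than in nba, so compare the two by player rather than by row position.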
separate()
The separate() function splits the values of one column into several columns. As you have probably noticed by now, the column “team” is a little bit weird. It obviously holds more than one variable, and we have to fix that. The team and the team’s US state are our variables, so we have to create two columns, one for the team and the other for the state.
nba %>%
  gather(day, points, c(day1points, day2points)) %>%
  separate(col = team, into = c("Team", "State"), sep = 3)
As you have probably guessed, a numeric sep argument gives the character position at which to split each value: sep = 3 puts the first three characters into “Team” and the rest into “State”.
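To see the effect of sep = 3 in isolation, here is a small sketch on a stand-alone data frame that holds just the three distinct team codes. The names teams and split_teams are made up for this illustration.

```r
library(tidyr)

# A tiny data frame with only the three distinct team codes
teams <- data.frame(
  team = c("CLEOH", "GSWOAK", "HOUTX"),
  stringsAsFactors = FALSE
)

# Split each value after its third character
split_teams <- separate(teams, col = team, into = c("Team", "State"), sep = 3)

split_teams$Team   # "CLE" "GSW" "HOU"
split_teams$State  # "OH"  "OAK" "TX"
```

Splitting by position works here only because every team abbreviation is exactly three characters long; the remainder can have any length.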
unite()
The unite() function is the reverse of separate(): it unites two or more columns into one. By default the values are joined with an underscore.
nba %>%
  gather(day, points, c(day1points, day2points)) %>%
  separate(col = team, into = c("Team", "State"), sep = 3) %>%
  unite(teamstate, Team, State)
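Because unite() joins values with an underscore unless told otherwise, the united column is not identical to the original “team” column. The sketch below, on a hypothetical stand-alone data frame called parts, compares the default with sep = "", which restores the original codes.

```r
library(tidyr)

# Already-separated team and state codes (illustrative data)
parts <- data.frame(
  Team = c("CLE", "GSW", "HOU"),
  State = c("OH", "OAK", "TX"),
  stringsAsFactors = FALSE
)

# Default separator is "_"
with_underscore <- unite(parts, teamstate, Team, State)
with_underscore$teamstate  # "CLE_OH" "GSW_OAK" "HOU_TX"

# An empty separator glues the values back together unchanged
no_separator <- unite(parts, teamstate, Team, State, sep = "")
no_separator$teamstate     # "CLEOH" "GSWOAK" "HOUTX"
```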
Visualization
Now that our dataset is tidy at last, let’s visualize what we created with ggplot2. Install and load it with:
install.packages("ggplot2")
library(ggplot2)
nba %>%
  gather(day, points, c(day1points, day2points)) %>%
  separate(col = team, into = c("Team", "State"), sep = 3) %>%
  ggplot(aes(x = day, y = points)) +
  geom_point() +
  facet_wrap(~ Team) +
  geom_smooth(method = "lm", aes(group = 1), se = FALSE)
As you can see, we have created three plots, one for every team (facet_wrap), each with “day” on the x-axis and “points” on the y-axis (ggplot(aes(x = day, y = points))). Within every plot the points fall into two columns, one for the points of day 1 and the other for the points of day 2.
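What is the significance of writing nba %>% before gathering?
The %>% operator (the pipe) passes nba to gather() as its first argument, so writing nba %>% gather(...) is the same as writing gather(nba, ...). You state the dataset once, and every following step in the chain works on the result of the previous step without you having to name nba again.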
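A small sketch can make this concrete: the piped call and the plain function call should give the same result. The data frame is rebuilt here with numeric points (an assumption of this sketch), and the names piped and direct are illustrative.

```r
library(tidyr)  # tidyr also makes the %>% operator available

# Example data (points stored as numbers: an assumption of this sketch)
nba <- data.frame(
  player = c("James", "Durant", "Curry", "Harden", "Paul", "Wade"),
  team = c("CLEOH", "GSWOAK", "GSWOAK", "HOUTX", "HOUTX", "CLEOH"),
  day1points = c(25, 23, 30, 41, 26, 20),
  day2points = c(24, 25, 33, 45, 26, 23),
  stringsAsFactors = FALSE
)

# With the pipe: nba becomes the first argument of gather()
piped <- nba %>% gather(day, points, c(day1points, day2points))

# Without the pipe: nba is written as the first argument explicitly
direct <- gather(nba, day, points, c(day1points, day2points))

# Both calls produce the same data frame
identical(piped, direct)
```

The benefit of the pipe shows up in longer chains, where each step reads top to bottom instead of being nested inside the next call.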