dplyr is an R package for transforming and summarizing tabular data organized in rows and columns.
It includes a set of functions that filter rows, select specific columns, re-order rows, add new columns, and summarize data.
dplyr also supports another common task, the “split-apply-combine” concept, through its group_by() function.
Compared to base R functions, the dplyr functions are easier to use, more consistent in their syntax, and designed to work on data frames rather than just vectors.
Package Installation & Data Frame
The first step is to install and load the dplyr package. We also need a dataset to work with. Here we use iris, the famous (Fisher’s or Anderson’s) iris data set that ships with R. It gives the measurements in centimeters of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica.
You can use head(iris) to see the first rows and the variables of your dataset.
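The setup described above might look like the following minimal sketch. It assumes dplyr is installed from CRAN with install.packages(); the requireNamespace() guard is our addition so the install only runs when the package is missing.

```r
# Install dplyr once if it is not already available, then load it.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)

# iris ships with R in the datasets package, so nothing needs downloading.
head(iris)   # first six rows
dim(iris)    # number of rows and columns
names(iris)  # column names
```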
Important dplyr Functions
Function = Description
select() = Select columns
filter() = Filter rows
arrange() = Re-order or arrange rows
mutate() = Create new columns
summarize() = Summarize values
group_by() = Allows for group operations in the “split-apply-combine” concept
Select Columns With Select()
To select a set of columns, list them after the data frame:
select(iris, Sepal.Length, Sepal.Width)
To select all the columns except a specific column, use the “-” (subtraction) operator.
To select a range of columns by name, use the “:” (colon) operator.
To select all columns that start with the character string “S”, use the function starts_with().
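The three selection styles just described can be sketched as follows. The variable names are ours and purely illustrative; the column names come from iris.

```r
library(dplyr)

# Drop one column with the minus operator.
no_sepal_length <- select(iris, -Sepal.Length)

# Keep a contiguous range of columns by name with the colon operator.
sepal_to_petal_length <- select(iris, Sepal.Length:Petal.Length)

# Keep columns whose names start with "S".
s_columns <- select(iris, starts_with("S"))

names(no_sepal_length)
names(sepal_to_petal_length)
names(s_columns)
```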
You can also select columns based on specific criteria with:
ends_with() = Select columns that end with a character string
contains() = Select columns that contain a character string
matches() = Select columns that match a regular expression
one_of() = Select columns whose names are in a given group of names
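As a rough illustration of these helpers on iris, consider the sketch below. The search strings and the wanted vector are our own choices. Note that recent dplyr releases mark one_of() as superseded by any_of() and all_of(), although it still works.

```r
library(dplyr)

# Columns whose names end with "Width".
ends_width <- select(iris, ends_with("Width"))

# Columns whose names contain "Petal".
has_petal <- select(iris, contains("Petal"))

# Columns whose names match a regular expression (names beginning with "Sepal.").
sepal_regex <- select(iris, matches("^Sepal\\."))

# Columns named in a character vector (illustrative vector of our choosing).
wanted <- c("Species", "Petal.Width")
by_name <- select(iris, one_of(wanted))

names(ends_width)
names(has_petal)
names(sepal_regex)
names(by_name)
```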
Select Rows With Filter()
To filter the rows for Sepal.Length >= 4.6, you can use:
filter(iris, Sepal.Length >= 4.6)
To filter the rows for Sepal.Length >= 4.6 and Petal.Width >= 0.5, you can use:
filter(iris, Sepal.Length >= 4.6, Petal.Width >= 0.5)
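Conditions separated by commas in filter() are combined with a logical AND. The sketch below checks that against an explicit & and, as an extra illustration that goes beyond the text above, shows an OR condition with the | operator.

```r
library(dplyr)

# Comma-separated conditions behave like a logical AND.
with_comma <- filter(iris, Sepal.Length >= 4.6, Petal.Width >= 0.5)
with_and   <- filter(iris, Sepal.Length >= 4.6 & Petal.Width >= 0.5)
all.equal(with_comma, with_and)

# Rows that satisfy either condition use the | operator instead.
either <- filter(iris, Sepal.Length >= 4.6 | Petal.Width >= 0.5)
nrow(with_comma)
nrow(either)
```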
Pipe Operator: %>%
Now we are going to talk about the pipe operator, %>%. This operator allows you to pipe the output from one function to the input of another function.
Let’s pipe the iris data frame to select(), which keeps two columns (Sepal.Length and Sepal.Width), and then pipe the new data frame to head(), which returns its first rows.
iris %>%
  select(Sepal.Length, Sepal.Width) %>%
  head()
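To make the role of the pipe concrete, this sketch writes the same operation twice, once as nested function calls and once as a pipeline, and compares the results. The names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: read from the inside out.
nested <- head(select(iris, Sepal.Length, Sepal.Width))

# Piped form: read from top to bottom.
piped <- iris %>%
  select(Sepal.Length, Sepal.Width) %>%
  head()

all.equal(nested, piped)
```

The piped form tends to stay readable as more steps are added, since each step sits on its own line in the order it runs.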
Arrange Rows With Arrange()
To arrange rows by a particular column, such as Sepal.Width, pass the name of that column to arrange().
iris %>% arrange(Sepal.Width) %>% head
Now, we will select three columns from iris and arrange the rows by Sepal.Length, using Sepal.Width to order rows that share the same Sepal.Length. Finally, we will show the head of the final data frame.
iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(Sepal.Length, Sepal.Width) %>%
  head()
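arrange() sorts in ascending order by default. As a small extension not covered in the text above, wrapping a column in desc() reverses the order for that column. The name longest_first is illustrative.

```r
library(dplyr)

# Longest sepals first; ties are ordered by ascending Sepal.Width.
longest_first <- iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%
  arrange(desc(Sepal.Length), Sepal.Width) %>%
  head()

longest_first
```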
Create New Columns With Mutate()
The mutate() function adds new columns to the data frame. Here we create a new column called proportion, which is the ratio of Sepal.Length to Sepal.Width, and show the head of the result.
iris %>%
  mutate(proportion = Sepal.Length / Sepal.Width) %>%
  head()
Create Summaries of the Data Frame With Summarize()
The summarize() function creates summary statistics for a given column in the data frame. For example, to compute the average Sepal.Length, apply the mean() function to the column Sepal.Length and call the summary value “avg_slength”.
iris %>%
  summarize(avg_slength = mean(Sepal.Length))
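summarize() accepts several summaries in one call. The sketch below computes the mean, minimum, and maximum of Sepal.Length together with a row count from n(), all over the whole data frame. The name overall is illustrative.

```r
library(dplyr)

# One row of summary statistics for the entire iris data frame.
overall <- iris %>%
  summarize(avg_slength = mean(Sepal.Length),
            min_slength = min(Sepal.Length),
            max_slength = max(Sepal.Length),
            total = n())

overall
```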
Group Operations Using Group_by()
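The group_by() verb is an important function in dplyr. It implements the “split-apply-combine” concept: we split the data frame by some variable (e.g. Sepal.Length), apply a function to the individual data frames, and then combine the output.
Here we split the iris data frame by Sepal.Length and compute summary statistics for each group. Because every row in a group shares the same Sepal.Length, the mean, minimum, and maximum coincide within each group; total = n() counts the rows in the group.
iris %>%
  group_by(Sepal.Length) %>%
  summarise(avg_slength = mean(Sepal.Length),
            min_slength = min(Sepal.Length),
            max_slength = max(Sepal.Length),
            total = n())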
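Grouping by a categorical column usually gives more informative summaries. As a sketch, the same statistics can be computed per species. Species is our choice of grouping variable here, not one used in the text above, and the name by_species is illustrative.

```r
library(dplyr)

# Split by Species, summarize Sepal.Length within each group, combine the rows.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(avg_slength = mean(Sepal.Length),
            min_slength = min(Sepal.Length),
            max_slength = max(Sepal.Length),
            total = n())

by_species
```

The result has one row per species, and total reports how many flowers fall in each group.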