Factor variables in R can be mind-boggling. Often you can avoid them and use character vectors instead; just don’t forget to set stringsAsFactors = FALSE (this has been the default since R 4.0.0). They are, however, very useful in some circumstances, such as statistical modelling and presenting data in graphs and tables. Relying on factors but misunderstanding them has been known to “eat up hours of valuable time in any given analysis”, as one member of the community put it. It is therefore a good investment to get them straight as early as possible on your R journey.
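As a quick illustration of the stringsAsFactors setting mentioned above, here is a minimal sketch on a made-up data frame (the object and column names are invented for this example and are not part of the exercises):

```r
# Toy data; df_chr, df_fct and the column "name" are hypothetical.
df_chr <- data.frame(name = c("a", "b"), stringsAsFactors = FALSE)
df_fct <- data.frame(name = c("a", "b"), stringsAsFactors = TRUE)

class(df_chr$name)  # "character"
class(df_fct$name)  # "factor"
```

Setting the argument explicitly keeps the result the same regardless of which R version runs the code.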
The intent behind these exercises is to help you find and fill in the cracks and holes in your relationship with factor variables.
Solutions are available here.
Exercise 1
Load the gapminder dataset from the gapminder package. Save it to an object called gp. Check programmatically how many factors it contains and how many levels each factor has.
Exercise 2
Notice that one continent, Antarctica, is missing from the corresponding factor. Add it as the sixth and last level.
Exercise 3
Actually, you change your mind. There is no permanent human population on Antarctica. Drop this (unused) level from your factor. If you can find three ways to do this, you are an expert.
Exercise 4
Again, modify the continent factor, making it more precise. Replace the Americas level with two new levels, North-America and South-America. The countries in the following vector should be classified as South-America and the remaining countries of the Americas as North-America.
c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
"Paraguay", "Peru", "Uruguay", "Venezuela")
Exercise 5
Get the levels of the factor in alphabetical order.
Exercise 6
Re-order the continent levels again so that they appear in order of total population in 2007.
Exercise 7
Reverse the order of the factor levels and define continent as an ordered factor.
Exercise 8
Make continent an unordered factor again and set North-America as the first level, so that it is interpreted as the reference group in modelling functions such as lm().
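To see why the first level matters, here is a small sketch using the built-in iris data rather than gapminder (so it does not give the exercise away). It only assumes that the datasets package that ships with R is available.

```r
# iris$Species is a factor whose first level is "setosa".
levels(iris$Species)  # "setosa" "versicolor" "virginica"

# With the default treatment contrasts, lm() uses the first level as the
# reference group, so it gets no coefficient of its own.
fit <- lm(Sepal.Length ~ Species, data = iris)
names(coef(fit))  # "(Intercept)" "Speciesversicolor" "Speciesvirginica"
```

The intercept is the mean for the reference level, and the other coefficients are differences from it.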
Exercise 9
Turn the following messy vector into a factor with two levels, Female and Male, using the labels argument of factor(). (PS: you can save some time by applying tolower() and trimws() before you apply factor().)
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")
Exercise 10
Use the fact that factors are built on top of integers to create a dummy (binary) variable male that takes the value 1 if gender has the value “Male” and 0 otherwise.
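The idea that factors are built on top of integers can be seen in a minimal sketch on a different toy vector (the name size and its values are invented for illustration):

```r
# A hypothetical factor with explicitly ordered levels.
size <- factor(c("small", "large", "small"), levels = c("small", "large"))

typeof(size)      # "integer": the underlying storage type
as.integer(size)  # 1 2 1: each value is the position of its level
levels(size)      # "small" "large": the labels the codes point to
```

The integer codes follow the order of the levels, not the alphabetical order of the values, which is why the order of the levels matters for this kind of trick.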
More tutorials about R factors:
https://predictivehacks.com/rename-and-relevel-factors-in-r/