In this introduction to factors series, we will approach factors in a somewhat uncommon way.
Answers to the exercises are available here. For other parts in this series, follow the tag factors like a pro.
This exercise set assumes that you have worked with R in some capacity in the past. Specifically, it assumes you have worked with factors before but are not an expert in them. If you do not know what factor variables are, you may want to find an exercise set aimed at R beginners, although you might still find this one approachable (with some difficulty).
Let’s begin a deep dive into factor variables in R.
Before starting, it helps to learn about attributes in R and the attributes() function that lists them. Attributes store metadata about an R object, and any R object can carry them. In fact, you have probably already relied on attributes in your R code without noticing. Consider the following R code:
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
The code above returns the row names of the mtcars dataset. Simple enough, right? But have you ever wondered how a data frame like this differs from, say, a matrix of the same size and data values? The difference is in the attributes (and this is how many R objects are constructed).
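To make this concrete, here is a minimal sketch using a small made-up data frame (toy_df and toy_mat are hypothetical names, not part of the exercises). It suggests that the row names live in the attributes, and that a matrix holding the same values carries a different set of attributes.

```r
# Hypothetical toy data; not part of the exercises.
toy_df <- data.frame(x = c(1, 2), y = c(3, 4), row.names = c("first", "second"))

# A data frame stores its column names, class, and row names as attributes.
names(attributes(toy_df))

# rownames() reports the information kept in the "row.names" attribute.
rownames(toy_df)
attr(toy_df, "row.names")

# The same values as a matrix carry "dim" and "dimnames" instead.
toy_mat <- as.matrix(toy_df)
names(attributes(toy_mat))
```

Running the same calls on your own objects is a quick way to see what metadata R keeps alongside the data.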
Exercise 1
First see what the R console returns when you print mtcars, and then see what the result is when you run attributes() on mtcars.
Exercise 2
You may wonder, “Okay, so there is some info about mtcars. So what?” Well, these attributes come in quite handy in understanding how factor variables behave.
Define a variable called numeric_var with the values 1, 2, 3, 4, and 5. Create a factor variable called factor_numeric_var with the same values. Then compare the attributes of numeric_var and factor_numeric_var.
Exercise 3
If you wondered what the difference is between a numeric variable and a factor variable with the same values, you can see the reason now. What about a character variable and a factor variable with the same values?
Define a variable called character_var with the values A, B, C, C, and C. Create a factor variable called factor_character_var with the same values. Then compare the attributes of character_var and factor_character_var.
Exercise 4
By now you have noticed that factor variables have an attribute called levels. If you have used factor variables before (as you should have), then you have probably used the levels() function to do something to a factor variable. Assigning through levels() modifies this attribute, and something else as well, as you will see.
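As a quick illustration of the reading side, the sketch below uses a made-up factor (sizes is a hypothetical name, not one of the exercise variables). It shows that levels() returns what is stored in the levels attribute, and that a factor keeps integer codes that index into those levels.

```r
# Hypothetical example factor; not one of the exercise variables.
sizes <- factor(c("small", "large", "small"))

# levels() returns the contents of the "levels" attribute.
identical(levels(sizes), attr(sizes, "levels"))

# Underneath, the factor is a vector of integer codes indexing into the levels.
as.integer(sizes)
levels(sizes)[as.integer(sizes)]
```

Keeping this code-plus-lookup picture in mind should make the results of the next few exercises easier to interpret.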
Modify the first level of factor_character_var from “A” to “a”. Before and after doing this modification, look at the levels of factor_character_var and compare them. Also look at how factor_character_var itself changed.
Exercise 5
Interesting. So what if we try to change the second value of factor_character_var from “B” to “b” by simply assigning to it?
Modify the second value of factor_character_var from “B” to “b”. Before and after doing this modification, look at the levels of factor_character_var and compare them. Also look at how factor_character_var itself changed. Don’t be surprised if something unexpected happens.
Exercise 6
See how the levels stayed the same but the value itself became NA? This is because the levels attribute lists all the values the factor variable can take. Because you attempted to change the second value from “B” to “b”, and “b” is not one of the levels, R issues an “invalid factor level” warning and stores NA instead.
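The same behaviour can be reproduced on a made-up factor (fruit and warning_text are hypothetical names, not exercise variables). This sketch assigns a value that is not among the levels and captures the resulting warning so that it can be inspected afterwards.

```r
# Hypothetical example factor; not one of the exercise variables.
fruit <- factor(c("apple", "banana", "apple"))

# Assigning a value that is not an existing level raises a warning and
# stores NA in that position. withCallingHandlers() records the warning
# message and lets the assignment finish.
warning_text <- NULL
withCallingHandlers(
  fruit[2] <- "cherry",
  warning = function(w) {
    warning_text <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

is.na(fruit[2])   # the second value is now NA
levels(fruit)     # the levels are unchanged
warning_text      # the captured warning message
```

Capturing the message is only for illustration; in interactive use you would simply see the warning printed in the console.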
Fix the problem from Exercise 5 by using the technique you learned in Exercise 4. Make sure both the level and the value at the second index become “b”.
Exercise 7
So far we’ve covered what happens if you change levels and values of a factor variable that appear only once in the vector (“A” and “B”). What if we change the “C” level to “c”? Use the appropriate method to change the level “C” to “c”.
Exercise 8
You might wonder, “If one of the attributes of a factor variable is the same as the levels of the factor variable, can I directly change the value of the levels attribute?” Indeed you can, but this is not the correct way to do it. According to the official R documentation, using levels() is not the same as, and is preferred to, modifying the levels via attributes. For the sake of completeness, and to learn how you might assign values to attributes, let’s try the following exercise.
Modify the level “c” to “c_via_attribute” by assigning the new level to the levels attribute of the factor_character_var variable. Just to repeat, this is NOT the correct way to modify levels for a factor variable in R. We are doing this exercise for the sake of completeness and to learn how attribute value assignment works!
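For reference, attribute assignment in general looks roughly like the sketch below, shown on a plain numeric vector with a made-up attribute (heights, unit, and unit_value are hypothetical names chosen for illustration). The form attr(x, "name") <- value sets a single attribute, and attributes(x) lists everything that is attached.

```r
# Hypothetical example; "unit" is an arbitrary attribute name chosen for illustration.
heights <- c(150, 162, 171)

attributes(heights)              # a plain vector starts with no attributes

attr(heights, "unit") <- "cm"    # attach a single attribute
unit_value <- attr(heights, "unit")
unit_value
names(attributes(heights))       # the vector now carries the "unit" attribute

attr(heights, "unit") <- NULL    # assigning NULL removes the attribute again
attributes(heights)
```

The data values are untouched throughout; only the metadata attached to the vector changes.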
Exercise 9
In most statistical modeling cases, your variable may be “coded” as simple characters that stand for something much more informative. For example, a patient’s sex may be coded as “M” or “F”, meaning “Male” or “Female”. This is a much more common scenario than the ones in the exercises above.
Create a factor variable sex with c("M", "M", "F") as its values. Change the levels of sex to “Male” and “Female” for “M” and “F” respectively. You should first try the method learned above. Then attempt to use the factor() function to do all this in one step. Consult the factor() documentation for help.
Exercise 10
You see some interesting results about the levels of the sex variable c("M", "M", "F"). The order of the levels starts to matter. Why is this? By default, R sorts factor levels alphabetically. Whenever you modify levels or assign labels to a factor, either via levels() or by factor(..., levels = ..., labels = ...), you have to take note of this default (alphabetical) ordering. As you saw in previous exercises, changing the levels changes the values themselves. If your new levels and labels are in a different order, then the underlying values themselves will change. Future articles will discuss how the order of factor levels matters. For the last exercise, use factor() to explicitly define the levels to be “Male” and “Female”, in this order.
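To see why order matters, here is a small sketch on a made-up variable (priority, default_factor, and explicit_factor are hypothetical names, not exercise variables). It compares the default alphabetical levels with levels given explicitly, and shows that the underlying integer codes differ between the two.

```r
# Hypothetical example; not one of the exercise variables.
priority <- c("low", "high", "medium", "low")

# Default: levels are the sorted unique values.
default_factor <- factor(priority)
levels(default_factor)
as.integer(default_factor)

# Explicit: levels appear in the order supplied.
explicit_factor <- factor(priority, levels = c("low", "medium", "high"))
levels(explicit_factor)
as.integer(explicit_factor)

# The displayed values are the same either way; only the level order
# and the integer codes behind them differ.
identical(as.character(default_factor), as.character(explicit_factor))
```

Because the codes depend on the level order, relabelling levels without checking that order can silently attach the wrong label to a value.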