In this exercise, we will explore how to define a factor. To learn the fundamentals of how factor variables are constructed, check out the previous exercise.
In the last exercise, we learned that the order of the levels and labels matters when creating a factor variable using factor(..., levels = ..., labels = ...). As a reminder, we defined a factor variable called female_male_sex with F as the first factor level and M as the second factor level. From the previous exercise, you should have learned that:
- factor(my_variable) computes all the unique values of my_variable and stores them as the levels attribute
- levels are ordered alphabetically by default
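The default behavior listed above can be sketched as follows. This is only an illustration: my_variable is assumed to hold the values c("M", "M", "F") used throughout this exercise, since the original reminder code is not reproduced here.

```r
# Assumed example values; not necessarily the exact code of the last exercise.
my_variable <- c("M", "M", "F")

# With no levels argument, factor() collects the unique values
# and sorts them, so "F" comes before "M".
female_male_sex <- factor(my_variable)

levels(female_male_sex)        # "F" "M"
as.character(female_male_sex)  # "M" "M" "F" (observations keep their order)
```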
To explicitly define the order of the levels when the factor variable is created, you have to use the levels parameter.
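A minimal sketch of setting the level order at creation time. The name male_female_sex is hypothetical and only used here to tell the reordered factor apart from female_male_sex.

```r
values <- c("M", "M", "F")

# Default: levels are sorted alphabetically.
female_male_sex <- factor(values)

# Explicit order via the levels argument (male_female_sex is an assumed name).
male_female_sex <- factor(values, levels = c("M", "F"))

levels(female_male_sex)        # "F" "M"
levels(male_female_sex)        # "M" "F"
as.character(male_female_sex)  # "M" "M" "F"
```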
Note that the order of the values c("M", "M", "F") did not change, and it shouldn’t, because the order of the values matters if these are data observations. The only difference between the two variables, female_male_sex and its reordered counterpart, seems to be the levels attribute… or is it?
Using the function str(), which displays “the internal structure of an R object” (via the R documentation, ?str), compare the two variables.
The output of str() shows an important fact about factors: the values of a factor variable are stored internally as numbers. Specifically, they are stored as integer indices into the levels. We may cover why this is so in another exercise, but for now the simple answer is efficiency: it is more efficient to store numbers in computers than character strings. This clever use of indices is also why factor variables have a levels attribute.
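To make the index idea concrete, here is a small sketch that compares the two orderings with str(). The variable name male_female_sex is an assumption introduced for illustration.

```r
values <- c("M", "M", "F")

female_male_sex <- factor(values)                        # levels: F, M
male_female_sex <- factor(values, levels = c("M", "F"))  # assumed name; levels: M, F

str(female_male_sex)  # Factor w/ 2 levels "F","M": 2 2 1
str(male_female_sex)  # Factor w/ 2 levels "M","F": 1 1 2

# Looking up each stored index in the levels recovers the same values.
levels(female_male_sex)[as.integer(female_male_sex)]  # "M" "M" "F"
levels(male_female_sex)[as.integer(male_female_sex)]  # "M" "M" "F"
```

The printed values are the same for both variables, but the stored indices differ because each index points into a differently ordered set of levels.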
Use typeof to compare the type of c("M", "M", "F") with that of its factor version. The results of Exercise 2 will confirm those of Exercise 1. You can also use as.numeric to tell R to return the indices of the factor variable.
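One possible way to carry out this comparison, sketched with illustrative variable names:

```r
values <- c("M", "M", "F")
values_factor <- factor(values)  # illustrative name

typeof(values)         # "character"
typeof(values_factor)  # "integer"

# as.numeric() returns the internal indices, not the labels.
as.numeric(values_factor)  # 2 2 1
```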
Let’s look at what happens when the vector we provide to factor() has a missing value (NA) or a null object (NULL). Define factor variables female_male_sex_with_NA and female_male_sex_with_NULL, where the values are c("M", "M", "F") and the fourth value is NA or NULL, respectively. Look at the output values, the levels, and the length of each variable.
The NULL value is simply dropped from the generated factor variable, because c() discards NULL before factor() ever sees it. NA is not included when computing the list of levels in the default R factor(). Why the value in the factor variable becomes <NA> was covered in the last tutorial.
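A short sketch of the NA and NULL cases. The name female_male_sex_with_NA is assumed by analogy with female_male_sex_with_NULL.

```r
with_NA   <- c("M", "M", "F", NA)
with_NULL <- c("M", "M", "F", NULL)

# c() drops NULL, so the second vector has only three elements.
length(with_NA)    # 4
length(with_NULL)  # 3

female_male_sex_with_NA   <- factor(with_NA)    # assumed name
female_male_sex_with_NULL <- factor(with_NULL)

# By default NA is not a level, but it is still kept as a value.
levels(female_male_sex_with_NA)    # "F" "M"
levels(female_male_sex_with_NULL)  # "F" "M"
length(female_male_sex_with_NA)    # 4
length(female_male_sex_with_NULL)  # 3
is.na(female_male_sex_with_NA)     # FALSE FALSE FALSE TRUE
```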
If we want NA to be a level, we use the parameter exclude=NULL to include it in the list of levels of the factor variable. Repeat the previous exercise with exclude=NULL and compare the results.
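As a sketch of what exclude=NULL changes (the variable name with_NA_level is illustrative):

```r
values <- c("M", "M", "F", NA)

# exclude = NULL means "exclude nothing", so NA is kept as a level.
with_NA_level <- factor(values, exclude = NULL)

levels(with_NA_level)      # "F" "M" NA
nlevels(with_NA_level)     # 3
as.integer(with_NA_level)  # 2 2 1 3 (NA now has its own index)
```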
NA is the last level. Let’s see what else we can do with the exclude parameter.
Create a factor variable with values c("M", "M", "F", NA) and exclude NA. Also create another factor variable excluding only "F". Create a third variable excluding both NA and "F". Also look at the structure of all of these factor variables.
You can see that both the character values and the internal numeric index values change when you use the exclude parameter. The exclude parameter can be used to add NA to, or remove it from, the factor levels. This fact will come in handy in future exercises about how other R functions treat factor variables.
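The three variants can be sketched as below; the variable names are illustrative. Note that exclude replaces the default of NA rather than adding to it, so excluding only "F" leaves NA in the levels.

```r
values <- c("M", "M", "F", NA)

no_NA   <- factor(values, exclude = NA)          # the default behavior
no_F    <- factor(values, exclude = "F")         # NA is no longer excluded
no_F_NA <- factor(values, exclude = c("F", NA))  # both are excluded

levels(no_NA)    # "F" "M"
levels(no_F)     # "M" NA
levels(no_F_NA)  # "M"

# Values whose level was excluded show up as NA in the internal indices.
as.integer(no_NA)    # 2 2 1 NA
as.integer(no_F_NA)  # 1 1 NA NA

str(no_F)
```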
Let’s switch our focus to factor variables that are not characters. Factor variables don’t have to be built from character values; they can be numeric. An example of this scenario is a variable that encodes sex as 0 or 1, where 0 is “F” and 1 is “M”. Let’s define a numeric factor variable numeric_sex where, instead of c("M", "M", "F", NA), the values are c(1, 1, 0). Look at the levels and the structure.
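A minimal sketch of the numeric case:

```r
numeric_sex <- factor(c(1, 1, 0))

# Levels are always stored as character strings, even for numeric input.
levels(numeric_sex)  # "0" "1"

str(numeric_sex)     # Factor w/ 2 levels "0","1": 2 2 1
```

The stored indices (2 2 1) are not the original values (1 1 0); they are positions in the levels.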
Another default behavior of factor() is that numeric levels are sorted in numeric order. Hmm, so what if the values were a mixture of letters and numbers? Look at the factor variables for c("a", 0, 1, "b") and c("a", "0", "1", "b").
The last exercise was a trick exercise. The two vectors actually produce identical results, because when you mix character and numeric values in one vector, all the numeric values are converted to characters. The exercise also shows that, by default, levels are ordered with numbers first and then letters.
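The coercion can be checked directly. This is a small sketch with illustrative variable names; the exact sort order of the levels can depend on the locale, although digits sort before letters in common ones.

```r
mixed   <- c("a", 0, 1, "b")      # numbers are coerced to character by c()
as_text <- c("a", "0", "1", "b")

typeof(mixed)              # "character"
identical(mixed, as_text)  # TRUE

mixed_factor   <- factor(mixed)
as_text_factor <- factor(as_text)

identical(mixed_factor, as_text_factor)  # TRUE
levels(mixed_factor)                     # digits first, then letters
```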
Let’s see if factors built from numeric variables behave the same way for NA and NULL. Repeat Exercise 3 by comparing c(1, 2, 3, NA) and c(1, 2, 3, NULL). Similarly, repeat Exercises 4 and 5.
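A sketch of the numeric version of the NA and NULL comparison, including the exclude=NULL case. The variable names are illustrative.

```r
num_with_NA   <- factor(c(1, 2, 3, NA))
num_with_NULL <- factor(c(1, 2, 3, NULL))  # c() drops the NULL

length(num_with_NA)    # 4
length(num_with_NULL)  # 3
levels(num_with_NA)    # "1" "2" "3"
levels(num_with_NULL)  # "1" "2" "3"

# As with character input, exclude = NULL keeps NA as the last level.
num_NA_level <- factor(c(1, 2, 3, NA), exclude = NULL)
levels(num_NA_level)      # "1" "2" "3" NA
as.integer(num_NA_level)  # 1 2 3 4
```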
Remember how each value of a factor variable is actually an index into the levels of the factor variable? Take a look at what the following factor structure results in. Take very careful note of the indices and how they relate to the levels. Just because the underlying factor levels are numbers, you should not confuse them with the internal indices of the corresponding factor variable. In fact, using as.numeric to fetch the original numbers is a common mistake.
<!–begin.rcode, echo=TRUE, eval=TRUE, message=FALSE