Fighting Factors with Cats: Exercises

 

In this exercise set, we will practice using the forcats factor manipulation package by Hadley Wickham. In the last exercise set, we saw that it is entirely possible to deal with factors in base R, but also that things can get a bit involved and unintuitive. Forcats simplifies many common factor manipulation tasks and is worth mastering if you cannot avoid using factors in your work. Also, studying the package and its source code can give you ideas for writing your own custom functions to simplify everyday tasks that you think can be dealt with in a better way.
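To give a feel for the difference, here is a minimal sketch that orders factor levels by frequency, first in base R and then with forcats. The toy vector fruit and the object names are made up for illustration and are not part of the exercises.

```r
library(forcats)

# Hypothetical toy data, not part of the exercises
fruit <- factor(c("banana", "apple", "cherry", "banana", "cherry", "banana"))

# Base R: count, sort the counts, then rebuild the factor with those levels
by_freq_base <- factor(fruit, levels = names(sort(table(fruit), decreasing = TRUE)))

# forcats: a single call that states the intent
by_freq_cats <- fct_infreq(fruit)

levels(by_freq_base)
levels(by_freq_cats)
```

Both approaches should produce the same level order here; the forcats version simply has fewer moving parts.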

Solutions are available here.

Exercise 1

Load the gapminder data-set from the gapminder package, as well as forcats. Check what the levels of the continent factor variable are and their frequency in the data.

Exercise 2

Notice that one continent, Antarctica, is missing – add it as the last level of six.

Exercise 3

Actually, you change your mind. There is no permanent human population on Antarctica. Drop this (unused) level from your factor.

Exercise 4

Again, modify the continent factor, making it more precise. Add two new levels: instead of Americas, add North America and South America. The countries in the following vector should be classified as South America and the rest as North America.

c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
"Paraguay", "Peru", "Uruguay", "Venezuela")

Exercise 5

Arrange the levels of the continent factor in alphabetical order.

Exercise 6

Re-order the continent levels again so that they appear in order of total population in 2007.

Exercise 7

Reverse the order of the factor levels.

Exercise 8

Make continent, again, an unordered factor. Set North America as the first level, therefore interpreted as a reference group in modeling functions such as lm().
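Why the first level matters: with R's default treatment contrasts, lm() absorbs the first level into the intercept and reports the remaining levels relative to it. The sketch below uses made-up data (y, group, d_a and d_c are illustrative names only) and sets the level order directly through factor(), so it does not give away the forcats solution.

```r
# Made-up data: three groups with means 2.5, 6.5 and 10.5
y     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
group <- rep(c("a", "b", "c"), each = 4)

# Same data, two different level orders
d_a <- data.frame(y = y, group = factor(group, levels = c("a", "b", "c")))
d_c <- data.frame(y = y, group = factor(group, levels = c("c", "a", "b")))

fit_a <- lm(y ~ group, data = d_a)
fit_c <- lm(y ~ group, data = d_c)

# The intercept is the mean of the reference (first) level;
# the other coefficients are differences from that level
coef(fit_a)
coef(fit_c)
```

The fitted values are identical in both models; only the parameterization, and therefore the interpretation of each coefficient, changes.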

Exercise 9

Turn the following messy vector into a factor with two levels: “Female” and “Male” using the factor function. Use the labels argument in the factor() function.
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")

Exercise 10

Gender can be considered sensitive data. Convert the gender variable into a factor that takes the integer values “1” and “2”, where one integer represents female and the other male, but make the choice randomly.




Fighting Factors With Cats: Solutions

Below are the solutions to these exercises on “Fighting Factors With Cats.”

####################
#                  #
#    Exercise 1    #
#                  #
####################
library(gapminder)
library(forcats)
# Solution based on version:
packageVersion("forcats")
## [1] '0.3.0'
packageVersion("gapminder")
## [1] '0.3.0'
gp <- gapminder

fct_count(gp$continent)
## # A tibble: 5 x 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24
# Btw. Did you notice the following?
# Helps you remember the name of the package
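set.seed(5945) # How did I know to set the seed like that?
# Note: R 3.6.0 changed the default algorithm behind sample(), so on newer
# versions you may need RNGkind(sample.kind = "Rounding") to reproduce this.
paste(strsplit("forcats", "")[[1]][sample(1:7)], collapse = "")
## [1] "factors"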
####################
#                  #
#    Exercise 2    #
#                  #
####################
gp$continent <- fct_expand(gp$continent, "Antarctica")

# See how it changed:
fct_count(gp$continent)
## # A tibble: 6 x 2
##   f              n
##   <fct>      <int>
## 1 Africa       624
## 2 Americas     300
## 3 Asia         396
## 4 Europe       360
## 5 Oceania       24
## 6 Antarctica     0
####################
#                  #
#    Exercise 3    #
#                  #
####################
gp$continent <- fct_drop(gp$continent)
# See how it changed:
fct_count(gp$continent)
## # A tibble: 5 x 2
##   f            n
##   <fct>    <int>
## 1 Africa     624
## 2 Americas   300
## 3 Asia       396
## 4 Europe     360
## 5 Oceania     24
####################
#                  #
#    Exercise 4    #
#                  #
####################
samer_c <- c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
  "Paraguay", "Peru", "Uruguay", "Venezuela")
gp$continent <- fct_expand(gp$continent, "South America", "North America")


gp$continent[gp$country %in% samer_c] <- "South America"
gp$continent[gp$continent == "Americas"] <- "North America"

# Now drop Americas
gp$continent <- fct_drop(gp$continent)

# See how it changed:
fct_count(gp$continent)
## # A tibble: 6 x 2
##   f                 n
##   <fct>         <int>
## 1 Africa          624
## 2 Asia            396
## 3 Europe          360
## 4 Oceania          24
## 5 South America   120
## 6 North America   180
####################
#                  #
#    Exercise 5    #
#                  #
####################
gp$continent <- fct_relevel(gp$continent, sort(levels(gp$continent)))
# See how it changed:
fct_count(gp$continent)
## # A tibble: 6 x 2
##   f                 n
##   <fct>         <int>
## 1 Africa          624
## 2 Asia            396
## 3 Europe          360
## 4 North America   180
## 5 Oceania          24
## 6 South America   120
####################
#                  #
#    Exercise 6    #
#                  #
####################
# fct_reorder() is perfect for simpler tasks, but this query seems a bit too
# complex for it. Instead, we can use our favorite data manipulation
# technique and feed the result to fct_relevel().
total_pop_c <- aggregate(pop ~ continent, gp[gp$year == 2007, ], sum)


gp$continent <- fct_relevel(
  gp$continent,
  as.character(total_pop_c[order(-total_pop_c$pop), ]$continent)
)


####################
#                  #
#    Exercise 7    #
#                  #
####################
gp$continent <- fct_rev(gp$continent)


####################
#                  #
#    Exercise 8    #
#                  #
####################
gp$continent <- fct_relevel(gp$continent, "North America")


####################
#                  #
#    Exercise 9    #
#                  #
####################
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")
gender <- as_factor(gender)
gender <- fct_collapse(
  gender,
  Female = c("f", "female", "FEMALE"),
  Male   = c("m ", "m", "male ", "male", "Male")
)
fct_count(gender)
## # A tibble: 2 x 2
##   f          n
##   <fct>  <int>
## 1 Female     4
## 2 Male       5
# Or using fct_relabel()
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")
gender <- as_factor(gender)
gender <- fct_relabel(gender, ~ ifelse(tolower(substring(., 1, 1)) == "f", "Female", "Male"))

fct_count(gender)
## # A tibble: 2 x 2
##   f          n
##   <fct>  <int>
## 1 Female     4
## 2 Male       5
####################
#                  #
#    Exercise 10   #
#                  #
####################
gender <- fct_anon(gender)
fct_count(gender)
## # A tibble: 2 x 2
##   f         n
##   <fct> <int>
## 1 1         5
## 2 2         4



Facing The Facts About Factors: Solutions

Below are the solutions to these exercises on “Facing The Facts About Factors.”

####################
#                  #
#    Exercise 1    #
#                  #
####################
library(gapminder)
gp <- gapminder


# How many factors?
sum(sapply(gp, is.factor))
## [1] 2
# How many levels does each have?
lapply(Filter(is.factor, gp), nlevels)
## $country
## [1] 142
## 
## $continent
## [1] 5
# There are a number of other ways to achieve this.

####################
#                  #
#    Exercise 2    #
#                  #
####################

# See before
attributes(gp$continent)
## $levels
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania" 
## 
## $class
## [1] "factor"
levels(gp$continent) <- c(levels(gp$continent), "Antarctica")

# See how it changed:
attributes(gp$continent)
## $levels
## [1] "Africa"     "Americas"   "Asia"       "Europe"     "Oceania"   
## [6] "Antarctica"
## 
## $class
## [1] "factor"
table(gp$continent)
## 
##     Africa   Americas       Asia     Europe    Oceania Antarctica 
##        624        300        396        360         24          0
####################
#                  #
#    Exercise 3    #
#                  #
####################

# Method 1
gp$continent <- droplevels(gp$continent)
# Method 2
gp$continent <- factor(gp$continent)
# Method 3
gp$continent <- gp$continent[, drop = TRUE]
# There are definitely more - leave your solution in the comment section


####################
#                  #
#    Exercise 4    #
#                  #
####################
levels(gp$continent) <- c(levels(gp$continent), "North America", "South America")

samer_c <- c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
  "Paraguay", "Peru", "Uruguay", "Venezuela")

gp$continent[gp$country %in% samer_c] <- "South America"
gp$continent[gp$continent == "Americas"] <- "North America"
gp$continent <- droplevels(gp$continent)


table(gp$continent)
## 
##        Africa          Asia        Europe       Oceania North America 
##           624           396           360            24           180 
## South America 
##           120
####################
#                  #
#    Exercise 5    #
#                  #
####################
gp$continent <- factor(gp$continent, levels = sort(levels(gp$continent)))
levels(gp$continent)
## [1] "Africa"        "Asia"          "Europe"        "North America"
## [5] "Oceania"       "South America"
####################
#                  #
#    Exercise 6    #
#                  #
####################

# R provides the reorder function
gp2007 <- gp[gp$year == 2007, ]
gp2007$continent <- reorder(gp2007$continent, -gp2007$pop, sum)
gp$continent <- factor(gp$continent, levels = levels(gp2007$continent))
levels(gp$continent)
## [1] "Asia"          "Africa"        "Europe"        "North America"
## [5] "South America" "Oceania"
# But since we are dealing with a subset, this is maybe not the most efficient
# method. Here is an alternative:
popcon07 <- aggregate(pop ~ continent, gp[gp$year == 2007, ], sum)
gp$continent <- factor(
  gp$continent,
  levels = popcon07[order(popcon07$pop, decreasing = TRUE), "continent"]
)
levels(gp$continent )
## [1] "Asia"          "Africa"        "Europe"        "North America"
## [5] "South America" "Oceania"
####################
#                  #
#    Exercise 7    #
#                  #
####################
gp$continent <- factor(
  gp$continent,
  levels = rev(levels(gp$continent)),
  ordered = TRUE
)

# Now you can do comparisons such as:
head(gp$continent, 50) >= "Africa"
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [12]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [34]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [45]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
# (and the contrasts used in models will be different)

####################
#                  #
#    Exercise 8    #
#                  #
####################

# Method 1
# "ordered" is just an addition to the class, so we can remove it from there:
class(gp$continent)
## [1] "ordered" "factor"
class(gp$continent) <- "factor"
class(gp$continent)
## [1] "factor"
# Method 2 
gp$continent <- factor(gp$continent, ordered = FALSE)

levels(gp$continent)
## [1] "Oceania"       "South America" "North America" "Europe"       
## [5] "Africa"        "Asia"
gp$continent <- relevel(gp$continent, ref = "North America")
levels(gp$continent)
## [1] "North America" "Oceania"       "South America" "Europe"       
## [5] "Africa"        "Asia"
####################
#                  #
#    Exercise 9    #
#                  #
####################
gender <-  c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")

# Start by cleaning it a bit:
gender <- trimws(tolower(gender))
# What are the unique entries now?
unique(gender)
## [1] "f"      "m"      "male"   "female"
# Then turn to factor
gender <- factor(
  gender,
  levels = c("f", "female", "m", "male"),
  labels = rep(c("Female", "Male"), each = 2)
)
gender
## [1] Female Male   Male   Male   Female Female Male   Female Male  
## Levels: Female Male
####################
#                  #
#    Exercise 10   #
#                  #
####################

male <- as.integer(gender) - 1
male
## [1] 0 1 1 1 0 0 1 0 1
# also possible
male <- unclass(gender) - 1


# What about this?
deattr <- function(x) {
  attributes(x) <- NULL
  x
}

male <- deattr(gender) - 1
male
## [1] 0 1 1 1 0 0 1 0 1



Facing the Facts about Factors: Exercises

Factor variables in R can be mind-boggling. Often, you can just avoid them and use character vectors instead – just don’t forget to set stringsAsFactors=FALSE (this has been the default since R 4.0.0). They are, however, very useful in some circumstances, such as statistical modelling and presenting data in graphs and tables. Relying on factors but misunderstanding them has been known to “eat up hours of valuable time in any given analysis”, as one member of the community put it. It is therefore a good investment to get them straight as soon as possible on your R journey.
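A classic example of the kind of misunderstanding that costs time: a factor is stored as integer codes plus a vector of level labels, so converting it straight to a number returns the codes rather than the values. The sketch below uses a made-up vector (sizes is an illustrative name only).

```r
# A factor stores integer codes plus a character vector of levels
sizes <- factor(c("10", "5", "20"))

levels(sizes)      # levels are sorted as strings, not as numbers
as.integer(sizes)  # the underlying codes, i.e. positions in levels(sizes)

# Pitfall: converting straight to numeric returns the codes, not the values
as.numeric(sizes)

# Go through character first to recover the original numbers
as.numeric(as.character(sizes))
```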

The intent behind these exercises is to help you find and fill in the cracks and holes in your relationship with factor variables.

Solutions are available here.

Exercise 1

Load the gapminder data-set from the gapminder package. Save it to an object called gp. Check programmatically how many factors it contains and how many levels each factor has.

Exercise 2

Notice that one continent, Antarctica, is missing from the corresponding factor – add it as the last level of six.

Exercise 3

Actually, you change your mind. There is no permanent human population on Antarctica. Drop this (unused) level from your factor. If you can find three ways to do this, you are an expert.

Exercise 4

Again, modify the continent factor, making it more precise. Add two new levels in place of Americas: North America and South America. The countries in the following vector should be classified as South America and the rest as North America.

c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia", "Ecuador",
"Paraguay", "Peru", "Uruguay", "Venezuela")

Exercise 5

Get the levels of the factor in alphabetical order.

Exercise 6

Re-order the continent levels again so that they appear in order of total population in 2007.

Exercise 7

Reverse the order of the factor and define continents as an ordered factor.

Exercise 8

Make the continent an unordered factor again and set North America as the first level, so that it is interpreted as the reference group in modelling functions such as lm().

Exercise 9

Turn the following messy vector into a factor with two levels, Female and Male, using the factor() function and its labels argument (PS: you can save some time by applying tolower() and trimws() before you apply factor()).
gender <- c("f", "m ", "male ","male", "female", "FEMALE", "Male", "f", "m")

Exercise 10

Use the fact that factors are built on top of integers and create a dummy (binary) variable male that takes the value 1 if the gender has the value “Male.”




Melt and Cast The Shape of Your Data-Frame: Solutions

Below are the solutions to these exercises on “Melt and Cast The Shape of Your Data-Frame.”

####################
#                  #
#    Exercise 1    #
#                  #
####################
suppressMessages(library(data.table))
df <- data.frame(
  id = 1:2,
  q1 = c("A", "B"),
  q2 = c("C", "A"),
  stringsAsFactors = FALSE
)
df
##   id q1 q2
## 1  1  A  C
## 2  2  B  A
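# Note: df is a plain data.frame. Newer data.table versions warn about (or
# reject) melt() and dcast() calls on non-data.table input, so consider
# converting with setDT(df) or as.data.table(df) first.
dfl <- melt(df, id.vars = "id", variable.name = "question")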
dfl
##   id question value
## 1  1       q1     A
## 2  2       q1     B
## 3  1       q2     C
## 4  2       q2     A
####################
#                  #
#    Exercise 2    #
#                  #
####################
dcast(dfl, id ~ question, value.var = "value")
##   id q1 q2
## 1  1  A  C
## 2  2  B  A
####################
#                  #
#    Exercise 3    #
#                  #
####################
dcast(dfl, question ~ paste0("id_", id))
##   question id_1 id_2
## 1       q1    A    B
## 2       q2    C    A
####################
#                  #
#    Exercise 4    #
#                  #
####################
df2 <- data.frame(
  A = c("A1", "A12", "A31", "A4"),
  B = c("B4", "C7", "C3", "B9"),
  C = c("C3", "B16", "B3", "C4")
)
setDT(df2)

df2l <- melt(df2[, id := .I], id.vars = "id")
dcast(df2l, id ~ substr(value, 1, 1))[, -c("id")]
##      A   B  C
## 1:  A1  B4 C3
## 2: A12 B16 C7
## 3: A31  B3 C3
## 4:  A4  B9 C4
# Inspired by this question on SO:
# https://stackoverflow.com/a/50841771/4552295


####################
#                  #
#    Exercise 5    #
#                  #
####################
df3 <- data.frame(
  Join_ID = rep(1:3, each = 2),
  Type    = rep(c("a", "b"), 3),
  v2      = c(8, 9, 7, 6, 5, 4)*10
)

dcast(df3, Join_ID ~ paste0(Type, "_v2"), value.var = "v2")
##   Join_ID a_v2 b_v2
## 1       1   80   90
## 2       2   70   60
## 3       3   50   40
# https://stackoverflow.com/q/50839606/4552295


####################
#                  #
#    Exercise 6    #
#                  #
####################
library(AER)
data("Fertility")

Fertility$mother_id <- 1:nrow(Fertility)
ferl <- melt(Fertility, measure.vars = paste0("gender", 1:2), value.name = "gender", variable.name = "order")
ferl$order <- gsub("[a-z]", "", ferl$order)
head(ferl)
##   morekids age afam hispanic other work mother_id order gender
## 1       no  27   no       no    no    0         1     1   male
## 2       no  30   no       no    no   30         2     1 female
## 3       no  27   no       no    no    0         3     1   male
## 4       no  35  yes       no    no    0         4     1   male
## 5       no  30   no       no    no   22         5     1 female
## 6       no  26   no       no    no   40         6     1   male
####################
#                  #
#    Exercise 7    #
#                  #
####################
d1 = data.frame(
  ID=c(1,1,1,2,2,4,1,2),
  medication=c(1,2,3,1,2,7,2,8)
)
setDT(d1)
d1[, .(medications = paste0(medication, collapse = ", ")), by = .(ID)]
##    ID medications
## 1:  1  1, 2, 3, 2
## 2:  2     1, 2, 8
## 3:  4           7
####################
#                  #
#    Exercise 8    #
#                  #
####################
dfs <- data.frame(
  Name = c(rep("name1",3),rep("name2",2)),
  MedName = c("atenolol 25mg","aspirin 81mg","sildenafil 100mg", "atenolol 50mg","enalapril 20mg")
)

setDT(dfs)
dfs[, medn := paste0("medication_", 1:.N), by = Name]
dfs
##     Name          MedName         medn
## 1: name1    atenolol 25mg medication_1
## 2: name1     aspirin 81mg medication_2
## 3: name1 sildenafil 100mg medication_3
## 4: name2    atenolol 50mg medication_1
## 5: name2   enalapril 20mg medication_2
dcast(dfs, Name ~ medn, value.var = "MedName")
##     Name  medication_1   medication_2     medication_3
## 1: name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
## 2: name2 atenolol 50mg enalapril 20mg             <NA>
# or even cleaner:
dcast(dfs, Name ~ rowid(Name, prefix = "medication"), value.var = "MedName")
##     Name   medication1    medication2      medication3
## 1: name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
## 2: name2 atenolol 50mg enalapril 20mg             <NA>
# Inspired by
# https://stackoverflow.com/q/11322801/4552295

####################
#                  #
#    Exercise 9    #
#                  #
####################
df7 <- data.frame(
  v1 = c("name1, name2", "name3", "name4, name5"),
  v2 = c("1, 2", "3", "4, 5"),
  v3 = c(1, 2, 3)
)
df7
##             v1   v2 v3
## 1 name1, name2 1, 2  1
## 2        name3    3  2
## 3 name4, name5 4, 5  3
setDT(df7)
df7[, lapply(.SD, tstrsplit, ", "), by = v3][, .(v1,v2,v3)]
##       v1 v2 v3
## 1: name1  1  1
## 2: name2  2  1
## 3: name3  3  2
## 4: name4  4  3
## 5: name5  5  3
# This was a real problem on SO: 
# https://stackoverflow.com/q/29758504/4552295


####################
#                  #
#    Exercise 10   #
#                  #
####################
df <- data.frame(
  Method = c("10.fold.CV Lasso", "10.fold.CV.1SE", "BIC", "Modified.BIC"),
  n      = c(30, 30, 50, 50, 50, 50, 100, 100),
  lambda = c(1, 3, 1, 2, 2, 0, 1, 2),
  df     = c(21, 17, 29, 26, 25, 32, 34, 32)
)

dcast(df, Method ~ n, fill = "")
##             Method 30 50 100
## 1 10.fold.CV Lasso 21 25    
## 2   10.fold.CV.1SE 17 32    
## 3              BIC    29  34
## 4     Modified.BIC    26  32
library(magrittr) # provides the %>% pipe used below
df %>%
  melt(id.vars = c("Method", "n")) %>%
  dcast(Method ~ variable + n, fill = "")
# Inspired by:
# https://stackoverflow.com/q/50904997/4552295



Melt and Cast The Shape of Your Data-Frame: Exercises

Data-sets often arrive to us in a form that is different from what we need for our modeling or visualization functions, which, in turn, don’t necessarily require the same format.

Reshaping data.frames is a step that all analysts need, but many struggle with. Practicing this meta-skill will, in the long run, result in more time to focus on the actual analysis.

The solutions to this set rely on data.table, mostly melt() and dcast(), which originated in the reshape2 package. However, you can also get practice out of it using your favorite base R, tidyverse or other method, and then compare the results.
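For readers who want to try the base R route, here is a minimal sketch of a wide-to-long-and-back round trip with stats::reshape(). The data frame and its column names are made up for illustration and do not appear in the exercises.

```r
# Made-up wide data: one row per city, one column per month
wide <- data.frame(
  city     = c("Oslo", "Lima"),
  temp_jan = c(-3, 23),
  temp_jul = c(17, 16)
)

# Wide to long: stack the two temperature columns into one
long <- reshape(
  wide,
  direction = "long",
  idvar     = "city",
  varying   = c("temp_jan", "temp_jul"),
  v.names   = "temp",
  timevar   = "month",
  times     = c("jan", "jul")
)
rownames(long) <- NULL
long

# Long to wide: spread the months back out into columns
wide_again <- reshape(
  long,
  direction = "wide",
  idvar     = "city",
  timevar   = "month",
  v.names   = "temp",
  sep       = "_"
)
wide_again
```

The number of arguments needed for such a small job is one reason many people prefer melt() and dcast().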

Solutions are available here.

 

Exercise 1

Take the following data.frame from this form:

df <- data.frame(id = 1:2, q1 = c("A", "B"), q2 = c("C", "A"), stringsAsFactors = FALSE)
df
  id q1 q2
1  1  A  C
2  2  B  A 

to this:

  id question value
1  1       q1     A
2  2       q1     B
3  1       q2     C
4  2       q2     A

 

Exercise 2

Do the opposite; return the data.frame back to its original form.

Exercise 3

Set up the data.frame in terms of questions, as follows:

  question id_1 id_2
1       q1    A    B
2       q2    C    A

 

Exercise 4

The data entry behind this data.frame went a little bit wrong. Get all the C and B entries into their corresponding columns:

df2 <- data.frame(
  A = c("A1", "A12", "A31", "A4"), 
  B = c("B4", "C7", "C3", "B9"), 
  C = c("C3", "B16", "B3", "C4")
)

 

Exercise 5

Get this data.frame:

df3 <- data.frame(
  Join_ID = rep(1:3, each = 2), 
  Type    = rep(c("a", "b"), 3), 
  v2      = c(8, 9, 7, 6, 5, 4)*10
)

 

To look like this:

  Join_ID a_v2 b_v2
1       1   80   90
2       2   70   60
3       3   50   40

 

Exercise 6

Revisiting a data-set used in an earlier exercise set on data exploration,
load the AER package and run the command data("Fertility"), which loads the data-set Fertility to your work space.
Melt it into the following format, with one row per child.

head(ferl)
  morekids age afam hispanic other work mother_id order gender
1       no  27   no       no    no    0         1     1   male
2       no  30   no       no    no   30         2     1 female
3       no  27   no       no    no    0         3     1   male
4       no  35  yes       no    no    0         4     1   male
5       no  30   no       no    no   22         5     1 female
6       no  26   no       no    no   40         6     1   male

 

Exercise 7

Take this:

d1 = data.frame(
  ID=c(1,1,1,2,2,4,1,2), 
  medication=c(1,2,3,1,2,7,2,8)
)
d1
  ID medication
1  1          1
2  1          2
3  1          3
4  2          1
5  2          2
6  4          7
7  1          2
8  2          8

to this form:

  
   ID medications
1:  1  1, 2, 3, 2
2:  2     1, 2, 8
3:  4           7

 

Note: the solution uses neither melt() nor dcast(), so you might look at other options.

Exercise 8

Get this:

dfs <- data.frame(
  Name = c(rep("name1",3),rep("name2",2)),
  MedName = c("atenolol 25mg","aspirin 81mg","sildenafil 100mg", "atenolol 50mg","enalapril 20mg")
)
dfs
   Name          MedName
1 name1    atenolol 25mg
2 name1     aspirin 81mg
3 name1 sildenafil 100mg
4 name2    atenolol 50mg
5 name2   enalapril 20mg

 

Into the following format:

    Name  medication_1   medication_2     medication_3
1: name1 atenolol 25mg   aspirin 81mg sildenafil 100mg
2: name2 atenolol 50mg enalapril 20mg             

 

Exercise 9

Get the following data.frame organized in standard form:

df7 <- data.table(
  v1 = c("name1, name2", "name3", "name4, name5"),
  v2 = c("1, 2", "3", "4, 5"), 
  v3 = c(1, 2, 3)
)
df7
             v1   v2 v3
1: name1, name2 1, 2  1
2:        name3    3  2
3: name4, name5 4, 5  3

 

Expected output:

 
      v1 v2 v3
1: name1  1  1
2: name2  2  1
3: name3  3  2
4: name4  4  3
5: name5  5  3

 

The solution uses neither melt() nor dcast() and can be surprisingly hard.

Exercise 10

Convert this:

 
df <- data.frame(
  Method = c("10.fold.CV Lasso", "10.fold.CV.1SE", "BIC", "Modified.BIC"),
  n      = c(30, 30, 50, 50, 50, 50, 100, 100),
  lambda = c(1, 3, 1, 2, 2, 0, 1, 2),
  df     = c(21, 17, 29, 26, 25, 32, 34, 32)
)
df
            Method   n lambda df
1 10.fold.CV Lasso  30      1 21
2   10.fold.CV.1SE  30      3 17
3              BIC  50      1 29
4     Modified.BIC  50      2 26
5 10.fold.CV Lasso  50      2 25
6   10.fold.CV.1SE  50      0 32
7              BIC 100      1 34
8     Modified.BIC 100      2 32

Into:

 
            Method lambda_30 lambda_50 lambda_100 df_30 df_50 df_100
1 10.fold.CV Lasso         1         2               21    25       
2   10.fold.CV.1SE         3         0               17    32       
3              BIC                   1          1          29     34
4     Modified.BIC                   2          2          26     32  

 

(Image by Joe Alterio)




Sharpening The Knives in The data.table Toolbox: Solutions

Below are the solutions to these exercises on “Sharpening The Knives in The data.table Toolbox.”

####################
#                  #
#    Exercise 1    #
#                  #
####################
library(gapminder)
library(data.table)
gp <- gapminder
# Set as data.table
setDT(gp)

gp[, uniqueN(country)]
## [1] 142
####################
#                  #
#    Exercise 2    #
#                  #
####################
gp[, gdpPercap_l1 := shift(gdpPercap), by = country]
head(gp)
##        country continent year lifeExp      pop gdpPercap gdpPercap_l1
## 1: Afghanistan      Asia 1952  28.801  8425333  779.4453           NA
## 2: Afghanistan      Asia 1957  30.332  9240934  820.8530     779.4453
## 3: Afghanistan      Asia 1962  31.997 10267083  853.1007     820.8530
## 4: Afghanistan      Asia 1967  34.020 11537966  836.1971     853.1007
## 5: Afghanistan      Asia 1972  36.088 13079460  739.9811     836.1971
## 6: Afghanistan      Asia 1977  38.438 14880372  786.1134     739.9811
####################
#                  #
#    Exercise 3    #
#                  #
####################
gp[year == 2007, .(country, continent, growth07 = (gdpPercap / gdpPercap_l1) - 1)
   ][order(growth07), .(country = last(country), growth07 = last(growth07)), continent]
##    continent             country  growth07
## 1:      Asia            Cambodia 0.9122171
## 2:    Africa              Angola 0.7297996
## 3:  Americas Trinidad and Tobago 0.5713408
## 4:    Europe          Montenegro 0.4112585
## 5:   Oceania           Australia 0.1221208
# Alternatively you can extract the last observation with .N
gp[year == 2007, .(country, continent, growth07 = (gdpPercap / gdpPercap_l1) - 1)
   ][order(growth07), .(country = country[.N], growth07 = growth07[.N]), continent]
##    continent             country  growth07
## 1:      Asia            Cambodia 0.9122171
## 2:    Africa              Angola 0.7297996
## 3:  Americas Trinidad and Tobago 0.5713408
## 4:    Europe          Montenegro 0.4112585
## 5:   Oceania           Australia 0.1221208
####################
#                  #
#    Exercise 4    #
#                  #
####################
temp <- names(gp)
setnames(gp, "year", "anno")
temp
## [1] "country"      "continent"    "anno"         "lifeExp"     
## [5] "pop"          "gdpPercap"    "gdpPercap_l1"
address(temp)
## [1] "0000000015951F98"
address(names(gp))
## [1] "0000000015951F98"
# Both actually refer to the same object: "<-" did not copy the names, it only passed them by reference.
# Being aware of this is the price of the speed data.table gives.
# No such thing as a free lunch


####################
#                  #
#    Exercise 5    #
#                  #
####################
data(gapminder)
gp <- gapminder
setDT(gp)
temp <- copy(names(gp))
setnames(gp, "year", "anno")
temp
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
names(gp)
## [1] "country"   "continent" "anno"      "lifeExp"   "pop"       "gdpPercap"
address(temp)
## [1] "000000001BA9BC30"
address(names(gp))
## [1] "000000001BAB3A80"
# Convert factors to characters
factcols <- sapply(gp, is.factor)
factcols <- names(factcols)[factcols]
gp[, (factcols) := lapply(.SD, as.character), .SDcols = factcols]
# Actually there should be a cleaner way to do this without losing generalizability
# Please comment if you think you have the answer


####################
#                  #
#    Exercise 6    #
#                  #
####################
gA_2014 <- data.table(
  country   = c("Brazil", "Mexico", "Croatia", "Cameroon"),
  goals2014 = c(7, 4, 6, 1)
)

gA_2014[, pop_mill := gp[anno == 2007
                         ][chmatch(gA_2014$country, country), round(pop / 1e6)]]
gA_2014
##     country goals2014 pop_mill
## 1:   Brazil         7      190
## 2:   Mexico         4      109
## 3:  Croatia         6        4
## 4: Cameroon         1       18
####################
#                  #
#    Exercise 7    #
#                  #
####################
# First make sure data is ordered by country and year
gp <- gp[order(country, anno)]

# Years from first 8k
gp[, years_from8k := anno - anno[which(gdpPercap >= 8e3)[1]], country
   ][years_from8k < 0, years_from8k := NA]
head(gp)
##        country continent anno lifeExp      pop gdpPercap years_from8k
## 1: Afghanistan      Asia 1952  28.801  8425333  779.4453           NA
## 2: Afghanistan      Asia 1957  30.332  9240934  820.8530           NA
## 3: Afghanistan      Asia 1962  31.997 10267083  853.1007           NA
## 4: Afghanistan      Asia 1967  34.020 11537966  836.1971           NA
## 5: Afghanistan      Asia 1972  36.088 13079460  739.9811           NA
## 6: Afghanistan      Asia 1977  38.438 14880372  786.1134           NA
####################
#                  #
#    Exercise 8    #
#                  #
####################
gp[gdpPercap >= 8e3, obs8k_numb := rowid(country)]
# This is not the same kind of variable because countries could fall below 8k 
# again

gp[anno == 2007 & !is.na(obs8k_numb)
   ][order(obs8k_numb),
     .(country[obs8k_numb == max(obs8k_numb)], obs8k_numb[obs8k_numb == max(obs8k_numb)]),
     continent
     ]
##     continent             V1 V2
##  1:  Americas         Canada 12
##  2:  Americas  United States 12
##  3:    Africa          Gabon  9
##  4:    Africa          Libya  9
##  5:    Europe        Belgium 12
##  6:    Europe        Denmark 12
##  7:    Europe    Netherlands 12
##  8:    Europe         Norway 12
##  9:    Europe         Sweden 12
## 10:    Europe    Switzerland 12
## 11:    Europe United Kingdom 12
## 12:      Asia        Bahrain 12
## 13:      Asia         Kuwait 12
## 14:   Oceania      Australia 12
## 15:   Oceania    New Zealand 12
####################
#                  #
#    Exercise 9    #
#                  #
####################
gp[anno == 2002 & lifeExp %inrange% list(c(0,80), c(40, Inf))]
##             country continent anno lifeExp       pop  gdpPercap
## 1:        Australia   Oceania 2002  80.370  19546792 30687.7547
## 2: Hong Kong, China      Asia 2002  81.495   6762476 30209.0152
## 3:          Iceland    Europe 2002  80.500    288030 31163.2020
## 4:            Italy    Europe 2002  80.240  57926999 27968.0982
## 5:            Japan      Asia 2002  82.000 127065841 28604.5919
## 6:           Sweden    Europe 2002  80.040   8954175 29341.6309
## 7:      Switzerland    Europe 2002  80.620   7361757 34480.9577
## 8:           Zambia    Africa 2002  39.193  10595811  1071.6139
## 9:         Zimbabwe    Africa 2002  39.989  11926563   672.0386
##    years_from8k obs8k_numb
## 1:           50         11
## 2:           30          7
## 3:           45         10
## 4:           40          9
## 5:           35          8
## 6:           50         11
## 7:           50         11
## 8:           NA         NA
## 9:           NA         NA
####################
#                  #
#    Exercise 10   #
#                  #
####################
gA_2014b <- data.table(
  country   = c("Brazil", "Mexico", "Croatia", "Cameroon"),
  goals2014 = c("7-2", "4-1", "6-6", "1-9")
)

# Split "for-against" on the hyphen into two new columns, then drop the original
gA_2014b[, c("goals_for", "goals_against") := tstrsplit(goals2014, "-")
         ][, goals2014 := NULL]
gA_2014b
##     country goals_for goals_against
## 1:   Brazil         7             2
## 2:   Mexico         4             1
## 3:  Croatia         6             6
## 4: Cameroon         1             9



Sharpening the Knives in the data.table Toolbox: Exercises


If knowledge is power, then knowledge of data.table is something of a super power, at least in the realm of data manipulation in R.

In this exercise set, we will use some of the more obscure functions from the data.table package. The solutions use set(), inrange(), chmatch(), uniqueN(), tstrsplit(), rowid(), shift(), copy(), address(), setnames() and last(). You are free to use more, as long as they are part of data.table. The objective is to get (more) familiar with these functions and to be able to call on them in real life, giving us fewer reasons to leave the fast and neat data.table universe.

Solutions are available here.

PS. If you are unfamiliar with data.table, we recommend you start with the exercises covering the basics of data.table.

Exercise 1

Load the gapminder data-set from the gapminder package. Save it to an object called “gp” and convert it to a data.table. How many different countries are covered by the data?

Exercise 2

Create a lag term for GDP per capita, that is, the value of GDP per capita at the previous observation for each country (observations are 5 years apart).
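
As a rough illustration of grouped lagging (not the solution itself), here is a minimal sketch on toy data. The table `toy` and its columns `id`, `value` and `value_lag` are hypothetical names invented for this example.

```r
library(data.table)

# Toy data (hypothetical values): two groups, rows already in time order
toy <- data.table(
  id    = c("a", "a", "a", "b", "b"),
  value = c(10, 12, 15, 7, 9)
)

# shift() defaults to n = 1 and type = "lag"; grouping with `by` keeps
# the lag from leaking across groups, so each group starts with NA
toy[, value_lag := shift(value), by = id]
toy$value_lag
## [1] NA 10 12 NA  7
```

The lag follows row order, so the data should be sorted by time within each group before calling shift().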

Exercise 3

Using the data.table syntax, calculate the GDP per capita growth from 2002 to 2007 for each country. Extract the one with the highest value for each continent.

Exercise 4

Save the column names in a vector named “temp” and change the name of the year column in “gp” to “anno” (just because); then print temp. Oh my, what just happened? Check the memory address of temp and names(gp), respectively.

Exercise 5

Overwrite “gp” with the original data again. Now make a copy passed by value into temp (before you change the year to anno) so you can keep the original variable names. Check the addresses again. Also, change factors to characters and don’t forget to convert to data.table again.
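
The difference between holding a reference and holding a copy can be sketched on a toy table. The names `dt`, `by_ref` and `by_val` are hypothetical, and the exact addresses printed will differ from session to session.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6)

by_ref <- names(dt)        # may point at the same memory as dt's names
by_val <- copy(names(dt))  # an independent copy of the names vector

# setnames() renames by reference, without copying the table
setnames(dt, "a", "alpha")

by_val     # unaffected by the rename
## [1] "a" "b"
names(dt)
## [1] "alpha" "b"

# Compare memory locations; by_ref may share the address of names(dt)
address(by_val) == address(names(dt))
## [1] FALSE
```

Printing `by_ref` after the rename is the quickest way to see whether it was modified along with the table.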

Exercise 6

A data.table of the number of goals each team in group A scored in the 2014 FIFA World Cup is given below. Import this into R and add a column with the countries’ population in 2017 to the data.table, rounded to the nearest million.

gA_2014 <- data.table(
  country   = c("Brazil", "Mexico", "Croatia", "Cameroon"),  
  goals2014 = c(7, 4, 6, 1)
)
gA_2014
    country goals2014
1:   Brazil         7
2:   Mexico         4
3:  Croatia         6
4: Cameroon         1

 

Exercise 7

Calculate the number of years since the country reached $8k in GDP per capita at each relevant observation as accurately as the data allows.

Exercise 8

Add a subtly different variable using rowid(): the running count of observations with GDP per capita at or above 8k, up to and including the given observation. Which country in each continent has the most observations above 8k? If there are ties, list all of those tied at the top.
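
For orientation, a minimal sketch of how rowid() numbers rows within groups, using toy data. The names `toy`, `grp`, `val` and `hit_no` are hypothetical.

```r
library(data.table)

# rowid() returns a running counter within each distinct value
rowid(c("x", "y", "x", "x", "y"))
## [1] 1 1 2 3 2

toy <- data.table(
  grp = c("x", "y", "x", "x", "y"),
  val = c(5, 9, 2, 8, 1)
)

# Assigning only on a filtered subset leaves the other rows as NA,
# so the counter runs over the qualifying rows only
toy[val >= 5, hit_no := rowid(grp)]
toy$hit_no
## [1]  1  1 NA  2 NA
```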

Exercise 9

Use inrange() to extract countries that have their life expectancy either below 40 or above 80 in 2002.
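
A small sketch of how %inrange% treats several intervals at once, on a hypothetical numeric vector `x`. The list holds the lower bounds first and the upper bounds second, so the two intervals checked here are [0, 40] and [80, Inf).

```r
library(data.table)

x <- c(35, 50, 85)

# TRUE when a value falls in ANY of the intervals; bounds are inclusive
# by default (incbounds = TRUE)
x %inrange% list(c(0, 80), c(40, Inf))
## [1]  TRUE FALSE  TRUE

# The same test in function form
inrange(x, lower = c(0, 80), upper = c(40, Inf))
## [1]  TRUE FALSE  TRUE
```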

Exercise 10

Now suppose the soccer/football data from exercise 6 came with both goals scored and goals conceded for each team, as follows:

gA_2014b <- data.table(
  country   = c("Brazil", "Mexico", "Croatia", "Cameroon"),  
  goals2014 = c("7-2", "4-1", "6-6", "1-9")
)

How can you split the goals column into two relevant columns?
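
One possible approach, sketched on different toy data, is tstrsplit(), which splits a character column and returns the pieces as separate vectors. The names `toy`, `span`, `from` and `to` are hypothetical. Passing type.convert = TRUE is an optional extra that turns the pieces into numbers instead of leaving them as character.

```r
library(data.table)

toy <- data.table(span = c("3-10", "12-4"))

# fixed = TRUE is passed on to strsplit() so "-" is matched literally
toy[, c("from", "to") := tstrsplit(span, "-", fixed = TRUE, type.convert = TRUE)]

toy$from
## [1]  3 12
toy$to
## [1] 10  4
```

Without type.convert, the new columns stay character and would need converting before any arithmetic.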

(Image by National Museum Wales)




Programmatically Creating Text Outputs in R: Solutions

Below are the solutions to these exercises on “Programmatically Creating Text Outputs in R.”

####################
#                  #
#    Exercise 1    #
#                  #
####################
prices <- c(
  14.3409087337707, 13.0648270623048, 3.58504267621646, 18.5077076398145,
  16.8279241011882
)

sprintf("$%.2f", prices)
## [1] "$14.34" "$13.06" "$3.59"  "$18.51" "$16.83"
####################
#                  #
#    Exercise 2    #
#                  #
####################
fn <- c(25, 7, 90, 16)
sprintf("file_%03d.txt", fn)
## [1] "file_025.txt" "file_007.txt" "file_090.txt" "file_016.txt"
####################
#                  #
#    Exercise 3    #
#                  #
####################
fn <- c(25, 7, 90, 16)
# "*" takes the field width from the next argument: the digit count of the largest number
sprintf("file_%0*d.txt", nchar(max(fn)), fn)
## [1] "file_25.txt" "file_07.txt" "file_90.txt" "file_16.txt"
####################
#                  #
#    Exercise 4    #
#                  #
####################
poeml <- c("Stay the patient course.", "Of little worth is your ire.", "The network is down.")
nmax <- max(nchar(poeml))
cat(sprintf("%*s", nmax, poeml), sep = "\n")
##     Stay the patient course.
## Of little worth is your ire.
##         The network is down.
####################
#                  #
#    Exercise 5    #
#                  #
####################

tohex <- function(x) {
  # "1$" reuses the first argument: once as decimal (%d), once as hexadecimal (%x)
  sprintf("%1$d is %1$x in hexadecimal", x)
}

tohex(12)
## [1] "12 is c in hexadecimal"
####################
#                  #
#    Exercise 6    #
#                  #
####################
title <- "A great poem"
sprintf("<h1>%s</h1>", title)
## [1] "<h1>A great poem</h1>"
# shiny::h1(title)


####################
#                  #
#    Exercise 7    #
#                  #
####################
library(magrittr)
poeml  %>%
  sprintf("<li>%s</li>", .) %>%
  paste(., collapse = " ") %>%
  sprintf("<ul>%s</ul>", .)
## [1] "<ul><li>Stay the patient course.</li> <li>Of little worth is your ire.</li> <li>The network is down.</li></ul>"
####################
#                  #
#    Exercise 8    #
#                  #
####################
text_list <- function(x) {
  n <- length(x)
  if (n <= 1) {
    return(x)
  }
  paste(paste(x[-n], collapse = ", "), "and", x[n])
}
films <- c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "The Dark Knight", "12 Angry Men", "Schindler's List")

sprintf("The top ranked films on imdb.com are %s", text_list(films))
## [1] "The top ranked films on imdb.com are The Shawshank Redemption, The Godfather, The Godfather: Part II, The Dark Knight, 12 Angry Men and Schindler's List"
####################
#                  #
#    Exercise 9    #
#                  #
####################
perc <- function(x, dp) {
  sprintf("%.*f%%", dp, x*100)
}
input <- 0.921313
perc(input, 2)
## [1] "92.13%"
####################
#                  #
#    Exercise 10   #
#                  #
####################

perc2 <- function(x, dp) {
  p <- sprintf("%.*f%%", dp, x*100)
  if(any(nchar(p) > 10)) stop("Too long percentage")
  sprintf("%10s", p)
}

set.seed(1)
cat(perc2(rnorm(10), 1), sep="\n")
##     -62.6%
##      18.4%
##     -83.6%
##     159.5%
##      33.0%
##     -82.0%
##      48.7%
##      73.8%
##      57.6%
##     -30.5%
perc2(999, 4)
## Error in perc2(999, 4): Too long percentage



Programmatically Creating Text Outputs in R: Exercises

In the age of R Markdown and Shiny, or when making any custom output from your data, you want your output to look consistent and neat. Often you also want it to follow a specific (decorative) format defined by the HTML or LaTeX engine. These exercises are an opportunity to refresh our memory of functions such as paste, sprintf, formatC and others that are convenient tools for these ends. All of the solutions rely partly on the ultra-flexible sprintf(), but there are no doubt many ways to solve the exercises with other functions. Feel free to share your solutions in the comment section.
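
As a quick refresher, here is a minimal sketch of the three functions side by side. The values and the name `x` are made up for illustration and are not taken from the exercises.

```r
x <- c(3.14159, 42)

# sprintf(): C-style format strings, vectorised over its arguments
sprintf("%.1f", x)
## [1] "3.1"  "42.0"

# formatC(): similar control through named arguments
formatC(x, digits = 1, format = "f")
## [1] "3.1"  "42.0"
formatC(7, width = 3, flag = "0")
## [1] "007"

# paste0(): plain concatenation, no numeric formatting
paste0("id_", 1:2)
## [1] "id_1" "id_2"
```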

Example solutions are available here.

Exercise 1

Print out the following vector as prices in dollars (to the nearest cent):
c(14.3409087337707, 13.0648270623048, 3.58504267621646, 18.5077076398145,
16.8279241011882)
Example: $14.34

Exercise 2

Using these numbers, c(25, 7, 90, 16), make a vector of filenames in the following format: file_025.txt. Left-pad the numbers with zeros so they are all three digits.

Exercise 3

Actually, if we are only dealing with numbers less than one hundred, file_25.txt would have been enough. Change the code from the last exercise so that the padding is programmatically decided by the biggest number in the vector.
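
One way to make the width dynamic is the `*` placeholder in sprintf(), which reads the field width from the next argument. A minimal sketch on a hypothetical vector `ids`:

```r
ids <- c(5, 1234, 42)

# Number of digits in the largest value
width <- nchar(max(ids))
width
## [1] 4

# "%0*d": zero-pad to the width supplied as the next argument
sprintf("%0*d", width, ids)
## [1] "0005" "1234" "0042"
```

This relies on nchar() seeing the number in plain decimal notation. For large round numbers R may print scientific notation instead, so it is worth checking the computed width in that case.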

Exercise 4

Print out the following haiku on three lines, right aligned, with the help of cat: c("Stay the patient course.", "Of little worth is your ire.", "The network is down.").

Exercise 5

Write a function that converts a number to its hexadecimal representation. This is a useful skill when converting bmp colors from one representation to another. Example output:
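
For context, a short sketch of the conversions involved, using arbitrary example values. The colour components are hypothetical and only serve to show the padding.

```r
# Decimal to hexadecimal, lower and upper case
sprintf("%x", 255L)
## [1] "ff"
sprintf("%02X", 10L)
## [1] "0A"

# Hexadecimal back to decimal
strtoi("ff", base = 16L)
## [1] 255

# Three colour components (red, green, blue) as one hex colour string
sprintf("#%02X%02X%02X", 255L, 128L, 0L)
## [1] "#FF8000"
```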

      tohex(12)
      [1] "12 is c in hexadecimal"

Exercise 6

Take a string and programmatically surround it with the HTML header tag h1.

Exercise 7

Back to the poem from exercise 4: have R convert it to an HTML unordered list so that it would appear like the following in a browser:

  • Stay the patient course
  • Of little worth is your ire
  • The network is down

Exercise 8

Here is a list of the current top six movies on imdb.com in terms of rating: c("The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "The Dark Knight", "12 Angry Men", "Schindler's List"). Convert them into a comma-separated list that reads naturally in running text.

Example output:

[1] "The top ranked films on imdb.com are The Shawshank Redemption, The Godfather, The Godfather: Part II, The Dark Knight, 12 Angry Men and Schindler's List"

Exercise 9

Now, you should be able to solve this quickly: write a function that converts a proportion to a percentage and takes the number of decimal places as an input. An input of 0.921313 and 2 decimal places should return "92.13%".

Exercise 10

Improve the function from the last exercise so that the percentage consistently takes up 10 characters, by left-padding it with spaces. Raise an error if the percentage is already longer than 10 characters.
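
The padding and the length check can be sketched independently of the percentage logic. The helper `pad_left` below is a hypothetical name, and it uses formatC() as one alternative to sprintf("%10s", ...).

```r
pad_left <- function(s, width = 10) {
  # Refuse input that would not fit in the requested width
  if (any(nchar(s) > width)) {
    stop("input longer than ", width, " characters")
  }
  # formatC() right-justifies character input, padding with spaces on the left
  formatC(s, width = width)
}

pad_left("92.1%")
## [1] "     92.1%"

# Input that is too long raises an error, caught here for demonstration
res <- tryCatch(pad_left("12345678901"), error = function(e) "error raised")
res
## [1] "error raised"
```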

(Image by Daniel Friedman).