`lm` and `glm` functions to perform several generalized linear models on one dataset.
Since this is a basic set of exercises, we will take a closer look at the arguments of these functions and at how to use the output of each function to find a model that fits our data.

Before starting this set of exercises, I strongly suggest you look at the R documentation of `lm` and `glm`.

Note: This set of exercises assumes that you have a basic understanding of generalized linear models.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

The dataset we will be using contains information about passengers of the Titanic, including whether or not they survived.

To obtain the data, run these lines of code.

if (!'titanic' %in% installed.packages()) install.packages('titanic')

library(titanic)

DATA <- titanic_train[,-c(1,4,9,11)]

**Exercise 1**

**Linear regression**

1. Use `DATA` to create a linear model using the function `lm` with the variables `Age` and `Fare` as independent variables and `Survived` as the dependent one. Save the regression in an object called `lm_reg`.

2. Use the function `glm` to perform the same task and save the regression in an object called `glm_reg`.
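As a rough sketch of the two calls, here is the same idea on R's built-in `mtcars` data rather than the Titanic data (the variables `mpg`, `wt` and `hp` are stand-ins chosen only for illustration). With its default `gaussian` family, `glm` should reproduce the `lm` coefficients.

```r
# Illustrative only: mtcars stands in for DATA
lm_fit  <- lm(mpg ~ wt + hp, data = mtcars)
glm_fit <- glm(mpg ~ wt + hp, data = mtcars)  # family = gaussian by default

coef(lm_fit)
coef(glm_fit)
all.equal(coef(lm_fit), coef(glm_fit))        # same estimates from both functions
```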

**Exercise 2**

If you print any of the previous objects, you will realize that there is not much information about the performance of the models. Fortunately, `summary` is a great function for finding out more about any statistical model you fit to a dataset. Depending on the model, `summary` will produce different outputs.

- Apply `summary` to `lm_reg` and to `glm_reg`. You will find a slight difference between the two outputs; that is because `glm` is more flexible than `lm`.

**Exercise 3**

So far we have been assuming (incorrectly) that the dependent variable (`Survived`) follows a normal distribution, and that is why we have been performing a linear regression. Clearly, `Survived` follows a binomial distribution: there are only two options, either the passenger survived (1) or the passenger died (0). Since the data has a binomial distribution, we should perform a logistic regression. Use the function `glm` to perform a logistic regression using `Age` and `Fare` as independent variables, and save it in an object called `bin_model`. Hint: Define the value of the argument `family` properly.
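For orientation, a logistic regression on a different binary outcome might look like the sketch below. It uses the built-in `mtcars` data, where `am` is coded 0/1; the choice of `hp` and `wt` as predictors is only an assumption for the example.

```r
# Illustrative only: am (0/1) from mtcars stands in for a binary outcome
logit_fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

summary(logit_fit)$coefficients   # estimates on the log-odds scale
family(logit_fit)$family          # confirms which family was used
```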

**Exercise 4**

Inside the `family` argument you can always specify a particular link. If you don't, a default link will be used depending on the family you chose.

1. To find out the default link associated with a certain family, you can write the family name followed by parentheses (e.g. `gaussian()`). Find the default link associated with the binomial family.

2. Create a probit model with the same variables used in `bin_model` and save it in an object called `bin_probit_model`.
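The sketch below shows how a family object exposes its link and how a non-default link can be requested. The families and the `cloglog` link are chosen only for illustration, and `mtcars` again stands in for the exercise data.

```r
gaussian()         # prints the family together with its default link
gaussian()$link    # "identity"
poisson()$link     # "log"

# A non-default link is requested through the link argument of the family
cloglog_fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "cloglog"))
family(cloglog_fit)$link
```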

**Exercise 5**

Finding the right model requires comparing different models and choosing the best one. Although there are many performance measures, for now we will use the `AIC` as our measure (a smaller AIC is better). By this measure `bin_model` is better than `bin_probit_model`, so let's keep working with `bin_model`.

Until now, an intercept has been part of the models. Create a logistic regression with the same variables but with no intercept.
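As a hedged illustration of both ideas (comparing models by AIC and dropping the intercept), the following sketch uses `mtcars` with `vs` as a stand-in binary outcome. The object names are hypothetical.

```r
logit_fit  <- glm(vs ~ mpg, data = mtcars, family = binomial)
probit_fit <- glm(vs ~ mpg, data = mtcars, family = binomial(link = "probit"))
AIC(logit_fit, probit_fit)   # one row per model; lower is better

# "- 1" (or "+ 0") in the formula drops the intercept
no_int_fit <- glm(vs ~ mpg - 1, data = mtcars, family = binomial)
names(coef(no_int_fit))      # only the slope remains
```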

**Exercise 6**

**Impute data**. If you apply the `summary` function to any of the previous models, you will find that 177 observations have been deleted due to missingness. This happens because the `glm` function uses `na.action = "na.omit"` by default. This makes it easier to run a model on messy data, but that is not always desirable. You want to have full control over, and a full understanding of, what the function is doing.

1. There are some missing values in `Age`; replace these values with the median.

2. Update the `glm_model` with the updated data and specify `na.action='na.fail'`. This ensures that the dataset has no missing values; otherwise the function will show an error.
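A minimal sketch of median imputation followed by a strict `na.action`, using the built-in `airquality` data as a stand-in (it also contains missing values). The column choices are assumptions made for the example.

```r
# Illustrative only: airquality stands in for DATA
aq <- airquality
colSums(is.na(aq))                      # which columns have missing values

# Replace missing values with the median of the observed ones
aq$Ozone[is.na(aq$Ozone)]     <- median(aq$Ozone, na.rm = TRUE)
aq$Solar.R[is.na(aq$Solar.R)] <- median(aq$Solar.R, na.rm = TRUE)

# na.fail stops with an error if any NA is left in the model variables
fit <- glm(Ozone ~ Solar.R + Wind, data = aq, na.action = na.fail)
nobs(fit) == nrow(aq)                   # no rows were silently dropped
```

Running the same `glm` call on the original `airquality` data would stop with an error, which is exactly the safety check `na.fail` provides.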

**Exercise 7**

**Add polynomial independent variables**. Some variables have a quadratic relationship with the dependent variable. This can be handled by adding a quadratic term to the formula of the model.

Add a quadratic term for the variable `Fare` to the current model, specified in `glm_model`.
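One common way to write a quadratic term in a formula is with `I()`. The sketch below uses the built-in `cars` data purely as an example.

```r
# I() protects ^ so that it means "square" rather than a formula operator
quad_fit <- lm(dist ~ speed + I(speed^2), data = cars)
names(coef(quad_fit))   # intercept, linear term and squared term
```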

**Exercise 8**

**Add categorical variables.** Add `Sex` as an independent variable to the current model specified in `glm_model`. Note that `Sex` is not a numeric variable.
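To see how R treats a categorical predictor, consider this small sketch on `mtcars`, where `cyl` is wrapped in `factor()` for illustration. Character and factor columns are expanded into dummy variables automatically.

```r
# cyl is stored as a number; factor() makes R treat it as categorical
cat_fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
names(coef(cat_fit))   # one dummy coefficient per non-reference level
```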

**Exercise 9**

Now that we have found a good model that fits our data, it is time to use the `predict` function to find out how well the model predicts on our own data. Use the function `predict` to obtain the predictions of the model for `DATA` and save them in `Pred.default`.

**Exercise 10**

`Pred.default` shows the predicted values under the link transformation, in this case logit. This is not easily interpretable. To fix this problem, we can specify the `type` of prediction we want.

- Obtain the predictions as probability values.

- Extra: What is the percentage accuracy of this model if we assign died (0) when the predicted probability is less than 0.5 and survived (1) otherwise?
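The difference between the two prediction scales, and one way to turn probabilities into an accuracy figure, can be sketched as follows. The model on `mtcars` is a stand-in, not the Titanic model.

```r
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)

pred_link <- predict(fit, newdata = mtcars)                     # log-odds scale
pred_prob <- predict(fit, newdata = mtcars, type = "response")  # probabilities
all.equal(plogis(pred_link), pred_prob)   # inverse logit links the two scales

# Classify with a 0.5 cut-off and compare with the observed outcome
pred_class <- ifelse(pred_prob < 0.5, 0, 1)
mean(pred_class == mtcars$vs) * 100       # percentage of correct predictions
```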


`MASS` will be used in this set.
Note: We are going to use random number functions and random process functions in R such as `runif`. A problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible, you can specify the value of the seed using `set.seed(‘any number’)` before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises, we will use `set.seed(1)`. Don’t forget to specify it before every exercise that includes random numbers.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

**Generating dice rolls** Using the functions `runif` and `round`, simulate the results of 100 dice rolls.

**Exercise 2**

Let’s assume that we want to simulate a game in which you throw an unfair coin (success probability is 0.48) 10 times; you win $10 every time the result is tails and lose $10 when the result is heads. Simulate this game 1000 times using `rbinom`, and find the expected amount of money you will gain or lose in this game using the simulated values.
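One possible way to set up such a simulation is sketched below. It assumes that "success" means tails, which the text does not state explicitly; if success means heads, the sign of the result flips.

```r
set.seed(1)
n_games <- 1000

# Number of tails in each game of 10 throws (assumption: success = tails)
tails <- rbinom(n_games, size = 10, prob = 0.48)

# Win $10 per tail, lose $10 per head
winnings <- 10 * tails - 10 * (10 - tails)
mean(winnings)   # simulated expected gain or loss per game
```

Under this assumption the theoretical expectation is 10 × (2 × 4.8 − 10) = −4 dollars per game, so the simulated mean should land near that value.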

**Exercise 3**

Simulate an experiment of throwing one die 30 times using the function `rmultinom`, and find out how many 6’s are in the simulated sample.

**Exercise 4**

Obtain a vector that shows how many 1’s, 2’s, ..., 6’s were obtained in the previous simulation.

**Exercise 5**

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Use `rnorm` to simulate the height of 1000 people and save it in an object called `heights`.

a) Plot the density of the simulated values.

b) Generate 10000 values with the same parameters and plot the respective density function on top of the previous plot in red to differentiate it.

This plot will show you how closely a sample of 10000 simulated values approximates the real normal distribution.

**Exercise 6**

Find the 90% interval of a population with mean = 1.70 and standard deviation = .1

**Exercise 7**

Simulate 100000 people with height (cm) and weight (kg) using the function `mvrnorm` with `mu = c(170, 60)` and `Sigma = matrix(c(10,17,17,100), nrow = 2)`, and save it in an object called `population`.

Apply the function `summary` to `population` to get an idea of the values created.

**Exercise 8**

**Plotting bivariate distribution**. Use the function `kde2d` to generate a two-dimensional kernel density of the matrix `population` and plot the values using `persp`.
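The general workflow (simulate bivariate normal data, estimate a 2-D density, draw it as a surface) might look like this sketch. The mean vector, covariance matrix, sample size, grid size and viewing angles are illustrative assumptions, not the values from the exercise.

```r
library(MASS)   # provides mvrnorm and kde2d

set.seed(1)
# Illustrative parameters, not the ones from the exercise
sim <- mvrnorm(n = 2000, mu = c(0, 0),
               Sigma = matrix(c(1, 0.6, 0.6, 1), nrow = 2))

dens <- kde2d(sim[, 1], sim[, 2], n = 40)   # list with x, y and a 40 x 40 z
persp(dens$x, dens$y, dens$z, theta = 30, phi = 25,
      xlab = "x", ylab = "y", zlab = "density")
```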

**Exercise 9**

**Simulating with a Bayesian approach**. Unlike the frequentist approach, Bayesian statistics assumes that the parameters of a distribution are random variables with their own distributions. Let’s simulate a Poisson variable.

a) Simulate a gamma variable with shape = 20 and scale = 0.5.

b) Using the previous value as the rate, simulate a Poisson random variable.

**Exercise 10**

Simulating a single value is not enough if you want to know the properties of a certain distribution. Repeat the previous simulation, but create 100 Poisson variables and plot the distribution.
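The two-stage idea (draw a rate from a gamma distribution, then draw a count from a Poisson with that rate) can be sketched as below. Drawing a separate rate for each of the 100 counts is one reasonable reading of the exercise, not the only one.

```r
set.seed(1)
n_draws <- 100

# Step 1: draw one rate per observation from the gamma distribution
lambda <- rgamma(n_draws, shape = 20, scale = 0.5)
# Step 2: draw each count from a Poisson with its own rate (rpois is vectorised)
counts <- rpois(n_draws, lambda)

barplot(table(counts), xlab = "count", ylab = "frequency")
```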


`apply`, check the R documentation.
Note: We are going to use random number functions and random process functions in R such as `runif`. A problem with these functions is that every time you run them, you will obtain a different value. To make your results reproducible you can specify the value of the seed using `set.seed(‘any number’)` before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises, we will use `set.seed(1)`. Don’t forget to specify it before every exercise that includes random numbers.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

**Generating dice rolls** Set your seed to 1 and generate 30 random numbers between 0 and 6 using `runif`. Save them in an object called `random_numbers`. Then use the `ceiling` function to round the values up. These values represent the results of rolling a die.

**Exercise 2**

Simulate one die roll using the function `rmultinom`. Make sure `n = 1` is inside the function, and save it in an object called `die_result`. The matrix `die_result` contains a single one and five zeros, with the one indicating which value was obtained during the process. Use the function `which` to create an output that shows only the value obtained after the die is rolled.

**Exercise 3**

Using `rmultinom`, simulate 30 dice rolls. Save the result in a variable called `dice_result` and use `apply` to transform the matrix into a vector with the result of each die.
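To make the shape of the `rmultinom` output concrete, here is a hedged sketch of one way to go from the indicator matrix to a vector of faces. The object names are hypothetical.

```r
set.seed(1)
# Each column is one roll: a single 1 marks the face that came up
rolls <- rmultinom(n = 30, size = 1, prob = rep(1/6, 6))
dim(rolls)                                   # 6 rows (faces) x 30 columns (rolls)

# Apply over columns (MARGIN = 2) to find the row holding the 1
faces <- apply(rolls, 2, function(col) which(col == 1))
table(faces)
```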

**Exercise 4**

Some gambling games use 2 dice and sum their values after they are rolled. Use `rmultinom` to simulate throwing 2 dice 30 times, and use the function `apply` to record the sum of the values of each throw.

**Exercise 5**

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using `rnorm`, simulate the height of 100 people and save it in an object called `heights`.

To get an idea of the values of heights, use the function `summary`.

**Exercise 6**

90% of the population is shorter than ____________?

**Exercise 7**

What percentage of the population is taller than 1.60 m?

**Exercise 8**

Run the following lines of code before this exercise. They will load a library required for the exercise.

`if (!'MASS' %in% installed.packages()) install.packages('MASS')`

library(MASS)

Simulate 1000 people with height and weight using the function `mvrnorm` with `mu = c(1.70, 60)` and `Sigma = matrix(c(.1,3.1,3.1,100), nrow = 2)`.

**Exercise 9**

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

**Exercise 10**

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?


Note: We are going to use random number functions and random process functions in R such as `runif`. A problem with these functions is that every time you run them you will obtain a different value. To make your results reproducible you can specify the value of the seed using `set.seed(‘any number’)` before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random numbers.) For this set of exercises we will use `set.seed(1)`. Don’t forget to specify it before every random exercise.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

**Generating random numbers.** Set your seed to 1, generate 10 random numbers using `runif` and save them in an object called `random_numbers`.

**Exercise 2**

Using the function `ifelse` and the object `random_numbers`, simulate coin tosses. Hint: If `random_numbers` is bigger than .5 then the result is heads, otherwise it is tails.

Another way of generating random coin tosses is by using the `rbinom` function. Set the seed again to 1 and simulate 10 coin tosses with this function. Note: The value you will obtain is the total number of heads of those 10 coin tosses.
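The two approaches can be compared side by side in a short sketch. The object names are hypothetical, and the last line shows an alternative call that returns each toss separately instead of a total.

```r
set.seed(1)
u <- runif(10)
tosses <- ifelse(u > 0.5, "heads", "tails")   # vectorised if/else
table(tosses)

set.seed(1)
n_heads   <- rbinom(1, size = 10, prob = 0.5)   # one number: heads out of 10 tosses
each_toss <- rbinom(10, size = 1, prob = 0.5)   # ten numbers: each toss as 0 or 1
```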

**Exercise 3**

Use the function `rbinom` to generate 10 unfair coin tosses with a success probability of 0.3. Set the seed to 1.

**Exercise 4**

We can simulate rolling a die in R with `runif`. Save in an object called `die_roll` 1 random number with `min = 0` and `max = 6`. This means that we will generate a random number between 0 and 6.

Apply the function `ceiling` to `die_roll` to turn it into a whole number between 1 and 6. Don’t forget to set the seed to 1 before calling `runif`.

**Exercise 5**

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using `rnorm`, simulate the height of 100 people and save it in an object called `heights`.

To get an idea of the values of heights, apply the function `summary` to it.

**Exercise 6**

a) What’s the probability that a person will be shorter than or equal to 1.90 m? Use `pnorm`.

b) What’s the probability that a person will be taller than or equal to 1.60 m? Use `pnorm`.
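As a reminder of how `pnorm` handles lower and upper tails, here is a small sketch with cut-offs (1.80 and 1.50) chosen only for illustration.

```r
mu    <- 1.70
sigma <- 0.1

p_below <- pnorm(1.80, mean = mu, sd = sigma)                       # P(X <= 1.80)
p_above <- pnorm(1.50, mean = mu, sd = sigma, lower.tail = FALSE)   # P(X > 1.50)
1 - pnorm(1.50, mean = mu, sd = sigma)                              # same upper-tail value
```

For a continuous distribution, P(X > a) and P(X >= a) are equal, so either form of the question gives the same number.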

**Exercise 7**

The waiting time (in minutes) at a doctor’s clinic follows an exponential distribution with a rate parameter of 1/50. Use the function `rexp` to simulate the waiting time of 30 people at the doctor’s office.

**Exercise 8**

What’s the probability that a person will wait less than 10 minutes? Use `pexp`.

**Exercise 9**

What’s the average waiting time?

**Exercise 10**

Let’s assume that patients with a waiting time longer than 60 minutes leave. Out of 100 patients that arrive at the clinic, how many are expected to leave? Use `pexp`.
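The exponential functions follow the same pattern as the normal ones. The sketch below uses an illustrative rate of 1/20 and illustrative cut-offs, not the values from the exercises.

```r
rate <- 1 / 20   # illustrative rate, not the one from the exercise

p_short <- pexp(5, rate = rate)                        # P(wait < 5)
p_long  <- pexp(30, rate = rate, lower.tail = FALSE)   # P(wait > 30)
1 / rate                                               # theoretical mean of the distribution

set.seed(1)
mean(rexp(10000, rate = rate))                         # simulated mean, close to 1 / rate
```

Multiplying a tail probability such as `p_long` by the number of arrivals gives the expected number of people beyond that cut-off.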

`tm`, which includes functions designed for this task. There are many applications of text mining. A pretty popular one is the ability to associate a text with its author; this was how J.K. Rowling (the Harry Potter author) was caught publishing a new novel series under an alias. Before proceeding, it might be helpful to look over the help pages for `nchar`, `tolower`, `toupper`, `grep`, `sub` and `strsplit`. Also take a look at the library `stringr` and the functions it includes, such as `str_sub`.
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
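To get a feel for the base R string functions mentioned above, here is a small sketch on a made-up sentence (the text `txt_example` is hypothetical and unrelated to the corpus used in the exercises).

```r
txt_example <- "Arma virumque cano, Troiae qui primus ab oris; arma, arma!"

no_punct <- gsub("[[:punct:]]", "", txt_example)             # drop punctuation
words    <- strsplit(tolower(no_punct), "[[:space:]]+")[[1]] # split into words
freq     <- sort(table(words), decreasing = TRUE)            # word frequencies

length(words)                                   # number of words
names(freq)[1]                                  # most frequent word
n_letters <- nchar(gsub("[^[:alpha:]]", "", no_punct))       # letters only
```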

Before starting the set of exercises, run the following lines of code:

`if (!'tm' %in% installed.packages()) install.packages('tm')`

library(tm)

txt = system.file("texts", "txt", package = "tm")

ovid = VCorpus(DirSource(txt, encoding = "UTF-8"),

readerControl = list(language = "lat"))

TEXT = lapply(ovid[1:5], as.character)

OVID = c(data.frame(text=unlist(TEXT), stringsAsFactors = FALSE))

TEXT1 = TEXT[[4]]

**Exercise 1**

Delete all the punctuation marks from TEXT1

**Exercise 2**

How many letters does TEXT1 contain?

**Exercise 3**

How many words does TEXT1 contain?

**Exercise 4**

What is the most common word in TEXT1?

**Exercise 5**

Get an object that contains all the words with at least one capital letter (Make sure the object contains each word only once)

**Exercise 6**

Which are the 5 most common letters in the object `OVID`?

**Exercise 7**

Which letters of the alphabet are not in the object `OVID`?

**Exercise 8**

In the `OVID` object, there is a character from the popular sitcom ‘FRIENDS’. Who is he/she? There were six main characters (Chandler, Phoebe, Ross, Monica, Joey, Rachel).

**Exercise 9**

Find the line where this character is mentioned.

**Exercise 10**

How many words end with a vowel, and how many with a consonant?


This set of exercises will help you to learn and test your skills with basic arithmetic operations and logical functions. Before proceeding, it might be helpful to look over the help pages for `**`, `%/%`, `%%`, and the logical operators such as `!=`, `==`, `>=` and `isTRUE`.

Answers to the exercises are available here.

**Exercise 1**

**Basic Operations** There are two main types of interest, simple and compound. To start, let’s create 4 variables: S = 100 (initial investment), i1 = .02 (annual simple interest), i2 = .015 (annual compound interest), n = 2 (years that the investment will last).

**Simple Interest** Define a variable called `simple` equal to S × (1 + i1 × n).

**Compound Interest** Define a variable called `compound` equal to S × (1 + i2)^n.
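The two formulas translate almost literally into R. The sketch below uses hypothetical values (a principal of 200, a rate of 5% and 3 years) rather than the ones given in the exercise.

```r
principal <- 200    # hypothetical values, not the ones from the exercise
rate      <- 0.05
years     <- 3

simple_total   <- principal * (1 + rate * years)   # interest on the principal only
compound_total <- principal * (1 + rate)^years     # interest on the accumulated balance
c(simple = simple_total, compound = compound_total)
```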

**Exercise 2**

It’s natural to ask which type of interest gives more money after 2 years (n = 2) for these values. Using the logical operators `<`, `>`, `==`, check which variable is bigger, `simple` or `compound`.

**Exercise 3**

Using the logical operators `<`, `>`, `==`, `|`, `&`, find out if `simple` or `compound` is equal to 120.

Using the logical operators `<`, `>`, `==`, `|`, `&`, find out if `simple` and `compound` are both equal to 120.

**Exercise 4**

Formulas can deal with vectors, so let’s define a vector and use it in one of the formulas we defined earlier. Let’s define S as a vector with the following values: 100, 96. Remember that `c()` is the function that allows us to define vectors.

Apply the simple interest formula to S and store the resulting vector in `simple`.

**Exercise 5**

Using the logical operators `<`, `>`, `==`, check if any of the `simple` values is smaller than or equal to `compound`.

**Exercise 6**

Using the operator `%/%`, find out how many $20 candies you can buy with the money stored in `compound`.

**Exercise 7**

Using the operator `%%`, find out how much money is left after buying the candies.
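Integer division and the remainder operator work as a pair, as this small sketch with hypothetical numbers shows.

```r
budget <- 107   # hypothetical amount of money
price  <- 20

budget %/% price   # 5: how many whole items fit in the budget
budget %% price    # 7: what is left over
```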

**Exercise 8**

Let’s create two new variables, one defined as `rational=1/3` and the other as `decimal=0.33`. Using the logical operator `!=`, verify whether these two values are different.

**Exercise 9**

There are other functions that can help us compare two variables.

Use the logical operator `==` to verify whether `rational` and `decimal` are the same.

Use the logical function `isTRUE` to verify whether `rational` and `decimal` are the same.

Use the logical function `identical` to verify whether `rational` and `decimal` are the same.
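These comparison tools behave differently with floating-point numbers. A classic illustration, using values unrelated to the exercise, is sketched below; `all.equal` is added here as a tolerance-based alternative.

```r
x <- 0.1 + 0.2
y <- 0.3

x == y                      # FALSE: the two doubles differ in the last bits
identical(x, y)             # FALSE: identical() is just as strict
isTRUE(all.equal(x, y))     # TRUE: all.equal() compares within a tolerance
```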

**Exercise 10**

Using the logical functions of the previous exercise, find the approximation that R uses for 1/3. Hint: It is not the value that R prints when you define 1/3.

`plot`, `points`, `abline`, `title`, `legend`, `par` (including all the arguments), `mfrow` and `layout`.

For this set of exercises you will use the dataset called `cars`, an R dataset that contains two variables: distance and speed. To load the dataset, run the following line of code: `data(cars)`.

Answers to the exercises are available here.

**Exercise 1**

a) Load the `cars` dataset and create a scatterplot of the data.

b) Using the argument `lab` of the function `plot`, create a new scatterplot where the tick marks of the x and y axes mark every integer.

**Exercise 2**

The previous plot didn’t show all the numbers associated with the new tick marks, so we are going to fix that. Recreate the plot from the previous exercise and use the argument `cex.axis` to control the size of the numbers associated with the axis tick marks so that they are small enough to be visible.

**Exercise 3**

In the previous plot the numbers associated with the y-axis tick marks aren’t easy to read. Recreate the plot from the last exercise and use the argument `las` to change the orientation of the labels from vertical to horizontal.

**Exercise 4**

Suppose you want to add two new observations to the previous plot, but you want to identify them on the graph. Using the `points` function, add the new observations to the last plot using red to identify them. The values of the new observations are speed = 23, 26 and dist = 60, 61.

**Exercise 5**

As you can see, the previous plot doesn’t show one of the new observations because it is outside the x-axis range.

a) Create the plot for the old observations again, with an x-axis range that includes all the values from 4 to 26.

b) Add the two new observations using the `points` function.

**Exercise 6**

After running a linear regression on the original data, you find out that a = -17.5 and b = 3.93. Using the function `lines`, add the regression line to the plot using blue and a dashed line.

**Exercise 7**

Using the functions `title` and `expression`, add the following title: “Regression: β₀ = -17.5, β₁ = 3.93”.

**Exercise 8**

Add to the previous plot a legend on the top left corner that shows which color is assigned to old observations and which one to new ones.
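Several of the plotting ideas from this set can be combined in a single sketch. Note that it swaps in `abline` on a fitted model instead of `lines`, and that the title text, symbols and legend labels are illustrative choices rather than the expected solution.

```r
fit <- lm(dist ~ speed, data = cars)
new_speed <- c(23, 26)
new_dist  <- c(60, 61)

plot(cars, xlim = c(4, 26), las = 1, cex.axis = 0.8)   # widened x range
points(new_speed, new_dist, col = "red", pch = 19)      # extra observations
abline(fit, col = "blue", lty = 2)                      # dashed fitted line
title(main = expression(paste("Regression: ", beta[0], " and ", beta[1])))
legend("topleft", legend = c("old", "new"),
       col = c("black", "red"), pch = c(1, 19))
round(coef(fit), 2)                                     # fitted intercept and slope
```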

**Exercise 9**

This exercise will test your skills in creating more than one plot in the same layout. Using the function `par` and its argument `mfrow`, create two histograms in the same layout, one for each column of the `cars` data.

**Exercise 10**

Using the function `layout`, print 3 plots in the same layout: on the left side a scatterplot of `cars`, on the top right the histogram of the column speed of the data `cars`, and on the bottom right a histogram of the column distance.
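The matrix passed to `layout` decides which plot goes where: each cell holds the number of the plot that occupies it. The sketch below uses a different arrangement (one wide plot on top, two below) and the `mtcars` data, purely as an illustration.

```r
# Layout matrix: plot 1 fills the top row, plots 2 and 3 share the bottom row
m <- matrix(c(1, 1, 2, 3), nrow = 2, byrow = TRUE)
m

layout(m)
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
hist(mtcars$wt, main = "wt")
hist(mtcars$mpg, main = "mpg")
layout(1)   # go back to a single-plot layout
```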

`length`, `range`, `median`, `IQR`, `hist`, `quantile`, `boxplot`, and `stem` functions.
For this set of exercises you will use a dataset called `islands`, an R dataset that contains the areas of the world’s major landmasses expressed in thousands of square miles. To load the dataset, run the following instruction: `data(islands)`.

Answers to the exercises are available here.

**Exercise 1**

Load the `islands` dataset and obtain the total number of observations.

**Exercise 2**

Measures of central tendency. Obtain the following statistics of `islands`:

a) Mean

b) Median

**Exercise 3**

Using the function `range`, obtain the following values:

a) Size of the biggest island

b) Size of the smallest island

**Exercise 4**

Measures of dispersion. Find the following values for `islands`:

a) Standard deviation

b) The range of the island sizes using the function `range`.

**Exercise 5**

Quantiles. Using the function `quantile`, obtain a vector including the following quantiles:

a) 0%, 25%, 50%, 75%, 100%

b) .5%, 95%

**Exercise 6**

Interquartile range. Find the interquartile range of islands.

**Exercise 7**

Create a histogram of `islands` with the following properties.

a) Showing the frequency of each group

b) Showing the proportion of each group

**Exercise 8**

Create box plots with the following conditions

a) Including outliers

b) Without outliers

**Exercise 9**

Using the function `boxplot`, find the outliers of `islands`. Hint: Notice that the `boxplot` function not only creates a plot, but also returns some useful information about the data.
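The relevant arguments of `hist` and `boxplot` can be explored on another built-in dataset. The sketch below uses `rivers` as a stand-in for `islands`.

```r
# Illustrative only: the built-in rivers data stands in for islands
hist(rivers)                  # counts per bin
hist(rivers, freq = FALSE)    # densities instead of counts

boxplot(rivers)                    # outliers drawn as points
boxplot(rivers, outline = FALSE)   # outliers hidden

# boxplot() also returns its statistics invisibly; plot = FALSE skips drawing
bp <- boxplot(rivers, plot = FALSE)
bp$stats   # whisker ends, hinges and median
bp$out     # values flagged as outliers
```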

**Exercise 10**

Create a stem and leaf plot of islands

`diag`, `t`, `eigen`, and `crossprod` functions. If you want further documentation, also consider chapter 5.7 of “An Introduction to R”.
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment.

**Exercise 1**

Consider `A=matrix(c(2,0,1,3), ncol=2)` and `B=matrix(c(5,2,4,-1), ncol=2)`.

a) Find **A** + **B**

b) Find **A** – **B**

**Exercise 2**

Scalar multiplication. Find the solution for a**A**, where `a=3` and **A** is the same as in the previous question.

**Exercise 3**

Using the `diag` function, build a diagonal matrix of size 4 with the following values in the diagonal: 4, 1, 2, 3.

**Exercise 4**

Find the solution for **Ab**, where **A** is the same as in Exercise 1 and `b=c(7,4)`.

**Exercise 5**

Find the solution for **AB**, where **B** is the same as in Exercise 1.

**Exercise 6**

Find the transpose matrix of **A**.

**Exercise 7**

Find the inverse matrix of **A**.

**Exercise 8**

Find the value of x in **Ax** = **b**.

**Exercise 9**

Using the function `eigen`, find the eigenvalues of **A**.

**Exercise 10**

Find the eigenvalues and eigenvectors of **A’A**. **Hint**: Use `crossprod` to compute **A’A**.
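The matrix functions used in this set can be tried out on a small example first. The matrix `M` and vector `v` below are hypothetical and are not the ones from the exercises.

```r
# Hypothetical matrix and vector, not the ones from the exercises
M <- matrix(c(4, 1, 2, 3), ncol = 2)   # filled column by column
v <- c(1, 2)

M %*% v            # matrix-vector product (note: * would be element-wise)
t(M)               # transpose
solve(M)           # inverse
x <- solve(M, v)   # solves M x = v without forming the inverse
eigen(M)$values    # 5 and 2
crossprod(M)       # same as t(M) %*% M
```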