This is the third part of a series surrounding survival analysis. For part 1, click here. For part 2, click here.

This particular post covers the final stage of fitting a Cox proportional hazards model: residual checking and model validation.

Solutions can be found for these exercises here.

**Exercise 1**

Load the survival and survminer libraries. Build our previously derived final Cox model.

**Exercise 2**

Calculate Martingale residuals and build a dataframe of these residuals with a second index column.
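A minimal sketch of Exercises 1–3, assuming the final model derived in the earlier parts of this series was `Surv(time, status) ~ sex + ph.karno + wt.loss` on the lung data:

```r
library(survival)

# Assumed final model from the earlier parts of this series
fit <- coxph(Surv(time, status) ~ sex + ph.karno + wt.loss, data = lung)

# Martingale residuals, paired with an index column for plotting
mart     <- residuals(fit, type = "martingale")
resid_df <- data.frame(index = seq_along(mart), martingale = mart)

# Exercise 3: a basic residual plot
plot(resid_df$index, resid_df$martingale)
```

The same `residuals()` call with `type = "deviance"` covers Exercise 4.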

**Exercise 3**

Plot your Martingale residuals now.

**Exercise 4**

In a similar way to above, calculate deviance residuals, incorporate them into your residual data frame, and plot.

**Exercise 5**

Calculate the linear predictors of your final model and plot against your deviance residuals.

You may need to use ggrepel here to handle label positioning.

**Exercise 6**

Build a new dataframe containing dfbeta values for each variable within your final model.

**Exercise 7**

Using a for loop, plot each of these dfbeta residuals.

**Exercise 8**

With graphical residual plots created, formally test that the PH assumption has been met for your final model.
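The standard formal test here is `cox.zph`, which tests the proportional hazards assumption via scaled Schoenfeld residuals; a small p-value suggests a violation. A sketch, again assuming the final model from earlier in the series:

```r
library(survival)

fit <- coxph(Surv(time, status) ~ sex + ph.karno + wt.loss, data = lung)

# Per-term tests of the PH assumption plus a global test
zph <- cox.zph(fit)
print(zph)

# plot(zph)  # Exercise 9: graphical confirmation of the test
```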

**Exercise 9**

Plot this now to graphically confirm your test.

**Exercise 10**

With all residuals and model checking complete, visualise your final survival curve generated by your model.


This tutorial concerns itself with MLE calculations and bootstrapping.

Answers to the exercises are available here.

**Exercise 1**

Set a seed to 123 and create the following dataframe:

lifespans = data.frame(index = 1:200, lifespans = rgamma(200, shape = 2, rate = 1.2)*50)

It is thought that these bacterial lifespan data may follow a gamma distribution. Confirm this assumption with an exploratory plot.

**Exercise 2**

Using an appropriate function from the MASS library, evaluate the log-likelihood for this dataset.

**Exercise 3**

From this evaluation, also obtain MLEs for the shape and rate parameters.

**Exercise 4**

Obtain the standard error of the shape parameter.

**Exercise 5**

Use a Wald test to calculate a test statistic, determining whether the shape MLE differs from 2 at the 5% level.

**Exercise 6**

Generate an exact p-value for this test.
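A sketch of Exercises 2–6 using `MASS::fitdistr`, which returns the MLEs, their standard errors, and the maximised log-likelihood in one object (it may emit harmless warnings while optimizing):

```r
library(MASS)

set.seed(123)
lifespans <- data.frame(index = 1:200,
                        lifespans = rgamma(200, shape = 2, rate = 1.2) * 50)

# Fit a gamma distribution by maximum likelihood
mle <- fitdistr(lifespans$lifespans, "gamma")
mle$estimate   # shape and rate MLEs
mle$sd         # standard errors
logLik(mle)    # maximised log-likelihood

# Wald test of H0: shape = 2
z <- (mle$estimate["shape"] - 2) / mle$sd["shape"]
p <- 2 * pnorm(-abs(z))   # exact two-sided p-value
```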

**Exercise 7**

The target parameter of our bootstrap is the variance. From the new dataset below, calculate the sample variance.

durations = c(35.8, 33.4, 34.9, 17.9, 35.6, 10.3, 14.9, 28.3, 39.2, 25.4, 23.4, 7.1, 38.9, 9.2, 8.1)

**Exercise 8**

Create a single bootstrapped sample and calculate the variance from this.

**Exercise 9**

Turn your solution from Exercise 8 into a for loop, generating 100 bootstrapped samples and test statistics.

**Exercise 10**

Calculate a 95% confidence interval for your bootstrapped test statistic.
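Exercises 7–10 can be sketched in a few lines of base R, resampling with replacement and taking a percentile interval:

```r
durations <- c(35.8, 33.4, 34.9, 17.9, 35.6, 10.3, 14.9, 28.3, 39.2,
               25.4, 23.4, 7.1, 38.9, 9.2, 8.1)
var(durations)   # the observed test statistic

set.seed(123)
boot_vars <- numeric(100)
for (i in 1:100) {
  resample     <- sample(durations, replace = TRUE)  # one bootstrap sample
  boot_vars[i] <- var(resample)
}

# Percentile 95% confidence interval for the variance
quantile(boot_vars, c(0.025, 0.975))
```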


The skill in conducting this sort of work is being able to identify an appropriate distribution on which to model the question and test accordingly. Conveniently, within R, the syntax for conducting the test and drawing distributional samples is very uniform. The aim of this tutorial is to expose users less familiar with the statistical theory to some of the more common distributions and their applications.

Answers to the exercises are available here.

**Exercise 1**

Set a seed to 123 and plot a histogram of 1000 draws from a normal distribution with mean 10, standard deviation 2.

**Exercise 2**

Using a QQ plot, assess the normality of your previously simulated draws.

**Exercise 3**

Using a t-test, test for a difference in means between your normal draws and a second sample: 1000 draws from Student’s t-distribution with 10 degrees of freedom and a non-centrality parameter (delta) of 9. Report your p-value and its significance at the 5% level.

**Exercise 4**

Rewrite your t-test, testing now if your normal samples have a greater mean than your samples from the Student’s t-distribution. Report on your new p-value.
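A sketch of Exercises 1–4, taking the delta value to be the `ncp` argument of `rt`:

```r
set.seed(123)
x <- rnorm(1000, mean = 10, sd = 2)
y <- rt(1000, df = 10, ncp = 9)   # noncentral t draws; delta enters as ncp

t.test(x, y)                           # two-sided test of equal means
t.test(x, y, alternative = "greater")  # Exercise 4: one-sided version
```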

**Exercise 5**

Putting these skills together now, calculate a two-sided t-test of equal means from two normal distributions. The first of mean 1, standard deviation 0.5, the second of mean 0.9, a standard deviation of 1. Hint: A function may become useful here.

**Exercise 6**

Replicate this t-test 1000 times and, using a QQ plot, test whether the resulting p-values follow a standard uniform distribution.

**Exercise 7**

Moving onto other distributions now. The probability of rolling a score greater than 3 on a loaded die is 0.75. What is the probability of rolling exactly 5 scores greater than 3 from 10 rolls of the die?

**Exercise 8**

Building on this, what is the probability of rolling a score greater than 3, less than 5 times, from 10 rolls?
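Exercises 7 and 8 reduce to binomial probabilities with size 10 and success probability 0.75:

```r
# P(exactly 5 scores greater than 3 in 10 rolls)
dbinom(5, size = 10, prob = 0.75)   # ~0.0584

# P(fewer than 5 scores greater than 3 in 10 rolls), i.e. P(X <= 4)
pbinom(4, size = 10, prob = 0.75)   # ~0.0197
```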

**Exercise 9**

On average, a cashier serves 50 people per hour in their shop. What is the probability that they serve 60 people or more during one hour?

**Exercise 10**

For the same cashier, each transaction takes 50 seconds on average. What is the probability of a transaction being completed in less than 30 seconds?
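Exercises 9 and 10 map onto the Poisson and exponential distributions respectively:

```r
# Exercise 9: Poisson counts with rate 50 per hour; P(60 or more customers)
1 - ppois(59, lambda = 50)

# Exercise 10: service time, exponential with mean 50 seconds
pexp(30, rate = 1/50)   # P(transaction < 30 s) = 1 - exp(-0.6) ~ 0.451
```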

A linear model is an explanation of how a continuous response variable behaves, dependent on a set of covariates or explanatory variables. Whilst often insufficient for complex problems, linear models exercise fundamental skills, such as variable selection and diagnostic examination, making them a worthwhile introduction to statistical regression techniques.

In this tutorial, we’ll be creating a couple of linear models and comparing the performance of them on the Boston Housing dataset. This tutorial will require caret and mlbench to be installed and you may find ggplot2 and dplyr useful too, though these are not essential.

Solutions to these exercises can be found here.

**Exercise 1**

Load the Boston Housing dataset from the mlbench library and inspect the different types of variables present.

**Exercise 2**

Explore and visualize the distribution of our target variable.

**Exercise 3**

Explore and visualize any potential correlations between medv and the variables crim, rm, age, rad, tax and lstat.

**Exercise 4**

Set a seed of 123 and split your data into a train and test set using a 75/25 split. You may find the caret library helpful here.

**Exercise 5**

We have seen that crim, rm, tax, and lstat could be good predictors of medv. To get the ball rolling, let us fit a linear model for these terms.

**Exercise 6**

Obtain an r-squared value for your model and examine the diagnostic plots found by plotting your linear model.
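A sketch of Exercises 4–6: split the data with caret's `createDataPartition`, then fit and inspect the model (package and variable names follow the exercise text; the exact split is a choice):

```r
library(caret)
library(mlbench)
data(BostonHousing)

set.seed(123)
idx   <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[idx, ]
test  <- BostonHousing[-idx, ]

# Exercise 5: linear model on the candidate predictors
fit <- lm(medv ~ crim + rm + tax + lstat, data = train)

# Exercise 6: r-squared and the four standard diagnostic plots
summary(fit)$r.squared
par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, QQ, scale-location, residuals vs leverage
```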

**Exercise 7**

We can see a few problems with our model immediately: observations such as 381 exhibit high leverage, the QQ plot is poor in the tails, and the r-squared value is relatively low.

Let us try another model, this time transforming medv due to the positive skewness it exhibited.

**Exercise 8**

Examine the diagnostics for the model. What do you conclude? Is this an improvement on the first model?

One assumption of a linear model is that the mean of the residuals is zero. You could try and test this.

**Exercise 9**

Create a data frame of your predicted values and the original values.

**Exercise 10**

Plot this to visualize the performance of your model.

The second part of this series focuses on more complex and insightful methods through the semi-parametric Cox Proportional Hazards model. Through a Cox Proportional Hazards model, it is possible to model covariates in a semi-parametric fashion. The advantage of this modeling strategy is that it makes modeling the survival times possible, without knowing or specifying the underlying distribution.

Solutions to these exercises can be found here. It should be noted that these solutions are quite verbose with the intention of breaking down each stage in the modeling process. In real life, you will probably want to use lapply and functions to speed up the variable selection stage.

**Exercise 1**

Before any modeling can commence, let us just test a few variables to get a feel for their effects on survival times. Create survival objects for sex, ph.karno, and wt.loss. Hint: You’ll need to group wt.loss.

**Exercise 2**

Plot these using survminer, looking for differences between the groups’ survival curves.

**Exercise 3**

To check that these three variables adhere to the proportional hazards assumption, we must plot a log-cumulative hazard plot for each of them. Within each plot, we are looking for “roughly” parallel lines between each group. If this criterion is met, then the variables adhere to the PH assumption and can then be modeled through a Cox PH model.

**Exercise 4**

With no huge differences in parallelism, we can begin model building. First, fit a model with just time and status. This will be known as the null model.

**Exercise 5**

Calculate the -2*log-likelihood for the null model. This will be our baseline comparison value when fitting terms.

**Exercise 6**

Now fit a model with sex, ph.karno, and wt.loss individually and obtain -2*log-likelihood values for these models.

**Exercise 7**

Conduct a Chi-Squared test now on the change in -2*log-likelihood values for each of the single term models, ensuring you have the correct degrees of freedom specified.
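As an illustrative sketch on the lung data, using sex as the single term: `coxph` stores the null and fitted log-likelihoods together, and the change in -2*log-likelihood is referred to a chi-squared distribution with degrees of freedom equal to the number of added terms.

```r
library(survival)

sex_fit <- coxph(Surv(time, status) ~ sex, data = lung)

# loglik[1] is the null model, loglik[2] the fitted model
d_change <- -2 * (sex_fit$loglik[1] - sex_fit$loglik[2])

# One added term, so one degree of freedom
p_sex <- pchisq(d_change, df = 1, lower.tail = FALSE)
```

The same pattern repeats for ph.karno and wt.loss.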

**Exercise 8**

Every term being significant indicates that each is needed in the model, as each introduces a significant amount of information. The next step is to fit them in the presence of each other. To do this, fit a saturated model and remove each term in turn, testing for a significant change in the -2*log-likelihood value.

**Exercise 9**

With every term being significant, one last thing to check is for any interactions. Feel free to try a few and test for a significant change in deviance compared to the saturated model.

**Exercise 10**

Write out your final Cox Proportional Hazards model.

Through an input layer, one or more hidden layers, and an output layer, a neural network works by connecting up a series of neurons with weights assigned to each connection. As each connection is activated, a calculation is performed on the connection before passing through an activation function at each level of the hidden layers. Commonly, these activation functions are the ReLU, sigmoid, or tanh. Their purpose is usually to determine whether the next layer should be activated.

Within this tutorial, we’re going to develop a very simple classification neural network on the commonly used iris dataset. Before starting, you should install the neuralnet, ggplot2, dplyr, and reshape2 libraries. Solutions to these exercises are available here.

**Exercise 1**

Load the neuralnet, ggplot2, and dplyr libraries, along with the iris dataset. Set the seed to 123.

**Exercise 2**

Explore the distributions of each feature present in the iris dataset. Feel free to get creative here. I myself opted for a violin plot.

**Exercise 3**

Convert your observation class, Species, into one-hot vectors.

**Exercise 4**

Write a generic function to standardize a column of data.

**Exercise 5**

Now standardize the predictors. Hint: lapply will be useful here.
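One possible helper for Exercises 4–5 (min-max scaling shown here; a z-score version built on `scale()` would work equally well):

```r
# Rescale a numeric column to the [0, 1] range
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

# Apply it to the four numeric predictors of iris
iris_scaled <- as.data.frame(lapply(iris[, 1:4], scale01))
```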

**Exercise 6**

Combine your one hot labels and standardized predictors.

**Exercise 7**

Define your formula that your neuralnet will be run on. You’ll need to use the as.formula function here.

**Exercise 8**

Create a neural network object, now using the tanh function and two hidden layers of size 16 and 12. You will also need to tell the neural network that you’re performing a classification algorithm here, not regression. You’ll want to refer to the neuralnet documentation as to how to define this.
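A condensed sketch of Exercises 3–8, repeating the earlier preparation steps so the snippet stands alone (the column names and min-max scaling are choices, not the only valid ones):

```r
library(neuralnet)

set.seed(123)

# One-hot encode the labels and scale the predictors
labels <- as.data.frame(model.matrix(~ Species - 1, data = iris))
names(labels) <- c("setosa", "versicolor", "virginica")
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
preds <- as.data.frame(lapply(iris[, 1:4], scale01))
train <- cbind(labels, preds)

# Exercise 7: build the formula with as.formula
f <- as.formula(paste("setosa + versicolor + virginica ~",
                      paste(names(preds), collapse = " + ")))

# Exercise 8: two hidden layers of 16 and 12, tanh activation;
# linear.output = FALSE tells neuralnet this is classification
nn <- neuralnet(f, data = train,
                hidden = c(16, 12),
                act.fct = "tanh",
                linear.output = FALSE)
```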

**Exercise 9**

Plot your neural network.

**Exercise 10**

Using the compute function and your neural network object’s net.result attribute, calculate the overall accuracy of your neural network.


Using R’s survival library, it is possible to conduct very in-depth survival analyses with a huge amount of flexibility and scope. The only downside to conducting this analysis in R is that the graphics can look very basic, which, whilst fine for a journal article, does not lend itself too well to presentations and posters. Step in the brilliant survminer package, which combines the excellent analytical scope of R with the beautiful graphics of ggplot2.

In this tutorial, we will use both the survival library and the survminer library to produce Kaplan-Meier plots and analyze log-rank tests. Solutions are available here.

**Exercise 1**

Load the lung data set from the survival library and re-factor the status column as a factor.

**Exercise 2**

Calculate the percentage of censored observations.

**Exercise 3**

Create a basic survival object exploring the occurrence of events.

**Exercise 4**

Print this object and plot it to graphically investigate this.

**Exercise 5**

Now install and load the survminer library and plot your survival object using a GGPlot graphic.

**Exercise 6**

Create a new survival object, stratifying the survival times now by gender.

**Exercise 7**

Plot this new sex-stratified survival object and comment on your observations.

**Exercise 8**

Form a set of hypotheses to formally test survival times between males and females.

**Exercise 9**

Compute a log-rank test and report on the p-value calculated, in terms of the previously formed hypothesis.
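The log-rank test is a one-liner with `survdiff`; it reports the chi-squared statistic and p-value for the hypotheses above:

```r
library(survival)

# Log-rank test of equal survival between the sexes
lr <- survdiff(Surv(time, status) ~ sex, data = lung)
lr
```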

**Exercise 10**

Investigate how the standard of a patient’s daily life affects their survival time. How many patients scored a 3, and what could be done with those individuals?

Following on from last time, this tutorial will focus on more advanced graph techniques and existing algorithms such as Dijkstra’s algorithm that can be used to draw real meaning from graphs.

This is part 2 in the series of iGraph tutorials, for part 1, click here.

When completing these tutorials, be sure to read up on the algorithms you are implementing; Wikipedia is usually a good place to start. In practice, when the graphs are much larger in size, complete solutions to these problems are not possible and instead, we seek the closest approximation.

Solutions are available here.

**Exercise 1**

Copy the following code into your R script to generate data for a bipartite graph.

matches <- data.frame(name = rep(c("Jerry", "Lilly", "Karl", "Jenny"), each = 4),
                      subject = rep(c("Maths", "English", "Biology", "French"), 4),
                      weight = c(81, 78, 24, 58, 76, 60, 62, 83, 35, 59, 50, 56, 72, 90, 88, 86))

g <- graph_from_data_frame(matches, directed = FALSE)

From this, assign each set of nodes a unique type, 1 or 2.
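igraph can infer the two sides of a bipartite graph directly with `bipartite_mapping`; an explicit assignment from the data frame columns would also work. The graph construction is repeated so the snippet stands alone:

```r
library(igraph)

matches <- data.frame(name = rep(c("Jerry", "Lilly", "Karl", "Jenny"), each = 4),
                      subject = rep(c("Maths", "English", "Biology", "French"), 4),
                      weight = c(81, 78, 24, 58, 76, 60, 62, 83, 35, 59, 50, 56,
                                 72, 90, 88, 86))
g <- graph_from_data_frame(matches, directed = FALSE)

# FALSE/TRUE marks the two node sets (students vs subjects)
V(g)$type <- bipartite_mapping(g)$type
```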

**Exercise 2**

Now plot your graph, clearly showing the two sides to the bipartite graph.

**Exercise 3**

Determine the minimum colouring needed so that no two adjacent vertices share the same colour.

**Exercise 4**

Algorithmically, find the best matching between student and subject based upon their respective mark in each subject.

**Exercise 5**

Add three additional nodes to your graph for Becky, Ben and Mark. Connect these up to Maths, French and Biology respectively.

**Exercise 6**

Calculate the spectrum of the adjacency matrix for your new graph.

**Exercise 7**

Plot your eigenvalues and visually determine the rank of your adjacency matrix.

**Exercise 8**

Find the size of the largest independent vertex set. What does this number mean?

**Exercise 9**

Calculate the k such that the graph is k-vertex-connected.

**Exercise 10**

Calculate the k such that the graph is k-edge-connected.

Graph Theory, or network analysis as it is often called, is the mathematical portrayal of a series of edges and vertices. To contextually picture a network, think of each node being an individual on Facebook, and an edge being present between two individuals indicating the two are friends on Facebook. Through computational advancements, in-depth analysis is now possible on large social networks, with applications appearing in fields such as biology, sociology, and ecology.

The uses of graph theory are endless. Within the subject domain sit many types of graphs, from connected to disconnected graphs, trees, and cyclic graphs. Before working through these exercises, it may be useful to quickly familiarize yourself with some basic graph types here if you are not already mindful of them.

The aim of this introductory tutorial is to familiarize the reader with the iGraph package by creating some simple graphs, classifying types of graphs, and reporting some simple statistics from example graphs. It is important to understand the methods and functions being called from the iGraph package – the built-in documentation is extensive and well maintained.

Answers to the exercises are available here.

**Exercise 1**

Load the iGraph library and create an undirected star graph with 5 nodes and plot the graph.

**Exercise 2**

Add edges between the following node pairs:


- 8 and 5
- 6 and 3
- 6 and 10
- 5 and 2
- 4 and 8
- 2 and 7
- 6 and 5
- 7 and 9

**Exercise 3**

Re-plot the graph, this time with a circular layout, ordered by vertex number. Hint: use the layout_in_circle function.

**Exercise 4**

Computationally verify that the graph is connected.

**Exercise 5**

Is the graph bipartite? If not, explain why.

**Exercise 6**

Find the graph’s average degree. Do not use iGraph’s built-in function.
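The handshake lemma is the trick here: each edge contributes to the degree of two endpoints, so the average degree is 2|E|/|V|, with no call to `degree()` needed. A sketch, illustrated on a small ring graph rather than the exercise graph:

```r
library(igraph)

# Toy example: a ring of 10 vertices has 10 edges
g <- make_ring(10)

# Average degree via the handshake lemma: 2 * |E| / |V|
avg_degree <- 2 * ecount(g) / vcount(g)   # 2 for a ring
```

Apply the same formula to the star graph built in the earlier exercises.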

**Exercise 7**

Calculate the graph’s diameter.

**Exercise 8**

Find the size of each clique in the graph. Hint: Use the sapply function.

**Exercise 9**

Make a new plot of the graph, this time with the node size relative to the node’s closeness, multiplied by 500.

**Exercise 10**

Color the nodes of the graph: even nodes blue, odd nodes red. Hint: Using an if-else statement will make this more concise.