Data science enhances people’s decision making. Doctors and researchers make critical decisions every day, so it is important for them to have some basic knowledge of data science. This series aims to help people who work in or around the medical field enhance their data science skills.

We will work with a health-related database, the famous “Pima Indians Diabetes Database”. It was generously donated by Vincent Sigillito from Johns Hopkins University. Please find further information regarding the dataset here.

This is the eighth part of the series, and it aims to partially cover the subject of inferential statistics.

Researchers rarely have the capability to test many patients or to try a new treatment on many patients; therefore, making inferences from a sample is a necessary skill. This is where inferential statistics comes into play.

In more detail, in this part we will go through hypothesis tests for the normality of distributions (Shapiro–Wilk test, Anderson–Darling test) and for the existence of outliers (Grubbs’ test for outliers). We will also cover the case where the normality assumption doesn’t hold and how to deal with it (<a href="https://en.wikipedia.org/wiki/Rank_test">rank tests</a>). Finally, we will do a brief recap of the previous exercises on inferential statistics.

Before proceeding, it might be helpful to look over the help pages for `hist`, `qqnorm`, `qqline`, `shapiro.test`, `ad.test`, `grubbs.test`, and `wilcox.test`.

Moreover, please install and load the following libraries.

`install.packages("ggplot2")`

`library(ggplot2)`

`install.packages("nortest")`

`library(nortest)`

`install.packages("outliers")`

`library(outliers)`

Please run the code below in order to load the data set and transform it into a proper data frame format:

`url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"`

`data <- read.table(url, fileEncoding="UTF-8", sep=",")`

`names <- c('preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class')`

`colnames(data) <- names`

`data <- data[data$mass != 0, ]`
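The last line of the chunk drops the rows where `mass` is recorded as 0. As a minimal sketch on a toy data frame (the values and the names `toy`, `toy_clean`, and `no_zero` are hypothetical, not part of the exercise), the same filter can be written with a logical index, which also behaves sensibly when no row matches:

```r
# Toy data frame with hypothetical values; the 0 in `mass` stands for a missing measurement.
toy <- data.frame(mass = c(33.6, 0, 26.6, 23.3), age = c(50, 31, 32, 21))

# Keep only the rows with a non-zero mass.
toy_clean <- toy[toy$mass != 0, ]

# If no row has mass == 0, the logical index keeps every row,
# whereas the -which(...) form would return zero rows.
no_zero <- data.frame(mass = c(33.6, 26.6), age = c(50, 32))
nrow(no_zero[no_zero$mass != 0, ])          # all rows kept
nrow(no_zero[-which(no_zero$mass == 0), ])  # no rows kept
```

On the real data set both forms should give the same result as long as at least one zero is present; the logical index is simply the more defensive choice.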

Moreover, run the chunk below to generate the samples that we will test in this set of exercises.

`f_1 <- rnorm(28,29,3)`

`f_2 <- rnorm(23,29,6)`
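Because `rnorm` draws random numbers, the two samples differ on every run. If you want reproducible results, one option is to fix the random seed first. The sketch below assumes an arbitrary seed value of 123, which is not part of the original exercise, and spells out the `mean` and `sd` arguments for readability:

```r
# The seed value is arbitrary; any fixed integer makes the draws repeatable.
set.seed(123)

# rnorm(n, mean, sd): 28 draws with mean 29 and sd 3, then 23 draws with mean 29 and sd 6.
f_1 <- rnorm(28, mean = 29, sd = 3)
f_2 <- rnorm(23, mean = 29, sd = 6)

length(f_1)
length(f_2)
```

With a fixed seed your numbers will still not match the solutions page unless the same seed was used there.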

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1

Plot a histogram of the variable `pres`.

Exercise 2

Plot the QQ-plot with a QQ-line for the variable `pres`.

Exercise 3

Apply a Shapiro-Wilk normality test to the variable `pres`.

Exercise 4

Apply an Anderson-Darling normality test to the variable `pres`.

Exercise 5

What percentage of the variables (columns) in `data` pass a normality test?

This might be a bit challenging; consider using the apply function.
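As a hedged illustration of the hint, the sketch below runs a test over every column of a toy data frame with `sapply`, a member of the apply family. The data frame `toy`, its columns, the seed, and the 5% threshold are all assumptions made for the example; it is not the solution for `data`.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Toy data frame (hypothetical): one roughly normal column, one strongly skewed column.
toy <- data.frame(
  a = rnorm(100, mean = 50, sd = 10),
  b = rexp(100, rate = 1)
)

# Apply shapiro.test to every column and collect the p-values.
p_values <- sapply(toy, function(column) shapiro.test(column)$p.value)

# Percentage of columns for which normality is not rejected at the chosen level.
alpha <- 0.05
100 * mean(p_values > alpha)
```

Here "passing" is taken to mean a p-value above `alpha`; a different threshold or a different normality test would follow the same pattern.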

Exercise 6

Construct a boxplot of `pres` and see whether there are outliers or not.

Exercise 7

Apply a Grubbs’ test to `pres` to see whether the variable contains outlier values.

Exercise 8

Apply a two-sided Grubbs’ test to `pres` to see whether the variable contains outlier values.

Exercise 9

Suppose we test a new diet on a sample of 14 people from the candidates (take a random sample from the data set), and after the diet the average mass is 29 with a standard deviation of 4 (generate 14 normally distributed values with these properties). Apply a Wilcoxon signed-rank test to the `mass` variable before and after the diet.

Exercise 10

Check whether the positive and negative candidates have the same distribution for the `pres` variable. To do so, apply a Wilcoxon rank-sum test to the `pres` variable with respect to the `class.fac` variable.
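Note that the setup code above does not create `class.fac`. Presumably it is the `class` column converted to a factor; that is an assumption, as are the level labels used here. A minimal sketch on a toy data frame with hypothetical values:

```r
# Toy stand-in for the real data frame (hypothetical values).
toy <- data.frame(pres = c(72, 66, 64, 40, 74), class = c(1, 0, 1, 0, 1))

# Assumption: 0 means a negative and 1 a positive diabetes test.
toy$class.fac <- factor(toy$class, levels = c(0, 1),
                        labels = c("Negative", "Positive"))

# Count the candidates in each group.
table(toy$class.fac)
```

The same `factor()` call applied to `data$class` should give a grouping variable that can be used in the exercise.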
