Today we will focus on generating random numbers from some of the built-in distributions in R as well as using the sample() function to obtain random samples from a given population.
In the next posts, we will use some of these functions to create our own data sets to practice, and we will build upon some of the concepts discussed here.
So… what are we waiting for?
For this set of exercises, we will make use of the following functions: runif(), rnorm(), rbinom(), rpois(), sample(), and set.seed().
In stark contrast to my ex-GF, R is always willing to tell you exactly what it wants, so don’t be afraid of using ‘?’ (e.g. ?sample) to get the documentation for each function.
Give the exercises a go, and compare your solutions with mine. Please feel free to comment if you have different ideas of how to solve them; I am eager to hear what you come up with!
Quick! Come up with a random number! And another one! And another one!…And one more! And so on until you reach 100 different random numbers! Uff that doesn’t sound so easy, right? And after you have gone through all that trouble of generating a list of 100 numbers, are you sure they are actually random? Difficult to tell! Human brains are incredibly terrible when it comes to dealing with randomness. But don’t worry! R has your back!
Let’s warm up first. Have R generate a vector containing 100 completely random numbers from 1 to 10.
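A minimal sketch of one possible solution, using runif() (other approaches work too):

```r
# runif(n, min, max) draws n numbers from a uniform distribution on [min, max]
x <- runif(100, min = 1, max = 10)

length(x)  # 100 values
range(x)   # all of them between 1 and 10
```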
Easy peasy, right? But let’s kick it up a notch… suppose you don’t want so many floating-point numbers in your life. Try to generate a vector of random integers.
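Two common ways to sketch this one; both draw integers between 1 and 10, and the choice between them is a matter of taste:

```r
# Option 1: sample the integers 1 to 10 directly (replace = TRUE allows repeats)
x <- sample(1:10, size = 100, replace = TRUE)

# Option 2: draw uniform numbers and truncate them down to integers
y <- floor(runif(100, min = 1, max = 11))
```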
We will discuss the following concepts in more depth in future sets, particularly when we talk about summary measures; but for now, it suffices to say that variables (i.e. the values obtained from a sample) can be broadly classified into categorical or numeric. Categorical variables can be further divided into three different subtypes, one of which is binary, such as yes/no, alive/dead, or heads/tails when flipping a coin.
One excellent example of a random experiment that gives a binary outcome (e.g. success/failure) is a Bernoulli trial, which can be simulated in R using the rbinom() function.
Create a vector with the outcomes of 20 Bernoulli trials that simulate the toss of a fair coin (i.e. p = 0.50 for heads and p = 0.50 for tails). Just for the sake of immersion, we will consider 0s as failures and 1s as successes.
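One way to sketch it: a Bernoulli trial is just a binomial draw of size 1, so rbinom() does all the work.

```r
# rbinom(n, size, prob): 20 draws, each a single coin toss (size = 1),
# with probability 0.5 of success (1 = success, 0 = failure)
tosses <- rbinom(20, size = 1, prob = 0.5)
```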
Excellent! I bow to your skills. Now let’s kick it up a notch… can you figure out how to create a similar simulation using the sample() function?
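A sketch of the sample() version: draw from the two outcomes directly, with replacement so the coin can land the same way twice.

```r
# sample() picks from the outcomes c(0, 1); prob is optional here,
# since sample() assumes equal probabilities by default
tosses <- sample(c(0, 1), size = 20, replace = TRUE, prob = c(0.5, 0.5))
```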
But perhaps a binary existence of 0s and 1s is not for you. You want the excitement of numeric variables, but you don’t wish to deal with those pesky floating numbers. The fabulous world of discrete numeric variables is for you then!
Discrete numeric variables, also known as counts, are basically integers that represent the number of events during a determined interval (e.g. time or space). Typical examples of counts in biostatistics include the number of births, the number of deaths, the number of individuals, etc. Just remember, the events represented by this type of numeric variable are always integers (i.e. you cannot have a quarter or half death… unless you are Schwarzenegger in Terminator!)
Let’s generate random values from a Poisson distribution (pronounced like “pwasö”) to simulate the occurrence of births over the next 20 years in the (fictitious) country of “Los Cocos”, which has an annual mean birth rate of 100.
Pro-tip: the lambda argument refers to the average rate of events per interval.
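Following the pro-tip, a one-line sketch with rpois():

```r
# rpois(n, lambda): 20 yearly birth counts, each with an average rate of 100
births <- rpois(20, lambda = 100)
```

Note that every value returned is a non-negative integer, which is exactly what a count should be.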
Excellent! You have the eternal (fictitious) gratitude of the noble (fictitious) country of “Los Cocos”.
So now you are feeling a little bit more daring, and you have decided to “man-up” (or “woman up”, hey! I am no sexist!) and deal with all those seemingly infinite (well in theory they actually are) decimal numbers. Well, my dear fellow, a continuous, normal distribution may just be right for you!
Let’s create a vector containing 100 random values of weight measurements with a mean of 70kg and a standard deviation of 15 kg.
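A minimal sketch using rnorm(), which takes the mean and standard deviation directly:

```r
# rnorm(n, mean, sd): 100 weights drawn from a normal distribution
weights <- rnorm(100, mean = 70, sd = 15)
```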
Sweet! Keep up that good work!
There are several other built-in distributions in R…and so many more you can get from CRAN, but these are the “work-horses” of basic stats, as we are going to see in the upcoming sets. Keep them fresh in your mind, because they are going to make a comeback, just like Alf…in pog form.
Now, let’s simulate a sample of 10 throws of a die. (For those who have been living in a cave: a die has 6 possible, mutually exclusive outcomes, ranging from 1 to 6.)
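One possible sketch: sample() with replacement, since each throw is independent of the previous ones.

```r
# 10 throws of a fair die: each face has the same probability by default
rolls <- sample(1:6, size = 10, replace = TRUE)
```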
Brilliant job! Let’s kick it up a notch… let’s create an “unfair” die that has a 0.34 probability of resulting in a 6, and a probability of 0.132 for each of the other outcomes (so that the probabilities sum to 1).
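A sketch using the prob argument of sample(). I assume the five other faces share the remaining probability equally, i.e. (1 − 0.34)/5 = 0.132 each; note that sample() rescales the prob weights to sum to 1 anyway, so only their relative sizes matter.

```r
# prob assigns a weight to each face, in the order of the first argument:
# 0.132 for faces 1 to 5, and 0.34 for face 6
loaded_rolls <- sample(1:6, size = 10, replace = TRUE,
                       prob = c(rep(0.132, 5), 0.34))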
Kind of cool, huh? If only we could bring with us our recently created R dice to the casino…oh well…
Randomised controlled trials are a type of medical experiment where the eligible participants are randomly assigned (allocated) to one of two (or more) arms of the study. A randomised controlled trial is considered the gold standard of clinical trials. In these studies, randomisation helps to control for confounding factors and to distribute prognostic factors evenly across groups.
Let’s (coarsely) randomly select 10 of the participants in x to be our intervention group.
Excellent… now that we have our victims… err… intervention group, let’s kick it up a notch and create a data frame that lists the intervention and control groups. Keep in mind that we are dealing with randomisation right now, so it is OK if our (hypothetical) study is not blinded.
Pro-tip: you may want to create an index that stores the rows of the randomised individuals, and then use that index to subset x.
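A sketch of one way to do it, following the indexing strategy from the pro-tip. The contents of x are my assumption here (a hypothetical data frame of 20 eligible participants); in practice x would be whatever participant data you already have.

```r
# hypothetical pool of 20 eligible participants
x <- data.frame(id = 1:20, age = sample(40:80, 20, replace = TRUE))

# randomly pick the row numbers of the 10 intervention participants
idx <- sample(seq_len(nrow(x)), size = 10)

# one data frame listing every participant and their allocated group
groups <- rbind(
  data.frame(x[idx, ],  group = "intervention"),
  data.frame(x[-idx, ], group = "control")
)
```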
Let’s apply all this to a real database. For this exercise you can use any database you may have handy, or download the IST trial database, which was the one I employed when writing this exercise. You can either download it manually, or let R do it for you (isn’t R the best wingman ever, or what?)
To let R download the database, just run this:
download.file("128/IST%20dataset%20supp1%20%281%29%20trials%202011.csv?sequence=4&isAllowed=y", destfile = "stroke.csv")
And remember to read the csv file with:
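For instance (the file name matches the destfile used in the download step above):

```r
# read the downloaded csv file into a data frame
stroke <- read.csv("stroke.csv")
```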
Wow! 19,435 cases! That is certainly a decent sample size. But perhaps we are not interested in all of those cases and just want a subset. Let’s create a data set that keeps all 112 variables, but only 200 randomly selected cases from the original database.
Pro-tip: you may want to use a strategy similar to the one from the previous exercise. I find seq_len() and length() very useful in these cases.
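The same row-index trick sketches the solution. I use a small made-up data frame as a stand-in for the IST data, just to keep the example self-contained; with the real data you would skip that first line.

```r
# hypothetical stand-in for the IST data frame read in earlier
stroke <- data.frame(matrix(rnorm(500 * 5), nrow = 500))

# sample 200 row numbers, then keep those rows and every column
idx <- sample(seq_len(nrow(stroke)), size = 200)
stroke_sub <- stroke[idx, ]

dim(stroke_sub)  # 200 rows, same number of columns as stroke
```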
I bow to your mastery of R-fu.
All this talk about randomness and random samples may be taking its toll on you, and it may have caused you to develop some philosophical/existential doubts, so let’s end these exercises on a more “stable” note. Let’s create replicable random numbers.
Say what!?!? Well, it happens that sometimes you may write an R script that randomly chooses a sample of cases from a larger database (let’s use the IST database again). You perform your analysis, and everything looks nice and dandy… but then you run your script again and… merciful Zeus! The values are all different, because you have just taken another sample at random! You feel like your sanity starts to fade away… but before you pick up that axe and start running naked à la Christian Bale in American Psycho, just take a deep breath and remember: R has you covered.
Create a dataset by randomly selecting cases from the IST database, but that can be replicated every time you run your script.
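A sketch using set.seed(). Again I use a small hypothetical stand-in for the IST data frame to keep the example self-contained; the seed value itself is an arbitrary choice.

```r
# hypothetical stand-in for the IST data frame
stroke <- data.frame(matrix(rnorm(500 * 5), nrow = 500))

# set.seed() fixes the generator's starting point: same seed, same "random" draw
set.seed(42)  # 42 is arbitrary; any integer works
first_sample <- stroke[sample(seq_len(nrow(stroke)), 200), ]

set.seed(42)  # resetting the seed reproduces the exact same sample
second_sample <- stroke[sample(seq_len(nrow(stroke)), 200), ]

identical(first_sample, second_sample)  # TRUE: the draw is replicable
```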
That is it for now, grasshopper. This was just a warm-up. Tune in next week for some discussion of probability and biostatistics concepts using R. (Yay! that is fun^2!)