Lab 3 - Probability Distributions in R
Binomial Distributions
R has four built-in functions for each probability distribution, beginning with a "d", "p", "q", or "r" and followed by the name of the distribution. For example, for a binomial distribution with 10 trials and success probability 1/3,
# "Density" function:
# For a discrete random variable (e.g., binomial):
# probability mass function (pmf), e.g., the probability X = 5.
# For a continuous random variable (e.g., normal):
# probability density function (pdf), e.g., the normal curve y-value at a given x-value.
dbinom(5,10,1/3)
# Cumulative distribution function (cdf): e.g., the probability X <= 5.
pbinom(5,10,1/3)
# Quantile function: The value of X such that, e.g., .25 percent of the
# distribution lies below it, i.e., the inverse of the cdf.
qbinom(.25,10,1/3)
# Random generation: e.g., generate 20 data points from the distribution.
rbinom(20,10,1/3)
dbinom Examples
For a Binomial(6,1/3) random variable named X, compute the probability of 2 successes, Pr(X=2):
> dbinom(2,6,1/3) [1] 0.3292181
Compare to the direct computation using the formula for binomial probabilities: Pr(X = 2) =
((6*5*4*3*2*1)/(2*1*4*3*2*1)) * (1/3)^2 * (2/3)^4
or
factorial(6)/(factorial(2)*factorial(4)) * (1/3)^2 * (2/3)^4
or
choose(6,2) * (1/3)^2 * (2/3)^4
We can compute the probabilities for the whole sample space at once if we like with:
dbinom(0:6,6,1/3)
A nicer way to display those probabilities is in a table:
matrix(c(0:6,dbinom(0:6,6,1/3)),ncol=2, dimnames=list(NULL,c("n","Prob")))
When printing tables like this, it is often better to round off the entries to fewer significant digits:
M <- matrix(c(0:6,dbinom(0:6,6,1/3)),ncol=2, dimnames=list(NULL,c("n","Prob"))) round(M,4)
We can also produce the theoretical histogram for repeated trials of a given binomial experiment. Here is a user-defined function to draw the binomial density "curve", you can paste it in to your R session:
hist.binom <- function(n, p, ...) { # plots a relative frequency histogram of the binomial distribution k <- 0:n p <- dbinom(k, n, p) names(p) <- as.character(0:n) barplot(p,space=0, ...) }
A relative frequency histogram is just a barplot with no space between the bars, where the total area is 1. The height of each bar here the probability associated with that binomial outcome. To use the hist.binom function, you must specify the values of n and p. For example, to get the histogram of a Binomial(6,1/3) distribution, use
hist.binom(6,1/3)
Try experimenting with the color ("col") argument; use help(colors) or colors() to see what colors R knows by name.
hist.binom(6,1/3,col="yellow")
pbinom Examples
For a binomial(6,1/3) random variable X, compute the probability that X is less than 3; in other words, Pr(X <= 2):
pbinom(2,6,1/3)
Compare to summing the density (ie adding up the areas under the binomial histogram:
dbinom(0,6,1/3)+dbinom(1,6,1/3)+dbinom(2,6,1/3)
or
sum(dbinom(0:2,6,1/3))
We can find the probability that we get more than a given number of successes by subtraction: for the probability that X is greater than 2, Pr(X > 2):
1 - pbinom(2,6,1/3)
We could of course compute the probability of that event by summing the individual probabilities:
sum(dbinom(3:6,6,1/3))
Finally, R has another option for this computation:
pbinom(2,6,1/3,lower.tail=FALSE)
The tails of a probability distribution are the values at either end of the range of the random variable. The pbinom function normally assumes that you want the lower tail of the distribution, that is the probability of getting less than or equal to a specified value. The specification "lower.tail=FALSE" tells R to compute the upper tail of the distribution, that is the probability of getting a value greater than the argument. Compare all three methods and make sure that they give the same probabilities. Note: since the Binomial sample space only includes integer values, Pr(X < k) is NOT the same as Pr(X <= k-1).
qbinom Examples
For continuous distributions, the "p" function and the "q" function are inverses (i.e., either one undoes the action of the other: try qnorm(pnorm(0)) or pnorm(qnorm(.5)).) This doesn't work out quite so perfectly for the binomial distribution because of the discrete nature of the sample space. It is too "lumpy." Compare
qbinom(.5,6,1/3)
pbinom(2,6,1/3)
Normal Distributions
For this section of the lab, work through the OpenIntro lab on the normal distribution.
If the data does not load, use the following R command:
bdims <- read.csv("http://www.math.montana.edu/shancock/data/bdims.csv")
Other Probability Distributions
The two main types of random variables are discrete and continuous. A random variable is discrete if you can enumerate all possible values of the variable. A random variable is continuous if it can take any value in an interval of the real line. The R functions for several discrete distributions we may see in this course are:
Discrete Distribution (parameters) |
Scenario | Possible Values | R Function (_ = d, p, q, r) |
X ~ Binomial(n, p) | X = number of "successes" in n independent trials; P("success") = p |
0, 1,..., n |
_binom |
(X1,..., Xc) ~ Multinomial(n, p1,..., pc) | Xi = number of observations in category i in n independent trials; P("category i") = pi, i = 1,..., c |
0, 1,..., n where X1+...+Xc = n |
_multinom |
X ~ Poisson(lambda) | X = count of "rare" events in an interval of time; mean count = lambda |
0, 1, 2,... |
_pois |
X ~ Geometric(p) | X = number of "failures" until first "success"; P("success") = p |
0, 1, 2,... |
_geom |
X ~ Negative Binomial(r, p) | X = number of "failures" until rth "success"; P("success") = p |
0, 1, 2,... |
_nbinom |
X ~ Hypergeometric(N, k, n) | X = number of "successes" in a sample of size n from an population of N items, k of which are "successes" and N-k are "failures" |
max(0, n-(N-k)),..., min(n, k) |
_hyper |
The R functions for several continuous distributions are:
Continuous Distribution (parameters) |
Scenario | Possible Values | R Function (_ = d, p, q, r) |
X ~ Normal(mean, sd) | "bell curve" e.g., approximate distribution of people's heights |
(-infinity, infinity) |
_norm |
X ~ Uniform(min, max) | X = random number chosen from the interval [min, max] | [min, max] |
_unif |
X ~ Exponential(rate) | X = time between events with given rate | [0, infinity) |
_exp |
X ~ Chi-squared(df) | Used in hypothesis testing. | [0, infinity) |
_chisq |
X ~ t(df) | Used in hypothesis testing. | (-infinity, infinity) |
_t |
X ~ F(df1, df2) | Used in hypothesis testing. | [0, infinity) |
_f |
It's a good idea to become familiar with the syntax of the R functions for each distribution
(some of the arguments aren't what you would expect). Try looking up the help files
for several of these functions.