The first task is to write a function for obtaining the R-squared value:
rsq <- function(formula, data, indices) {
    d <- data[indices,]                  # select the bootstrap sample
    fit <- lm(formula, data=d)           # refit the model to the sample
    return(summary(fit)$r.square)        # return the R-squared value
}
The function returns the R-squared value from the regression. The d <- data[indices,] statement is required for boot() to be able to select bootstrap samples by row index.
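To see what boot() will do with this function, you can call it by hand with one random set of row indices from the mtcars data used below. This is purely an illustrative sketch; boot() performs the equivalent step internally on each replication:

set.seed(1)
rsq(mpg ~ wt + disp, data=mtcars, indices=sample(nrow(mtcars), replace=TRUE))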
You can then draw a large number of bootstrap replications (say, 1,000) with the following code:
library(boot)
set.seed(1234)
results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp)
The boot object can be printed using
> print(results)
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)
Bootstrap Statistics :
     original       bias    std. error
t1* 0.7809306  0.01333670  0.05068926
and plotted using plot(results). The resulting graph is shown in figure 12.2.
In figure 12.2, you can see that the distribution of bootstrapped R-squared values isn’t normally distributed. A 95 percent confidence interval for the R-squared values can be obtained using
> boot.ci(results, type=c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = c("perc", "bca"))

Intervals :
Level      Percentile            BCa
95%   ( 0.6838,  0.8833 )   ( 0.6344,  0.8549 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
Figure 12.2 Distribution of bootstrapped R-squared values (left: histogram of the bootstrap replicates t*, with density; right: normal Q-Q plot of t* against the quantiles of the standard normal)

You can see from this example that different approaches to generating the confidence intervals can lead to different intervals. In this case, the bias-adjusted (BCa) interval is moderately different from the percentile interval. In either case, the null hypothesis H0: R-squared = 0 would be rejected, because zero is outside the confidence limits.
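If you want to compare more interval types side by side, boot.ci() can also compute normal-approximation and basic intervals from the same boot object. A quick sketch (the studentized type is omitted because it requires bootstrap variance estimates):

boot.ci(results, type=c("norm", "basic", "perc", "bca"))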
In this section, we estimated the confidence limits of a single statistic. In the next section, we’ll estimate confidence intervals for several statistics.
12.6.2 Bootstrapping several statistics
In the previous example, bootstrapping was used to estimate the confidence interval for a single statistic (R-squared). Continuing the example, let’s obtain the 95 percent confidence intervals for a vector of statistics. Specifically, let’s get confidence intervals for the three model regression coefficients (intercept, car weight, and engine displacement).
First, create a function that returns the vector of regression coefficients:
bs <- function(formula, data, indices) {
    d <- data[indices,]                  # select the bootstrap sample
    fit <- lm(formula, data=d)           # refit the model to the sample
    return(coef(fit))                    # return the coefficient vector
}
Then use this function to bootstrap 1,000 replications:
set.seed(1234)
results <- boot(data=mtcars, statistic=bs, R=1000, formula=mpg~wt+disp)

A 95 percent BCa confidence interval for car weight (the second statistic returned by bs()) is obtained with

> boot.ci(results, type="bca", index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 2)

Intervals :
Level       BCa
95%   (-5.66, -1.19 )
Calculations and Intervals on Original Scale
The corresponding interval for engine displacement (the third statistic) is

> boot.ci(results, type="bca", index=3)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 3)

Intervals :
Level       BCa
95%   (-0.0331,  0.0010 )
Calculations and Intervals on Original Scale
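As with boot.ci(), the plot() method for boot objects accepts an index argument, so you can inspect the bootstrap distribution of any single coefficient. For example, for car weight (the second statistic returned by bs()):

plot(results, index=2)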
NOTE In the previous example, we resampled the entire sample of data each time. If we assume that the predictor variables have fixed levels (typical in planned experiments), we’d do better to only resample residual terms. See Mooney and Duval (1993, pp. 16–17) for a simple explanation and algorithm.
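Here is a minimal sketch of that residual (fixed-X) resampling scheme, applied to the mtcars regression used above; the bs.resid() function and object names are illustrative, not from the text:

fit <- lm(mpg ~ wt + disp, data=mtcars)
bs.resid <- function(data, indices) {
    # keep the predictors fixed; build a new response from the original
    # fitted values plus a resample of the residuals
    data$mpg <- fitted(fit) + resid(fit)[indices]
    coef(lm(mpg ~ wt + disp, data=data))
}
set.seed(1234)
resid.results <- boot(data=mtcars, statistic=bs.resid, R=1000)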
Before we leave bootstrapping, it’s worth addressing two questions that come up often:
■How large does the original sample need to be?
■How many replications are needed?
There’s no simple answer to the first question. Some say that an original sample size of 20–30 is sufficient for good results, as long as the sample is representative of the population. Random sampling from the population of interest is the most trusted method for assuring the original sample’s representativeness. With regard to the second question, I find that 1,000 replications are more than adequate in most cases. Computer power is cheap and you can always increase the number of replications if desired.
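For example, rerunning the earlier R-squared bootstrap with ten times as many replications takes only a few seconds, and you can compare the resulting intervals directly (an illustrative sketch using the rsq() function defined earlier):

set.seed(1234)
results10k <- boot(data=mtcars, statistic=rsq, R=10000, formula=mpg~wt+disp)
boot.ci(results10k, type="bca")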
There are many helpful sources of information on permutation tests and bootstrapping. An excellent starting place is an online article by Yu (2003). Good (2006) provides a comprehensive overview of resampling in general and includes R code. A good, accessible introduction to the bootstrap is provided by Mooney and Duval (1993). The definitive source on bootstrapping is Efron and Tibshirani (1998). Finally, there are a number of great online resources, including Simon (1997), Canty (2002), Shah (2005), and Fox (2002).
12.7 Summary
In this chapter, we introduced a set of computer-intensive methods based on randomization and resampling that allow you to test hypotheses and form confidence intervals without reference to a known theoretical distribution. They’re particularly valuable when your data comes from unknown population distributions, when there are serious
outliers, when your sample sizes are small, and when there are no existing parametric methods to answer the hypotheses of interest.
The methods in this chapter are particularly exciting because they provide an avenue for answering questions when your standard data assumptions are clearly untenable, or when you have no other idea how to approach the problem. Permutation tests and bootstrapping aren’t panaceas, though. They can’t turn bad data into good data. If your original samples aren’t representative of the population of interest, or are too small to accurately reflect it, then these techniques won’t help.
In the next chapter, we’ll consider data models for variables that follow known, but not necessarily normal, distributions.
Part 4
Advanced methods
In this final section, we consider advanced methods of statistical and graphical analysis to round out your data analysis toolkit. Chapter 13 expands on the regression methods in chapter 8 to cover parametric approaches to data that aren’t normally distributed. The chapter starts with a discussion of the generalized linear model, and then focuses on cases where we’re trying to predict an outcome variable that’s either categorical (logistic regression) or a count (Poisson regression).
Dealing with a large number of variables can be challenging, due to the complexity inherent in multivariate data. Chapter 14 describes two popular methods for exploring and simplifying multivariate data. Principal components analysis can be used to transform a large number of correlated variables into a smaller set of composite variables. Factor analysis consists of a set of techniques for uncovering the latent structure underlying a given set of variables. Chapter 14 provides step-by-step instructions for carrying out each.
More often than not, researchers must deal with incomplete datasets. Chapter 15 considers modern approaches to the ubiquitous problem of missing data values. R supports a number of elegant approaches for analyzing datasets that are incomplete for various reasons. Several of the best approaches are described here, along with guidance on which ones to use and which ones to avoid.
Chapter 16 completes our discussion of graphics with presentations of some of R’s most advanced and useful approaches to visualizing data. This includes visual representations of complex data using the lattice package, and an introduction to the new, and increasingly popular, ggplot2 package. The chapter ends with a review of packages that provide functions for interacting with graphs in real-time.
After completing part 4, you will have the tools to manage a wide range of complex data analysis problems. This includes modeling non-normal outcome variables, dealing with large numbers of correlated variables, and handling messy and incomplete data. Additionally, you will have the tools to visualize complex data in useful, innovative and creative ways.