

CHAPTER 12 Resampling statistics and bootstrapping

But what if you aren’t willing to assume that the sampling distribution of the mean is normally distributed? You can use a bootstrapping approach instead:

1. Randomly select 10 observations from the sample, with replacement after each selection. Some observations may be selected more than once, and some may not be selected at all.

2. Calculate and record the sample mean.

3. Repeat the first two steps 1,000 times.

4. Order the 1,000 sample means from smallest to largest.

5. Find the sample means representing the 2.5th and 97.5th percentiles. In this case, it's the 25th number from the bottom and top. These are your 95% confidence limits.
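The five steps above can be sketched directly in base R. This is a minimal illustration; the sample values in x are invented for the example, and any numeric vector of 10 observations would work:

```r
# Manual bootstrap of the sample mean, following the five steps above.
# The values in x are hypothetical, chosen only for illustration.
set.seed(1234)                     # reproducible resampling
x <- c(2.1, 3.5, 4.2, 2.8, 3.9, 5.0, 3.3, 4.7, 2.5, 3.1)

boot_means <- numeric(1000)
for (i in 1:1000) {                                  # step 3: repeat 1,000 times
  s <- sample(x, size = length(x), replace = TRUE)   # step 1: resample with replacement
  boot_means[i] <- mean(s)                           # step 2: record the sample mean
}
boot_means <- sort(boot_means)                       # step 4: order the means
ci <- c(boot_means[25], boot_means[976])             # step 5: 25th value from bottom and top
ci                                                   # the 95% confidence limits
```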

In the present case, where the sample mean is likely to be normally distributed, you gain little from the bootstrap approach. Yet there are many cases where bootstrapping is advantageous. What if you wanted confidence intervals for the sample median, or for the difference between two sample medians? There are no simple normal-theory formulas here, and bootstrapping is the approach of choice. If the underlying distributions are unknown, if outliers are a problem, if sample sizes are small, or if parametric approaches don't exist, bootstrapping can often provide a useful method of generating confidence intervals and testing hypotheses.

12.6 Bootstrapping with the boot package

The boot package provides extensive facilities for bootstrapping and related resampling methods. You can bootstrap a single statistic (for example, a median) or a vector of statistics (for example, a set of regression coefficients). Be sure to download and install the boot package before first use:

install.packages("boot")

The bootstrapping process will seem complicated, but once you review the examples it should make sense.

In general, bootstrapping involves three main steps:

1. Write a function that returns the statistic or statistics of interest. If there is a single statistic (for example, a median), the function should return a number. If there is a set of statistics (for example, a set of regression coefficients), the function should return a vector.

2. Process this function through the boot() function in order to generate R bootstrap replications of the statistic(s).

3. Use the boot.ci() function to obtain confidence intervals for the statistic(s) generated in step 2.

Now to the specifics.

The main bootstrapping function is boot(). It has the format

bootobject <- boot(data=, statistic=, R=, ...)



The parameters are described in table 12.3.

Table 12.3 Parameters of the boot() function

Parameter   Description
data        A vector, matrix, or data frame.
statistic   A function that produces the k statistics to be bootstrapped (k=1 if
            bootstrapping a single statistic). The function should include an
            indices parameter that the boot() function can use to select cases
            for each replication (see the examples in the text).
R           Number of bootstrap replicates.
...         Additional parameters to be passed to the function that produces the
            statistic of interest.

The boot() function calls the statistic function R times. Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data). These indices are used in the statistic function to select a sample. The statistics are calculated on the sample, and the results are accumulated in bootobject. The bootobject structure is described in table 12.4.
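What one such set of random indices looks like can be shown in miniature, using the built-in mtcars data frame as an example. This is only a sketch of what happens internally; boot() handles this bookkeeping for you:

```r
# One bootstrap replication's worth of indices: row numbers drawn with
# replacement from 1:nrow(data). Some rows repeat; others are absent.
set.seed(1)
idx <- sample(1:nrow(mtcars), replace = TRUE)
table(idx)      # counts greater than 1 show the repeats
length(idx)     # always nrow(data), here 32
```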

Table 12.4 Elements of the object returned by the boot() function

Element   Description
t0        The observed values of k statistics applied to the original data
t         An R × k matrix, where each row is a bootstrap replicate of the k statistics

You can access these elements as bootobject$t0 and bootobject$t.

Once you generate the bootstrap samples, you can use print() and plot() to examine the results. If the results look reasonable, you can use the boot.ci() function to obtain confidence intervals for the statistic(s). The format is

boot.ci(bootobject, conf=, type= )

The parameters are given in table 12.5.

Table 12.5 Parameters of the boot.ci() function

Parameter    Description
bootobject   The object returned by the boot() function.
conf         The desired confidence interval (default: conf=0.95).
type         The type of confidence interval returned. Possible values are norm,
             basic, stud, perc, bca, and all (default: type="all").


The type parameter specifies the method for obtaining the confidence limits. The perc method (percentile) was demonstrated in the sample mean example. bca provides an interval that makes simple adjustments for bias. I find bca preferable in most circumstances. See Mooney and Duval (1993) for an introduction to these methods.

In the remaining sections, we’ll look at bootstrapping a single statistic and a vector of statistics.

12.6.1 Bootstrapping a single statistic

The mtcars dataset contains information on 32 automobiles reported in the 1974 Motor Trend magazine. Suppose you’re using multiple regression to predict miles per gallon from a car’s weight (lb/1,000) and engine displacement (cu. in.). In addition to the standard regression statistics, you’d like to obtain a 95% confidence interval for the R-squared value (the percent of variance in the response variable explained by the predictors). The confidence interval can be obtained using nonparametric bootstrapping.

The first task is to write a function for obtaining the R-squared value:

rsq <- function(formula, data, indices) {
  d <- data[indices,]
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
}

The function returns the R-squared value from a regression. The d <- data[indices,] statement is required for boot() to be able to select samples.
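As a quick sanity check (not part of the book's listing), you can call rsq() yourself. Passing the indices 1:nrow(mtcars) selects every row exactly once, so the result equals the R-squared from an ordinary regression on the full data:

```r
rsq <- function(formula, data, indices) {  # as defined above
  d <- data[indices,]
  fit <- lm(formula, data=d)
  return(summary(fit)$r.square)
}

# Indices 1:32 select each row once, reproducing the ordinary fit
rsq(mpg ~ wt + disp, mtcars, 1:nrow(mtcars))   # about 0.781
```

This matches the value labeled "original" in the boot() output below.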

You can then draw a large number of bootstrap replications (say, 1,000) with the following code:

library(boot)

set.seed(1234)

results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp)

The boot object can be printed using

> print(results)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:

boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
     original     bias    std. error
t1* 0.7809306 0.01333670  0.05068926

and plotted using plot(results). The resulting graph is shown in figure 12.2.

[Figure 12.2: a histogram of the bootstrapped values t* (Density vs. t*) alongside a normal Q-Q plot of t* against quantiles of the standard normal.]

Figure 12.2 Distribution of bootstrapped R-squared values

In figure 12.2, you can see that the distribution of bootstrapped R-squared values isn’t normally distributed. A 95% confidence interval for the R-squared values can be obtained using

> boot.ci(results, type=c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = c("perc", "bca"))

Intervals :
Level      Percentile            BCa
95%   ( 0.6838,  0.8833 )   ( 0.6344,  0.8549 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable

You can see from this example that different approaches to generating the confidence intervals can lead to different intervals. In this case, the bias-adjusted (BCa) interval is moderately different from the percentile interval. In either case, the null hypothesis H0: R-square = 0 would be rejected, because zero is outside the confidence limits.
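The percentile limits can also be reproduced by hand, without the boot package: resample rows of mtcars, refit the model, and take the 2.5th and 97.5th quantiles of the replicated R-squared values. This is only a sketch; the numbers will differ slightly from boot.ci()'s because the random draws differ:

```r
# A hand-rolled percentile interval for R-squared (base R only).
set.seed(1234)
r2 <- replicate(1000, {
  idx <- sample(nrow(mtcars), replace = TRUE)         # resample rows
  summary(lm(mpg ~ wt + disp, data = mtcars[idx, ]))$r.squared
})
quantile(r2, c(0.025, 0.975))   # percentile limits for R-squared
```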

In this section, you estimated the confidence limits of a single statistic. In the next section, you’ll estimate confidence intervals for several statistics.
