

The first task is to write a function for obtaining the R-squared value:

rsq <- function(formula, data, indices) {
    d <- data[indices,]
    fit <- lm(formula, data=d)
    return(summary(fit)$r.square)
}

The function returns the R-squared value from a regression. The d <- data[indices,] statement is required for boot() to be able to select samples.
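As a quick check, you can call the function directly. In this minimal sketch (assuming the mtcars data frame used below), the original row indices reproduce the full-sample R-squared, while a with-replacement sample of indices yields a single bootstrap replicate of the kind boot() will generate internally:

rsq(mpg~wt+disp, data=mtcars, indices=1:nrow(mtcars))    # full-sample R-squared
rsq(mpg~wt+disp, data=mtcars,
    indices=sample(nrow(mtcars), replace=TRUE))          # one bootstrap replicate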

You can then draw a large number of bootstrap replications (say, 1,000) with the following code:

library(boot)

set.seed(1234)

results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp)
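The returned boot object stores the statistic computed on the original sample and the bootstrap replicates in documented components:

results$t0       # R-squared on the original sample
head(results$t)  # first few of the 1,000 bootstrap replicates (a 1000 x 1 matrix here)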

The boot object can be printed using

> print(results)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
     original     bias    std. error
t1* 0.7809306 0.01333670  0.05068926

and plotted using plot(results). The resulting graph is shown in figure 12.2.

In figure 12.2, you can see that the distribution of bootstrapped R-squared values isn’t normally distributed. A 95 percent confidence interval for the R-squared values can be obtained using

> boot.ci(results, type=c("perc", "bca"))
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = c("perc", "bca"))

Intervals :
Level      Percentile             BCa
95%   ( 0.6838,  0.8833 )   ( 0.6344,  0.8549 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
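Incidentally, the percentile limits are essentially the empirical quantiles of the replicates stored in results$t. As a minimal sketch (boot.ci() uses a slightly different quantile convention, so the values will be close but not identical):

quantile(results$t[,1], probs=c(0.025, 0.975))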

You can see from this example that different approaches to generating the confidence intervals can lead to different intervals. In this case, the bias-adjusted (BCa) interval is moderately different from the one produced by the percentile method. In either case, the null hypothesis H0: R-squared = 0 would be rejected, because zero is outside the confidence limits.

[Figure 12.2 Distribution of bootstrapped R-squared values: a histogram of the replicates t* (density scale) and a normal Q-Q plot of t* against quantiles of the standard normal]

In this section, we estimated the confidence limits of a single statistic. In the next section, we’ll estimate confidence intervals for several statistics.

12.6.2 Bootstrapping several statistics

In the previous example, bootstrapping was used to estimate the confidence interval for a single statistic (R-squared). Continuing the example, let’s obtain the 95 percent confidence intervals for a vector of statistics. Specifically, let’s get confidence intervals for the three model regression coefficients (intercept, car weight, and engine displacement).

First, create a function that returns the vector of regression coefficients:

bs <- function(formula, data, indices) {
    d <- data[indices,]
    fit <- lm(formula, data=d)
    return(coef(fit))
}
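As before, the function can be called directly on the original indices. This sketch (assuming the mtcars model used below) shows the coefficient order, which determines the index values used shortly:

bs(mpg~wt+disp, data=mtcars, indices=1:nrow(mtcars))
# returns (Intercept), wt, disp -- index 1, 2, and 3, respectively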

Then use this function to bootstrap 1,000 replications:


library(boot)

set.seed(1234)

results <- boot(data=mtcars, statistic=bs, R=1000, formula=mpg~wt+disp)

> print(results)

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = bs, R = 1000, formula = mpg ~
    wt + disp)

Bootstrap Statistics :
    original      bias    std. error
t1*  34.9606  0.137873     2.48576
t2*  -3.3508 -0.053904     1.17043
t3*  -0.0177 -0.000121     0.00879

When bootstrapping multiple statistics, add an index parameter to the plot() and boot.ci() functions to indicate which column of bootobject$t to analyze. In this example, index 1 refers to the intercept, index 2 to car weight, and index 3 to engine displacement. To plot the results for car weight, use

plot(results, index=2)

The graph is given in figure 12.3.

[Figure 12.3 Distribution of bootstrapped regression coefficients for car weight: a histogram of the replicates t* (density scale) and a normal Q-Q plot of t* against quantiles of the standard normal]

To get the 95 percent confidence intervals for car weight and engine displacement, use

> boot.ci(results, type="bca", index=2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 2)

Intervals :
Level       BCa
95%   (-5.66, -1.19 )
Calculations and Intervals on Original Scale

> boot.ci(results, type="bca", index=3)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 3)

Intervals :
Level       BCa
95%   (-0.0331,  0.0010 )
Calculations and Intervals on Original Scale
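For comparison, it can be instructive to set these nonparametric BCa intervals beside the usual normal-theory intervals for the same model; a brief sketch:

fit <- lm(mpg~wt+disp, data=mtcars)
confint(fit)    # parametric 95% intervals for (Intercept), wt, disp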

NOTE In the previous example, we resampled the entire sample of data each time. If we assume that the predictor variables have fixed levels (typical in planned experiments), we’d do better to only resample residual terms. See Mooney and Duval (1993, pp. 16–17) for a simple explanation and algorithm.
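The residual-resampling idea can also be sketched with boot(). The following is a minimal illustration under that fixed-X assumption, not the Mooney and Duval algorithm verbatim: the design is held fixed, and the index vector supplied by boot() is used to resample residuals rather than rows (rsq_resid is a hypothetical helper name):

fit <- lm(mpg~wt+disp, data=mtcars)
rsq_resid <- function(data, indices) {
    d <- data
    d$mpg <- fitted(fit) + resid(fit)[indices]   # new response from resampled residuals
    return(summary(lm(mpg~wt+disp, data=d))$r.square)
}
set.seed(1234)
resid_results <- boot(data=mtcars, statistic=rsq_resid, R=1000)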

Before we leave bootstrapping, it’s worth addressing two questions that come up often:

- How large does the original sample need to be?
- How many replications are needed?

There’s no simple answer to the first question. Some say that an original sample size of 20–30 is sufficient for good results, as long as the sample is representative of the population. Random sampling from the population of interest is the most trusted method for assuring the original sample’s representativeness. With regard to the second question, I find that 1,000 replications are more than adequate in most cases. Computer power is cheap and you can always increase the number of replications if desired.

There are many helpful sources of information on permutation tests and bootstrapping. An excellent starting place is an online article by Yu (2003). Good (2006) provides a comprehensive overview of resampling in general and includes R code. A good, accessible introduction to the bootstrap is provided by Mooney and Duval (1993). The definitive source on bootstrapping is Efron and Tibshirani (1998). Finally, there are a number of great online resources, including Simon (1997), Canty (2002), Shah (2005), and Fox (2002).

12.7 Summary

In this chapter, we introduced a set of computer-intensive methods based on randomization and resampling that allow you to test hypotheses and form confidence intervals without reference to a known theoretical distribution. They're particularly valuable when your data comes from unknown population distributions, when there are serious outliers, when your sample sizes are small, and when there are no existing parametric methods to answer the hypotheses of interest.

The methods in this chapter are particularly exciting because they provide an avenue for answering questions when your standard data assumptions are clearly untenable, or when you have no other idea how to approach the problem. Permutation tests and bootstrapping aren’t panaceas, though. They can’t turn bad data into good data. If your original samples aren’t representative of the population of interest, or are too small to accurately reflect it, then these techniques won’t help.

In the next chapter, we’ll consider data models for variables that follow known, but not necessarily normal, distributions.

Part 4

Advanced methods

In this final section, we consider advanced methods of statistical and graphical analysis to round out your data analysis toolkit. Chapter 13 expands on the regression methods in chapter 8 to cover parametric approaches to data that aren't normally distributed. The chapter starts with a discussion of the generalized linear model, and then focuses on cases where we're trying to predict an outcome variable that's either categorical (logistic regression) or a count (Poisson regression).

Dealing with a large number of variables can be challenging, due to the complexity inherent in multivariate data. Chapter 14 describes two popular methods for exploring and simplifying multivariate data. Principal components analysis can be used to transform a large number of correlated variables into a smaller set of composite variables. Factor analysis consists of a set of techniques for uncovering the latent structure underlying a given set of variables. Chapter 14 provides step-by-step instructions for carrying out each.

More often than not, researchers must deal with incomplete datasets. Chapter 15 considers modern approaches to the ubiquitous problem of missing data values. R supports a number of elegant approaches for analyzing datasets that are incomplete for various reasons. Several of the best approaches are described here, along with guidance on which ones to use, and which ones to avoid.

Chapter 16 completes our discussion of graphics with presentations of some of R's most advanced and useful approaches to visualizing data. This includes visual representations of complex data using the lattice package, and an introduction to the new, and increasingly popular, ggplot2 package. The chapter ends with a review of packages that provide functions for interacting with graphs in real time.

After completing part 4, you will have the tools to manage a wide range of complex data analysis problems. This includes modeling non-normal outcome variables, dealing with large numbers of correlated variables, and handling messy and incomplete data. Additionally, you will have the tools to visualize complex data in useful, innovative and creative ways.

13 Generalized linear models

This chapter covers
- Formulating a generalized linear model
- Predicting categorical outcomes
- Modeling count data

In chapters 8 (regression) and 9 (ANOVA), we explored linear models that can be used to predict a normally distributed response variable from a set of continuous and/or categorical predictor variables. But there are many situations in which it’s unreasonable to assume that the dependent variable is normally distributed (or even continuous). For example:

- The outcome variable may be categorical. Binary variables (for example, yes/no, passed/failed, lived/died) and polytomous variables (for example, poor/good/excellent, Republican/Democrat/Independent) are clearly not normally distributed.
- The outcome variable may be a count (for example, number of traffic accidents in a week, number of drinks per day). Such variables take on a limited number of values and are never negative. Additionally, their mean and variance are often related (which isn't true for normally distributed variables).


Generalized linear models extend the linear model framework to include dependent variables that are decidedly non-normal.

In this chapter, we’ll start with a brief overview of generalized linear models and the glm() function used to estimate them. Then we’ll focus on two popular models within this framework: logistic regression (where the dependent variable is categorical) and Poisson regression (where the dependent variable is a count variable).

To motivate the discussion, we’ll apply generalized linear models to two research questions that aren’t easily addressed with standard linear models:

- What personal, demographic, and relationship variables predict marital infidelity? In this case, the outcome variable is binary (affair/no affair).
- What impact does a drug treatment for seizures have on the number of seizures experienced over an eight-week period? In this case, the outcome variable is a count (number of seizures).

We’ll apply logistic regression to address the first question and Poisson regression to address the second. Along the way, we’ll consider extensions of each technique.

13.1 Generalized linear models and the glm() function

A wide range of popular data analytic methods are subsumed within the framework of the generalized linear model. In this section we’ll briefly explore some of the theory behind this approach. You can safely skip over this section if you like and come back to it later.

Let's say that you want to model the relationship between a response variable Y and a set of p predictor variables X1...Xp. In the standard linear model, you assume that Y is normally distributed and that the form of the relationship is

µY = β0 + β1X1 + β2X2 + ... + βpXp

This equation states that the conditional mean of the response variable is a linear combination of the predictor variables. The βj are the parameters specifying the expected change in Y for a unit change in Xj and β0 is the expected value of Y when all the predictor variables are 0. You’re saying that you can predict the mean of the Y distribution for observations with a given set of X values by applying the proper weights to the X variables and adding them up.

Note that you've made no distributional assumptions about the predictor variables, Xj. Unlike Y, there's no requirement that they be normally distributed. In fact, they're often categorical (for example, ANOVA designs). Additionally, nonlinear functions of the predictors are allowed. You often include such predictors as X² or X1 × X2. What is important is that the equation is linear in the parameters (β0, β1, ..., βp).

In generalized linear models, you fit models of the form

g(µY) = β0 + β1X1 + β2X2 + ... + βpXp

where g(µY) is a function of the conditional mean (called the link function). Additionally, you relax the assumption that Y is normally distributed. Instead, you assume that Y follows a distribution that's a member of the exponential family. You specify the link function and the probability distribution, and the parameters are derived through an iterative maximum likelihood estimation procedure.
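To give a flavor of that estimation procedure, here is a minimal sketch of iteratively reweighted least squares (IRLS) for a logistic-regression GLM. It assumes X is a numeric design matrix with an intercept column and y a 0/1 response; irls_logit is a hypothetical illustration, simplified relative to what glm() actually does internally:

irls_logit <- function(X, y, iters=25) {
    beta <- rep(0, ncol(X))
    for (i in seq_len(iters)) {
        eta <- X %*% beta                  # linear predictor
        mu  <- 1/(1 + exp(-eta))           # inverse logit link
        W   <- as.vector(mu * (1 - mu))    # working weights
        z   <- eta + (y - mu)/W            # working response
        beta <- solve(t(X) %*% (W * X), t(X) %*% (W * z))   # weighted least squares
    }
    beta
}

Each iteration solves a weighted least-squares problem; for well-behaved data the estimates typically converge in a handful of iterations.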

13.1.1 The glm() function

Generalized linear models are typically fit in R through the glm() function (although other specialized functions are available). The form of the function is similar to lm() but includes additional parameters. The basic format of the function is

glm(formula, family=family(link=function), data=)

where the probability distribution (family) and corresponding default link function (function) are given in table 13.1.

Table 13.1 glm() parameters

Family             Default link function
binomial           (link = "logit")
gaussian           (link = "identity")
Gamma              (link = "inverse")
inverse.gaussian   (link = "1/mu^2")
poisson            (link = "log")
quasi              (link = "identity", variance = "constant")
quasibinomial      (link = "logit")
quasipoisson       (link = "log")

The glm() function allows you to fit a number of popular models, including logistic regression, Poisson regression, and survival analysis (not considered here). You can demonstrate this for the first two models as follows. Assume that you have a single response variable (Y), three predictor variables (X1, X2, X3), and a data frame (mydata) containing the data.

Logistic regression is applied to situations in which the response variable is dichotomous (0,1). The model assumes that Y follows a binomial distribution, and that you can fit a linear model of the form

log(π/(1 − π)) = β0 + β1X1 + β2X2 + ... + βpXp

where π = µY is the conditional mean of Y (that is, the probability that Y = 1 given a set of X values), π/(1 − π) is the odds that Y = 1, and log(π/(1 − π)) is the log odds, or logit. In this case, log(π/(1 − π)) is the link function, the probability distribution is binomial, and the logistic regression model can be fit using

glm(Y~X1+X2+X3, family=binomial(link="logit"), data=mydata)

Logistic regression is described more fully in section 13.2.
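It's also worth noting that the standard linear model is a special case of this framework: with a gaussian family and identity link, glm() reproduces the lm() fit. A brief sketch, using the same hypothetical mydata frame:

glm(Y~X1+X2+X3, family=gaussian(link="identity"), data=mydata)
lm(Y~X1+X2+X3, data=mydata)    # same parameter estimates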
