

8.3.2 An enhanced approach

The car package provides a number of functions that significantly enhance your ability to fit and evaluate regression models (see table 8.4).

Table 8.4 Useful functions for regression diagnostics (car package)

Function               Purpose
qqPlot()               Quantile comparisons plot
durbinWatsonTest()     Durbin–Watson test for autocorrelated errors
crPlots()              Component plus residual plots
ncvTest()              Score test for nonconstant error variance
spreadLevelPlot()      Spread-level plots
outlierTest()          Bonferroni outlier test
avPlots()              Added-variable plots
influencePlot()        Regression influence plots
scatterplot()          Enhanced scatter plots
scatterplotMatrix()    Enhanced scatter plot matrixes
vif()                  Variance inflation factors

It’s important to note that there are many changes between version 1.x and version 2.x of the car package, including changes in function names and behavior. This chapter is based on version 2.
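If you're not sure which version you have installed, base R can tell you (a quick check, not one of the book's listings):

packageVersion("car")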

In addition, the gvlma package provides a global test for linear model assumptions. Let’s look at each in turn, by applying them to our multiple regression example.
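As a preview, here's the gvlma package's basic usage (a minimal sketch, assuming the package is installed; the book returns to gvlma later in the chapter):

library(gvlma)
gvmodel <- gvlma(fit)    # global validation of linear model assumptions
summary(gvmodel)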

NORMALITY

The qqPlot() function provides a more accurate method of assessing the normality assumption than that provided by the plot() function in the base package. It plots the studentized residuals (also called studentized deleted residuals or jackknifed residuals) against a t distribution with n – p – 1 degrees of freedom, where n is the sample size and p is the number of regression parameters (including the intercept). The code follows:

library(car)
states <- as.data.frame(state.x77[,c("Murder", "Population",
                        "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost,
          data=states)
qqPlot(fit, labels=row.names(states), id.method="identify",
       simulate=TRUE, main="Q-Q Plot")

The qqPlot() function generates the probability plot displayed in figure 8.9. The option id.method="identify" makes the plot interactive: after the graph is drawn, mouse clicks on points in the graph will label them with values specified in the labels option of the function. Pressing the Esc key, selecting Stop from the graph's drop-down menu, or right-clicking the graph turns off this interactive mode. Here, I identified Nevada. When simulate=TRUE, a 95% confidence envelope is produced using a parametric bootstrap. (Bootstrap methods are considered in chapter 12.)

Figure 8.9 Q-Q plot for studentized residuals (studentized residuals vs. t quantiles; Nevada is the labeled outlier)

With the exception of Nevada, all the points fall close to the line and are within the confidence envelope, suggesting that you’ve met the normality assumption fairly well. But you should definitely look at Nevada. It has a large positive residual (actual – predicted), indicating that the model underestimates the murder rate in this state. Specifically:

> states["Nevada",]
       Murder Population Illiteracy Income Frost
Nevada   11.5        590        0.5   5149   188

> fitted(fit)["Nevada"]
  Nevada
3.878958

> residuals(fit)["Nevada"]
  Nevada
7.621042

> rstudent(fit)["Nevada"]
  Nevada
3.542929

Here you see that the murder rate is 11.5%, but the model predicts a 3.9% murder rate. The question that you need to ask is, "Why does Nevada have a higher murder rate than predicted from population, income, illiteracy, and temperature?" Anyone (who hasn't seen Goodfellas) want to guess?


For completeness, here’s another way of visualizing errors. Take a look at the code in the next listing. The residplot() function generates a histogram of the studentized residuals and superimposes a normal curve, kernel-density curve, and rug plot. It doesn’t require the car package.

Listing 8.6 Function for plotting studentized residuals

residplot <- function(fit, nbreaks=10) {
    z <- rstudent(fit)
    hist(z, breaks=nbreaks, freq=FALSE,
         xlab="Studentized Residual",
         main="Distribution of Errors")
    rug(jitter(z), col="brown")
    curve(dnorm(x, mean=mean(z), sd=sd(z)),
          add=TRUE, col="blue", lwd=2)
    lines(density(z)$x, density(z)$y,
          col="red", lwd=2, lty=2)
    legend("topright",
           legend = c("Normal Curve", "Kernel Density Curve"),
           lty=1:2, col=c("blue","red"), cex=.7)
}

residplot(fit)

The results are displayed in figure 8.10.

As you can see, the errors follow a normal distribution quite well, with the exception of a large outlier. Although the Q-Q plot is probably more informative, I've always found it easier to gauge the skew of a distribution from a histogram or density plot than from a probability plot. Why not use both?

Figure 8.10 Distribution of studentized residuals using the residplot() function (histogram with superimposed normal curve, kernel-density curve, and rug plot)

INDEPENDENCE OF ERRORS

As indicated earlier, the best way to assess whether the dependent variable values (and thus the residuals) are independent is from your knowledge of how the data were collected. For example, time series data often display autocorrelation—observations collected closer in time are more correlated with each other than with observations distant in time. The car package provides a function for the Durbin–Watson test to detect such serially correlated errors. You can apply the Durbin–Watson test to the multiple-regression problem with the following code:

> durbinWatsonTest(fit)
 lag Autocorrelation D-W Statistic p-value
   1          -0.201          2.32   0.282
 Alternative hypothesis: rho != 0

The nonsignificant p-value (p=0.282) suggests a lack of autocorrelation and, conversely, an independence of errors. The lag value (1 in this case) indicates that each observation is being compared with the one next to it in the dataset. Although appropriate for time-dependent data, the test is less applicable for data that isn’t clustered in this fashion. Note that the durbinWatsonTest() function uses bootstrapping (see chapter 12) to derive p-values. Unless you add the option simulate=FALSE, you’ll get a slightly different value each time you run the test.
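If you need output that's identical across runs, you can pin down the bootstrap (a brief aside, not one of the book's listings):

set.seed(1234)                         # fixes the bootstrap resampling
durbinWatsonTest(fit)                  # now reproducible run to run
durbinWatsonTest(fit, simulate=FALSE)  # skips the bootstrap entirely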

LINEARITY

You can look for evidence of nonlinearity in the relationship between the dependent variable and the independent variables by using component plus residual plots (also known as partial residual plots). The plot is produced by the crPlots() function in the car package. You’re looking for any systematic departure from the linear model that you’ve specified.

To create a component plus residual plot for predictor X_j, you plot the points

ε_i + β̂_j × X_{ji}  versus  X_{ji}

where ε_i are the residuals from the full model, β̂_j is the regression coefficient for X_j, and i = 1 ... n. The straight line in each graph is given by β̂_j × X_{ji} versus X_{ji}; the loess fit lines are described in chapter 11. The code to produce these plots is as follows:

> library(car)
> crPlots(fit)

The resulting plots are provided in figure 8.11. Nonlinearity in any of these plots suggests that you may not have adequately modeled the functional form of that predictor in the regression. If so, you may need to add curvilinear components such as polynomial terms, transform one or more variables (for example, use log(X) instead of X), or abandon linear regression in favor of some other regression variant. Transformations are discussed later in this chapter.
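For a flavor of what those remedies look like in code, here's a hypothetical sketch (neither change is actually called for by this data):

fit2 <- lm(Murder ~ Population + I(Population^2) + Illiteracy +
           Income + Frost, data=states)    # add a polynomial term
fit3 <- lm(Murder ~ log(Population) + Illiteracy +
           Income + Frost, data=states)    # transform a predictor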


Figure 8.11 Component plus residual plots for the regression of murder rate on state characteristics (one panel per predictor: Population, Illiteracy, Income, Frost; y-axis: Component+Residual(Murder))

The component plus residual plots confirm that you’ve met the linearity assumption. The form of the linear model seems to be appropriate for this dataset.
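To see what crPlots() is drawing, here's a minimal hand-rolled partial residual plot for one predictor (a sketch of the idea rather than the car implementation; it omits the loess curve):

pr <- residuals(fit) + coef(fit)["Illiteracy"] * states$Illiteracy
plot(states$Illiteracy, pr, xlab="Illiteracy",
     ylab="Component + Residual (Murder)")     # partial residuals
abline(a=0, b=coef(fit)["Illiteracy"], lty=2)  # the linear component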

HOMOSCEDASTICITY

The car package also provides two useful functions for identifying non-constant error variance. The ncvTest() function produces a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the fitted values. A significant result suggests heteroscedasticity (nonconstant error variance).

The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals versus the fitted values and superimposes a line of best fit. Both functions are demonstrated in the next listing.


Listing 8.7 Assessing homoscedasticity

> library(car)
> ncvTest(fit)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare=1.7    Df=1    p=0.19

> spreadLevelPlot(fit)
Suggested power transformation: 1.2

The score test is nonsignificant (p = 0.19), suggesting that you've met the constant variance assumption. You can also see this in the spread-level plot (figure 8.12). The points form a random horizontal band around a horizontal line of best fit. If you'd violated the assumption, you'd expect to see a nonhorizontal line. The suggested power transformation in listing 8.7 is the power p such that Y^p would stabilize the nonconstant error variance. For example, if the plot showed a nonhorizontal trend and the suggested power transformation was 0.5, then using √Y rather than Y in the regression equation might lead to a model that satisfies homoscedasticity. If the suggested power was 0, you'd use a log transformation. In the current example, there's no evidence of heteroscedasticity, and the suggested power is close to 1 (no transformation required).
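Had a transformation been needed, it's a one-line change to the model formula. A hypothetical sketch (the powers here are illustrative, not output from this data):

fit.sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost,
               data=states)    # for a suggested power of 0.5
fit.log  <- lm(log(Murder) ~ Population + Illiteracy + Income + Frost,
               data=states)    # for a suggested power of 0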

Figure 8.12 Spread-level plot for assessing constant error variance (absolute studentized residuals vs. fitted values)
