Please make sure that it is your own work and not copied and pasted. Please read the study guide, and please watch out for spelling and grammar errors. Please use APA 7th edition.
Book Reference: Fox, J. (2017). Using the R Commander: A point-and-click interface for R. CRC Press. https://online.vitalsource.com/#/books/9781498741934
Provide an example of how simple linear regression could be used within your potential field of study for your dissertation. Make sure you address the purpose of regression and the type of results you would obtain. Also discuss the assumptions that must be met to use this type of analysis; your EOSA modules discuss this. Clearly identify the variables you are considering.
7.1 Linear Regression Models
As mentioned, linear least-squares regression is typically taken up in a basic statistics course. The normal linear regression model is written

    y_i = β_0 + β_1 x_1i + β_2 x_2i + … + β_k x_ki + ε_i

where y_i is the value of the response variable for the ith of n independently sampled observations; x_1i, x_2i, …, x_ki are the values of k explanatory variables; and the errors ε_i are normally and independently distributed with 0 means and constant variance, ε_i ∼ NID(0, σ²_ε). Both y and the xs are numeric variables, and the model assumes that the average value E(y) of y is a linear function (that is, a simple weighted sum) of the xs.
If there is just one x (i.e., if k = 1), then

    y_i = β_0 + β_1 x_i + ε_i

is called the linear simple-regression model; if there is more than one x (k ≥ 2), then it is called the linear multiple-regression model.
The normal linear model is optimally estimated by the method of least squares, producing the fitted model

    ŷ_i = b_0 + b_1 x_1i + b_2 x_2i + … + b_k x_ki

where ŷ_i is the fitted value and e_i = y_i − ŷ_i the residual for observation i. The least-squares criterion
FIGURE 7.1: The Linear Regression dialog for Duncan's occupational prestige data.
selects the values of the bs that minimize the sum of squared residuals, ∑e_i². The least-squares regression coefficients are easily computed, and, in addition to having desirable statistical properties under the model (such as efficiency and unbiasedness), statistical inference based on the least-squares estimates is very simple (see, e.g., the references given at the beginning of the chapter).
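As a minimal sketch of these ideas in R itself (with simulated x and y standing in for real data, not Duncan's variables), lm() computes the least-squares coefficients, and the resulting residuals satisfy the usual normal-equation properties:

```r
# Least-squares fitting with lm() on toy data (illustrative only).
set.seed(123)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)

fit <- lm(y ~ x)      # least-squares fit of the simple-regression model
b <- coef(fit)        # b[1] = intercept b_0, b[2] = slope b_1
e <- residuals(fit)   # e_i = y_i - yhat_i

# Consequences of minimizing the sum of squared residuals:
# the residuals sum to (essentially) zero and are uncorrelated with x.
sum(e)
sum(e * x)
```

The two sums printed at the end are zero up to rounding error, which is exactly what the least-squares criterion guarantees for a model with an intercept.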
The simplest way to fit a linear regression model in the R Commander is via the Linear Regression dialog. To illustrate, I'll use Duncan's occupational prestige data. Duncan's data set resides in the car package, and so I can read the data into the R Commander via Data > Data in packages > Read data from an attached package. Then selecting Statistics > Fit models > Linear regression produces the dialog in Figure 7.1. To complete the dialog, I click on prestige in the Response variable list, and Ctrl-click on education and income in the Explanatory variables list. Finally, pressing the OK button produces the output shown in Figure 7.2.
The commands generated by the Linear Regression dialog use the lm (linear model) function in R to fit the model, creating RegModel.1, and then summarize the model to produce printed output. The summary output includes information about the distribution of the residuals; coefficient estimates, their standard errors, t statistics for testing the null hypothesis that each population regression coefficient is 0, and the two-sided p-values for these tests; the standard deviation of the residuals ("residual standard error") and residual degrees of freedom; the squared multiple correlation, R², for the model and R² adjusted for degrees of freedom; and the omnibus F test of the hypothesis that all population slope coefficients (here the coefficients of education and income) are 0 (H0: β1 = β2 = 0, for the example).
This is more or less standard least-squares regression output, similar to printed output produced by almost all statistical packages. What is unusual is that in addition to the printout in Figure 7.2, the R Commander creates and retains a linear model object on which I can perform further computations, as illustrated later in this chapter.
The Model button in the R Commander toolbar now reads RegModel.1, rather than
FIGURE 7.2: Output from Duncan's regression of occupational prestige on income and education, produced by the Linear Regression dialog.
The variable lists in the Linear Regression dialog in Figure 7.1 include only numeric variables. For example, the factor type (type of occupation) in Duncan's data set, with levels "bc" (blue-collar), "wc" (white-collar), and "prof" (professional, technical, or managerial), doesn't appear in either variable list. Moreover, the explanatory variables that are selected enter the model linearly and additively. The Linear Model dialog, described in the next section, is capable of fitting a much wider variety of regression models.
In completing the Linear Regression dialog in Figure 7.1, I left the name of the model at its default, RegModel.1. The R Commander generates unique model names automatically during a session, each time incrementing the model number (here 1).
I also left the Subset expression at its default. Had I instead entered a subset expression such as type == "bc", for example, the regression model would have been fit only to blue-collar occupations. As in this example, the subset expression can be a logical expression, returning the value TRUE or FALSE for each case; a vector of case indices to include; or a negative vector of case indices to exclude. For example, 1:25 would include the first 25 occupations, while -c(6, 16) would exclude occupations 6 and 16.
All of the statistical modeling dialogs in the R Commander allow subsets of cases to be specified in this manner.
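Under the hood, the dialog's Subset expression is passed to the subset argument of lm(). A hedged sketch with a small made-up data frame (the variable names mimic Duncan's data but the values are invented) shows the three forms of subset described above:

```r
# Toy stand-in for Duncan's data (values are illustrative, not real).
toy <- data.frame(
  prestige  = c(82, 41, 67, 20, 15, 88),
  education = c(86, 33, 72, 25, 20, 90),
  income    = c(62, 30, 55, 16, 12, 76),
  type      = c("prof", "bc", "wc", "bc", "bc", "prof")
)

# Logical expression: keep only blue-collar cases.
m1 <- lm(prestige ~ education + income, data = toy, subset = type == "bc")
# Vector of case indices to include.
m2 <- lm(prestige ~ education + income, data = toy, subset = 1:5)
# Negative indices to exclude cases 1 and 6.
m3 <- lm(prestige ~ education + income, data = toy, subset = -c(1, 6))

nobs(m1)  # number of cases actually used in each fit
nobs(m2)
nobs(m3)
```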
7.2 Linear Models with Factors*
Like the Linear Regression dialog described in the preceding section, the Linear Model dialog can fit additive linear regression models, but it is much more flexible: The Linear Model dialog accommodates transformations of the response and explanatory variables, factors as well as numeric explanatory variables on the right-hand side of the regression model, nonlinear functions of explanatory variables expressed as polynomials and regression splines, and interactions among explanatory variables. All this is accomplished by allowing the user to specify the model as an R linear-model formula. Linear-model formulas in R are inherited from the S programming language (Chambers and Hastie, 1992), and are a version of notation for expressing linear models originally introduced by Wilkinson and Rogers (1973).
7.2.1 Linear-Model Formulas
An R linear-model formula is of the general form response-variable ∼ linear-predictor. The tilde (∼) in a linear-model formula can be read as "is regressed on." Thus, in this general form, the response variable is regressed on a linear predictor comprising the terms in the right-hand side of the model.
The left-hand side of the model formula, response-variable, is an R expression that evaluates to the numeric response variable in the model, and is usually simply the name of the response variable, for example, prestige in Duncan's regression. You can, however, transform the response variable directly in the model formula (e.g., log10(income)) or compute the response as a more complex arithmetic expression (e.g., log(investment.income + hourly.wage.rate*hours.worked)).
The formulation of the linear predictor on the right-hand side of a model formula is more complex. What are normally arithmetic operators (+, -, *, /, and ^) in R expressions have special meanings in a model formula, as do the operators : (colon) and %in%. The numeral 1 (one) may be used to represent the regression constant (i.e., the intercept) in a model formula; this is usually unnecessary, however, because an intercept is included by default. A period (.) represents all of the variables in the data set with the exception of the response. Parentheses may be used for grouping, much as in an arithmetic expression.
In the large majority of cases, you'll be able to formulate a model using only the operators + (interpreted as "and") and * (interpreted as "crossed with"), and so I'll emphasize these operators here. The meanings of these and the other model-formula operators are summarized and illustrated in Table 7.1. Especially on first reading, feel free to ignore everything in the table except +, :, and * (and : is rarely used directly).
A final formula subtlety: As I've explained, the arithmetic operators take on special meanings on the right-hand side of a linear-model formula. A consequence is that you can't use these operators directly for arithmetic. For example, fitting the model savings ∼ wages + interest + dividends estimates a separate regression coefficient for each of wages, interest, and dividends. Suppose, however, that you want to estimate a single coefficient for the sum of these variables, in effect setting the three coefficients equal to each other. The solution is to "protect" the + operator inside a call to the I (identity or inhibit) function, which simply returns its argument unchanged: savings ∼ I(wages + interest + dividends). This formula works as desired because arithmetic operators like + have their usual meaning within a function call on the right-hand side of the formula, implying, incidentally, that savings ∼ log10(wages + interest + dividends) also works as intended, estimating a single coefficient for the log base 10 of the sum of wages, interest, and dividends.
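The difference between the formula operator + and arithmetic + protected by I() can be seen by counting coefficients. A small illustration with simulated data (the variable names savings, wages, interest, and dividends are the ones used in the text, but the values are made up):

```r
# Simulated data for the savings example (values are illustrative).
set.seed(1)
d <- data.frame(
  wages     = rnorm(50, 100, 10),
  interest  = rnorm(50, 20, 5),
  dividends = rnorm(50, 10, 3)
)
d$savings <- 0.3 * (d$wages + d$interest + d$dividends) + rnorm(50)

# + as a formula operator: a separate slope for each variable.
m.sep <- lm(savings ~ wages + interest + dividends, data = d)
# + protected by I(): a single slope for the sum.
m.sum <- lm(savings ~ I(wages + interest + dividends), data = d)

length(coef(m.sep))  # intercept plus 3 slopes
length(coef(m.sum))  # intercept plus 1 slope
```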
TABLE 7.1: Operators and other symbols used on the right-hand side of R linear-model formulas.

    Symbol/Example           Interpretation
    x1 + x2                  x1 and x2 enter the model additively
    x1:x2                    the interaction of x1 and x2
    x1*x2                    same as x1 + x2 + x1:x2
    x1 - 1                   regression through the origin (for numeric x1)
    ^k                       cross to order k
    (x1 + x2 + x3)^2         same as x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3
    province %in% country    province nested within country
    country/province         same as country + province %in% country
    - 1                      suppress the intercept
    .                        everything but the response
    y ∼ .                    regress y on everything else
    x1*(x2 + x3)             same as x1*x2 + x1*x3

Note: The symbols x1, x2, and x3 represent explanatory variables and could be either numeric or factors.
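Several of these operator equivalences can be verified directly with terms(), which expands a formula into its constituent terms (the variable names x1, x2, x3 are just the placeholders from the table):

```r
# Helper: extract the expanded term labels of a formula.
lbl <- function(f) attr(terms(f), "term.labels")

lbl(~ x1*x2)             # "x1" "x2" "x1:x2"
lbl(~ (x1 + x2 + x3)^2)  # all main effects plus all two-way interactions
lbl(~ x1*(x2 + x3))      # expands like x1*x2 + x1*x3

# The last equivalence in the table:
identical(lbl(~ x1*(x2 + x3)), lbl(~ x1*x2 + x1*x3))
```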
7.2.2 The Principle of Marginality
Introduced by Nelder (1977), the principle of marginality is a rule for formulating and interpreting linear (and similar) statistical models. According to the principle of marginality, if an interaction, say x1:x2, is included in a linear model, then so should the main effects, x1 and x2, that are marginal to (that is, lower-order relatives of) the interaction. Similarly, the lower-order interactions x1:x2, x1:x3, and x2:x3 are marginal to the three-way interaction x1:x2:x3. The regression constant (1 in an R model formula) is marginal to every other term in the model.
It is in most circumstances difficult in R to formulate models that violate the principle of marginality, and trying to do so can produce unintended results. For example, although it may appear that the model y ∼ f*x - x - 1, where f is a factor and x is a numeric explanatory variable, violates the principle of marginality by removing the regression constant and the x slope, the model that R actually fits includes a separate intercept and slope for each level of the factor f. Thus, the model y ∼ f*x - x - 1 is equivalent to (i.e., an alternative parametrization of) y ∼ f*x. It is almost always best to stay away from such unusual model formulas.
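The reparametrization claim can be checked numerically: the two formulas give different coefficients but identical fitted values. A toy example (f and x are invented data, not from any data set in the chapter):

```r
# Toy data: a two-level factor f and a numeric x.
set.seed(42)
toy <- data.frame(f = factor(rep(c("a", "b"), each = 10)), x = rnorm(20))
toy$y <- ifelse(toy$f == "a", 1 + 2 * toy$x, 3 - toy$x) + rnorm(20, sd = 0.1)

# Apparent marginality violation: no intercept, no common x slope...
m.odd  <- lm(y ~ f*x - x - 1, data = toy)
# ...versus the standard crossed model.
m.full <- lm(y ~ f*x, data = toy)

coef(m.odd)   # a separate intercept and slope for each level of f
all.equal(fitted(m.odd), fitted(m.full))  # same fit, different parametrization
```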
7.2.3 Examples Using the Canadian Occupational Prestige Data
For concreteness, I'll formulate several linear models for the Canadian occupational prestige data, regressing prestige on income, education, women (gender composition), and type (type of occupation). The last variable is a factor (categorical variable) and so it cannot enter into the linear model directly. When a factor is included in a linear-model formula, R generates contrasts to represent the factor, one fewer than the number of levels of the factor. I'll explain how this works in greater detail in Section 7.2.4, but the default in the R Commander (and R more generally) is to use 0/1 dummy-variable regressors, also called indicator variables.
A version of the Canadian occupational prestige data resides in the data frame Prestige in the car package, and it's convenient to read the data into the R Commander from this source via Data > Data in packages > Read data from an attached package. Prestige replaces Duncan as the active data set.
Recall that 4 of the 102 occupations in the Prestige data set have missing values (NA) for occupational type. Because I will fit several regression models to the Prestige data, not all of which include type, I begin by filtering the data set for missing values, selecting Data > Active data set > Remove cases with missing data (as described in Section 4.5.2).
Moreover, the default alphabetical ordering of the levels of type ("bc", "prof", "wc") is not the natural ordering, and so I also reorder the levels of this factor via Data > Manage variables in active data set > Reorder factor levels to "bc", "wc", "prof" (see Section 3.4). This last step isn't strictly necessary, but it makes the data analysis easier to follow.
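Behind this menu item, the R Commander ultimately calls factor() with an explicit levels argument; a base-R sketch of the same reordering (with a small made-up vector of occupation types):

```r
# A toy factor with R's default alphabetical level ordering.
type <- factor(c("prof", "bc", "wc", "bc"))
levels(type)  # "bc" "prof" "wc" (alphabetical)

# Reorder to the natural ordering used in the text.
type <- factor(type, levels = c("bc", "wc", "prof"))
levels(type)  # "bc" "wc" "prof"
```

Only the level ordering changes; the data values themselves are untouched.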
I first fit an additive dummy regression to the Canadian prestige data, employing the model formula prestige ∼ income + education + women + type. To do so, I select Statistics > Fit models > Linear model from the R Commander menus, producing the dialog box in Figure 7.3. The automatically supplied model name is LinearModel.2, reflecting the fact that I have already fit a statistical model in the session, RegModel.1 (in Section 7.1).
Most of the structure of the Linear Model dialog is common to statistical modeling dialogs in the R Commander. If the response text box to the left of the ∼ in the model formula is empty, double-clicking on a variable name in the variable list box enters the name into the response box; thereafter, double-clicking on variable names enters the names into the right-hand side of the model formula, separated by +s (if no operator appears at the end of the partially completed formula). You can enter parentheses and operators like + and * into the formula using the toolbar in the dialog box. You can also type directly into the model-formula text boxes. In Figure 7.3, I simply double-clicked successively on prestige, education, income, women, and type. Clicking OK produces the output shown in Figure 7.4.
I already explained the general format of linear-model summary output in R. What's new in Figure 7.4 is the way in which the factor type is handled in the linear model: Two dummy-variable regressors are automatically created for the three-level factor type. The first dummy regressor, labelled type[T.wc] in the output, is coded 1 when type is "wc" and 0 otherwise; the second dummy regressor, type[T.prof], is coded 1 when type is "prof" and 0 otherwise. The first level of type, "bc", is therefore selected as the reference or baseline level, coded 0 for both dummy regressors.
Consequently, the intercept in the linear-model output is the intercept for the "bc" reference level of type, and the coefficients for the other levels give differences in the intercepts between each of these levels and the reference level. Because the slope coefficients for the numeric explanatory variables education, income, and women in this additive model do not vary by levels of type, the dummy-variable coefficients are also interpretable as the average difference between each other level and "bc" for any fixed values of education, income, and women.
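The dummy coding itself can be inspected with model.matrix(). A sketch using base R's contr.treatment (which differs from car's contr.Treatment only in how the columns are labelled), with a small invented type vector:

```r
# A three-level factor in the natural ordering from the text.
type <- factor(c("bc", "wc", "prof", "bc"), levels = c("bc", "wc", "prof"))

# model.matrix() shows the regressors R would generate for ~ type:
# an intercept column plus two 0/1 dummies for "wc" and "prof".
mm <- model.matrix(~ type)
mm
# Rows where type is "bc" are 0 on both dummies: "bc" is the baseline.
```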
FIGURE 7.3: Linear Model dialog completed to fit an additive dummy-variable regression of prestige on the numeric explanatory variables education, income, and women, and the factor type.
To illustrate a structurally more complex, nonadditive model, I respecify the Canadian occupational prestige regression model to include interactions between type and education and between type and income, in the process removing women from the model; in the initial regression, the coefficient of women is small, with a large p-value.
The Linear Model dialog (not shown) reopens in its previous state, with the model name incremented to LinearModel.3. To fit the new model, I modify the formula to read prestige ∼ type*education + type*income. Clicking OK produces the output in Figure 7.5.

With interactions in the model, there are different intercepts and slopes for each level of type. The intercept in the output, along with the coefficients for education and income, pertains to the baseline level "bc" of type. Other coefficients represent differences between each of the other levels and the baseline level. For example, type[T.wc] = −33.54 is the difference in intercepts between the "wc" and "bc" levels of type; similarly, the interaction coefficient type[T.wc]:education = 4.291 is the difference in education slopes between the "wc" and "bc" levels. The complexity of the coefficients makes it difficult to understand what the model says about the data; Section 7.6 shows how to visualize terms such as interactions in a complex linear model.
FIGURE 7.4: Output for the linear model prestige ∼ income + education + women + type fit to the Prestige data.

FIGURE 7.5: Output for the linear model prestige ∼ type*education + type*income fit to the Prestige data.
TABLE 7.2: Contrast-regressor codings for type generated by contr.Treatment, contr.Sum, contr.poly, and contr.Helmert.
Levels of type
7.2.4 Dummy Variables and Other Contrasts for Factors
By default in the R Commander, factors in linear-model formulas are represented by 0/1 dummy-variable regressors generated by the contr.Treatment function in the car package, picking the first level of a factor as the baseline level.
This contrast coding, along with some other choices, is shown in Table 7.2, using the factor type in the Prestige data set as an example.
The function contr.Sum from the car package generates so-called "sigma-constrained" or "sum-to-zero" contrast regressors, as are used in traditional treatments of analysis of variance.
The standard R function contr.poly generates orthogonal-polynomial contrasts, in this case linear and quadratic terms for the three levels of type; in the R Commander, contr.poly is the default choice for ordered factors. Finally, contr.Helmert generates Helmert contrasts, which compare each level to the average of those preceding it.
Selecting Data > Manage variables in active data set > Define contrasts for a factor produces the dialog box on the left of Figure 7.6. The factor type is preselected in this dialog because it's the only factor in the data set. You can use the radio buttons to choose among treatment, sum-to-zero, Helmert, and polynomial contrasts, or define customized contrasts by selecting Other, as I've done here. Clicking OK leads to the sub-dialog shown on the right of Figure 7.6. I change the default contrast names, .1 and .2, to [bc.v.others] and [wc.v.prof], and then fill in the contrast coefficients (i.e., the values of the contrast regressors). This choice produces contrast regressors named type[bc.v.others] and type[wc.v.prof], to be used when the factor type in the Prestige data set appears in a linear-model formula. Contrasts defined directly in this manner must be linearly independent and are simplest to interpret if they obey two additional rules: (1) the coefficients for each contrast should sum to 0, and (2) each pair of contrasts should be orthogonal (i.e., the products of corresponding coefficients for each pair of contrasts sum to 0).
FIGURE 7.6: The Set Contrasts for Factor dialog box (left) and the Specify Contrasts sub-dialog (right), creating contrasts for the factor type in the Prestige data set.
To see how these contrasts are reflected in the coefficients of the model, I refit the additive regression of prestige on education, income, women, and type, producing the output in Figure 7.7. The first contrast for type estimates the difference between "bc" and the average of the other two levels of type, holding the other explanatory variables constant, while the second contrast estimates the difference between "wc" and "prof". This alternative contrast coding for type produces different estimates for the intercept and type coefficients from the dummy-regressor coding for type in Figure 7.4, but the two models have the same fit to the data (e.g., R² = 0.8349).
FIGURE 7.7: Output for the linear model prestige ∼ income + education + women + type fit to the Prestige data, using customized contrasts for type.
7.3 Fitting Regression Splines and Polynomials*
The second formula toolbar in the Linear Model dialog makes it easy to add nonlinear polynomial-regression and regression-spline terms to a linear model.
7.3.1 Polynomial Terms
Some simple nonlinear relationships can be represented as low-order polynomials, such as a quadratic term, using regressors x and x² for a numeric explanatory variable x, or a cubic term, using x, x², and x³. The resulting model is nonlinear in the explanatory variable x but linear in the parameters (the βs). R and the R Commander support both orthogonal and "raw" polynomials in linear-model formulas.
To add a polynomial term to the right-hand side of the model, single-click on a numeric variable in the Variables list box, and then press the appropriate toolbar button (either orthogonal polynomial or raw polynomial, as desired). There is a spinner in the Linear Model dialog for the degree of a polynomial term, and the default is 2 (i.e., a quadratic).
For example, inspection of the data (e.g., in a component-plus-residual plot, discussed in Section 7.8) suggests that there may be a quadratic partial relationship between prestige and women in the regression of prestige on education, income, and women for the Canadian occupational prestige data. I specify this quadratic relationship in the Linear Model dialog in Figure 7.8, using a raw second-degree polynomial, and producing the output in Figure 7.9. The quadratic coefficient in the model turns out not to be statistically significant (p = 0.15).
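The raw-versus-orthogonal distinction can be sketched with poly() on simulated data (women and prestige here are invented stand-ins, not the actual Prestige variables): the two codings give different coefficients but exactly the same fitted model.

```r
# Simulated quadratic relationship (values are illustrative only).
set.seed(7)
women <- runif(100, 0, 100)
prestige <- 50 + 0.1 * women - 0.001 * women^2 + rnorm(100, sd = 2)

# Raw polynomial: regressors are women and women^2 directly.
m.raw  <- lm(prestige ~ poly(women, degree = 2, raw = TRUE))
# Orthogonal polynomial: uncorrelated linear and quadratic regressors.
m.orth <- lm(prestige ~ poly(women, degree = 2))

# Different coefficient values, identical fit:
all.equal(fitted(m.raw), fitted(m.orth))
```

Raw coefficients are easier to read off as an equation in women; orthogonal polynomials are numerically better behaved.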
7.3.2 Regression Splines
Regression splines are flexible functions capable of representing a wide variety of nonlinear patterns in a model that, like a regression polynomial, is linear in the parameters. Both B-splines and natural splines are supported by the R Commander Linear Model dialog. Adding a spline term to the right-hand side of a linear model is similar to adding a polynomial term. Figure 7.10 shows the Linear Model dialog completed with the formula prestige ∼ poly(women, degree=2, raw=TRUE) + ns(education, df=5) + ns(income, df=5), regressing prestige on a quadratic in women and 5-df natural splines in education and income. The output for the resulting regression model isn't shown because the model requires graphical interpretation (see Section 7.6): The coefficient estimates for the regression splines are not simply interpretable.
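A natural-spline term uses ns() from the splines package, which ships with R. A small sketch with simulated income and prestige values (invented data, not the Prestige variables): a 5-df natural spline contributes five regressors, so the fitted model has six coefficients including the intercept.

```r
library(splines)  # ns() for natural splines; shipped with base R

# Simulated nonlinear relationship (illustrative values only).
set.seed(11)
income <- runif(80, 1000, 26000)
prestige <- 20 + 15 * log10(income) + rnorm(80, sd = 3)

# 5-df natural spline in income: 5 basis regressors.
m.ns <- lm(prestige ~ ns(income, df = 5))
length(coef(m.ns))  # intercept + 5 spline coefficients
```

As the text notes, the individual spline coefficients have no direct interpretation; the fitted curve is what matters.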
FIGURE 7.8: Linear Model dialog with a polynomial (quadratic) term for women in the regression of prestige on education, income, and women using the Prestige data set.

FIGURE 7.9: Output from the regression of prestige on education, income, and a quadratic in women for the Prestige data.

FIGURE 7.10: Linear Model dialog showing regression-spline and polynomial terms for the regression of prestige on education, income, and women in the Prestige data set.
7.4 Generalized Linear Models*
Briefly, generalized linear models (or GLMs), introduced in a seminal paper by Nelder and Wedderburn (1972), consist of three components:
1. A random component specifying the distribution of the response y conditional on explanatory variables. Traditionally, the random component is a member of an exponential family (the Gaussian (normal), binomial, Poisson, gamma, or inverse Gaussian families), but both the theory of generalized linear models and their implementation in R are now more general: In addition to the traditional exponential families, R provides for quasi-binomial and quasi-Poisson families that accommodate "over-dispersed" binomial and count data.
2. A linear predictor, η_i = β_0 + β_1 x_1i + β_2 x_2i + … + β_k x_ki, on which the expectation of the response variable μ_i = E(y_i) for the ith of n independent observations depends, where the regressors x_ji are prespecified functions of the explanatory variables: numeric explanatory variables, dummy regressors representing factors, interaction regressors, and so on, exactly as in the linear model.
3. A prespecified invertible link function g(·) that transforms the expectation of the response to the linear predictor, g(μ_i) = η_i, and thus μ_i = g⁻¹(η_i). R implements identity, inverse, log, logit, probit, complementary log-log, square root, and inverse square links, with the applicable links varying by distributional family.
The most common GLM beyond the normal linear model (i.e., the Gaussian family paired with the identity link) is the binomial logit model, suitable for dichotomous (two-category) response variables. For an illustration, I'll use data collected by Cowles and Davis (1987) on volunteering for a psychological experiment, where the subjects of the study were students in a university introductory psychology class.
The data for this example are contained in the data set Cowles in the car package, which includes the following variables: neuroticism, a personality dimension with integer scores ranging from 0 to 24; extraversion, another personality dimension, also with scores from 0 to 24; sex, a factor with levels "female" and "male"; and volunteer, a factor with levels "no" and "yes".
In analyzing the data, Cowles and Davis performed a logistic regression of volunteering on sex and the linear-by-linear interaction between neuroticism and extraversion. To fit Cowles and Davis's model, I first read the data from the car package in the usual manner, making Cowles the active data set in the R Commander. Then I select Statistics > Fit models > Generalized linear model, producing the dialog box in Figure 7.11.
The Generalized Linear Model dialog is very similar to the Linear Model dialog of the preceding section: The name of the model at the top (GLM.7) is automatically generated, and you can change it if you wish. Double-clicking on a variable in the list box enters the variable into the model formula. There are toolbars for entering operators, regression splines, and polynomials into the model formula, and there are boxes for subsetting the data set and for specifying prior weights.
FIGURE 7.11: Generalized Linear Model dialog box for Cowles and Davis's logistic regression.
What's new in the Generalized Linear Model dialog are the Family and Link function list boxes, as are appropriate to a GLM. Families and links are coordinated: Double-clicking on a distributional family changes the available links. In each case, the canonical link for a particular family is selected by default. The initial selections are the binomial family and corresponding canonical logit link, which are coincidentally what I want for the example.
I proceed to complete the dialog by double-clicking on volunteer in the variable list, making it the response variable; then double-clicking on sex and on neuroticism; clicking the * button in the toolbar; and finally double-clicking on extraversion, yielding the model formula volunteer ∼ sex + neuroticism*extraversion. As in the Linear Model dialog, an alternative is to type the formula directly.
Appropriate responses for a binomial logit model include two-level factors (such as volunteer in the current example), logical variables (i.e., with values FALSE and TRUE), and numeric variables with two unique values (most commonly 0 and 1). In each case, the logit model is for the probability of the second of the two values, the probability that volunteer is "yes" in the example.
Clicking the OK button produces the output in Figure 7.12. The Generalized Linear Model dialog uses the R glm function to fit the model. The summary output for a generalized linear model is very similar to that for a linear model, including a table of estimated coefficients along with their standard errors, z values (Wald statistics) for testing that the coefficients are 0, and the two-sided p-values for these tests. For a logistic regression, the R Commander also prints the exponentiated coefficients, interpretable as multiplicative effects on the odds scale, here the odds of volunteering, Pr("yes")/Pr("no").
The Wald z tests suggest a statistically significant interaction between neuroticism and extraversion, as Cowles and Davis expected, and a significant sex effect, with men less likely to volunteer than women who have equivalent scores on the personality dimensions. Because it's hard to grasp the nature of the interaction directly from the coefficient estimates, I'll return to this example in Section 7.6, where I'll plot the fitted model.
Although I've developed just one example of a generalized linear model in this section, a logit model for binary data, the R Commander Generalized Linear Model dialog is more flexible:
• The probit and complementary log-log (cloglog) link functions may also be used with binary data, as alternatives to the canonical logit link.
• The binomial family may also be used when the value of the response variable for each case (or observation) represents the proportion of "successes" in a given number of binomial trials, which may also vary by case. In this setting, the left-hand side of the model formula should give the proportion of successes, which could be computed as successes/trials (imagining that there are variables with these names in the active data set) directly in the left-hand box of the model formula, and the variable representing the number of trials for each observation (e.g., trials) should be given in the Weights box.
• Alternatively, for binomial data, the left-hand side of the model may be a two-column matrix specifying, respectively, the numbers of successes and failures for each observation, by typing, e.g., cbind(successes, failures) (again, imagining that these variables are in the active data set) into the left-hand-side box of the model formula.
• Other generalized linear models are specified by choosing a different family and corresponding link. For example, a Poisson regression model, commonly employed for count data, may be fit by selecting the poisson family and canonical log link (or, to get typically more realistic coefficient standard errors, by selecting the quasipoisson family with the log link).
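The two binomial-data specifications in the bullets above produce the same fit, which can be checked with a toy aggregated data set (successes and trials are the illustrative variable names from the text; the values are simulated):

```r
# Toy aggregated binomial data (illustrative values only).
set.seed(3)
agg <- data.frame(x = 1:10, trials = rep(20, 10))
agg$successes <- rbinom(10, size = agg$trials,
                        prob = plogis(-1 + 0.3 * agg$x))
agg$failures <- agg$trials - agg$successes

# Proportion response with the number of trials as prior weights...
m.prop <- glm(successes / trials ~ x, family = binomial,
              weights = trials, data = agg)
# ...versus a two-column (successes, failures) matrix response.
m.mat <- glm(cbind(successes, failures) ~ x, family = binomial, data = agg)

all.equal(coef(m.prop), coef(m.mat))  # two routes to the same fit
```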
FIGURE 7.12: Output from Cowles and Davis's logistic regression (volunteer ∼ sex + neuroticism*extraversion).
7.5 Other Regression Models*
In addition to linear regression, linear models, and generalized linear models, the R Commander can fit multinomial logit models for categorical response variables with more than two categories (via Statistics > Fit models > Multinomial logit model), and ordinal regression models for ordered multi-category responses, including the proportional-odds logit model and the ordered probit model (Statistics > Fit models > Ordinal regression model). Although I won't illustrate these models here, many of the menu items in the Models menu apply to these classes of models. Moreover (as I will show in Chapter 9), R Commander plug-in packages can introduce additional classes of statistical models.
7.6 Visualizing Linear and Generalized Linear Models*
Introduced by Fox (1987), effect plots are graphs for visualizing complex regression models by focusing on particular explanatory variables or combinations of explanatory variables, holding other explanatory variables to typical values. One strategy is to focus successively on the explanatory variables in the high-order terms of the model, that is, terms that aren't marginal to others (see Section 7.2.2).
In the R Commander, effect displays can be drawn for linear, generalized linear, and some other statistical models via Models > Graphs > Effect plots. Figure 7.13 shows the resulting dialog box for Cowles and Davis's logistic regression from the previous section, GLM.7, which is the current statistical model in the R Commander. By default, the dialog offers to plot all high-order terms in the model, in this case the sex main effect and the neuroticism-by-extraversion interaction. You may alternatively pick a subset of Predictors (explanatory variables) to plot. For a linear or generalized linear model, there's also a check box for plotting partial residuals, unchecked by default, along with a slider for the span of a smoother fit to the residuals. Partial residuals and the accompanying smoother can be useful for judging departures from the functional form of the specified model, as I'll illustrate later in this section.
Clicking OK produces the graph in Figure 7.14: The left-hand panel shows the sex main effect, with neuroticism and extraversion set to average levels. The right-hand panel shows the neuroticism-by-extraversion interaction, for a group composed of males and females in proportion to their representation in the data set. In both graphs, the vertical volunteer axis is drawn on the logit scale but the tick-mark labels are on the estimated probability scale; that is, they represent the estimated probability of volunteering.
In the plot of the interaction, the horizontal axis of each panel is for neuroticism, while extraversion takes on successively larger values across its range, from the lower-left panel to the upper-right panel. The value of extraversion for each panel is represented by the small vertical line in the strip labelled extraversion at the top of the panel.
FIGURE 7.13: Model Effect Plots dialog box for Cowles and Davis's logistic regression (volunteer ∼ sex + neuroticism*extraversion).
The lines in the panels represent the combined effect of neuroticism and extraversion, and are computed using the estimated coefficients for the neuroticism:extraversion interaction along with the coefficients for the neuroticism and extraversion "main effects," which are marginal to the interaction. It's clear that there's a positive relationship between volunteering and neuroticism at the lowest level of extraversion, but that this relationship becomes negative at the highest level of extraversion.
The error bars in the effect plot for sex and the gray bands in the plot of the neuroticism-by-extraversion interaction represent point-wise 95% confidence intervals around the estimated effects. The "rug-plot" at the bottom of each panel in the display of the neuroticism-by-extraversion interaction shows the marginal distribution of neuroticism, with the lines slightly displaced to decrease over-plotting. The rug-plot isn't terribly useful here, because neuroticism just takes on integer scores.
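The dialogs generate R commands behind the scenes; a minimal command-line sketch of the same display, assuming the effects and carData packages are installed (the object name glm.7 is mine), might look like this:

```r
library(effects)   # allEffects() and its plot method
library(carData)   # the Cowles data set

# Cowles and Davis's logistic regression, as in GLM.7
glm.7 <- glm(volunteer ~ sex + neuroticism*extraversion,
             family = binomial, data = Cowles)

# Effect displays for the high-order terms: the sex main effect and
# the neuroticism-by-extraversion interaction
plot(allEffects(glm.7))
```

The vertical axis of the resulting panels is on the logit scale but labelled with estimated probabilities, as described in the text.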
FIGURE 7.14: Effect plots for the high-order terms in Cowles and Davis's logistic regression (volunteer ∼ sex + neuroticism*extraversion). The graphs are shown in monochrome; they (and the other effect plots in the chapter) were originally in color.
FIGURE 7.15: Model Effect Plots dialog box for LinearModel.2 (prestige ∼ education + income + women + type) fit to the Prestige data.
7.6.1 Partial Residuals in Effect Plots
Adding partial residuals to effect plots of numeric explanatory variables in linear and generalized linear models can be an effective tool for judging departures from the functional form (linear or otherwise) specified in the model. I'll illustrate using the Canadian occupational prestige data. Earlier in the chapter, I fit several models to the Prestige data, including an additive dummy-regression model (LinearModel.2),
prestige ∼ education + income + women + type
and a model with interactions (LinearModel.3),
prestige ∼ type*education + type*income
The R Commander session in this chapter is unusual in that I've read three data sets (Duncan, Prestige, and Cowles) and fit statistical models to each. It is much more common to work with a single data set in a session. Nevertheless, as I explained, the R Commander allows you to switch among models and data sets, and takes care of synchronizing models with the data sets to which they were fit. After making LinearModel.2 the active model, I return to the Model Effect Plots dialog, displayed in Figure 7.15. I check Plot partial residuals and click OK, producing Figure 7.16. Partial residuals are plotted for the numeric predictors but not for the factor type; this is reflected in a warning printed in the Messages pane, which I'll simply ignore.
The solid lines in the effect plots represent the model fit to the data, while the broken lines are smooths of the partial residuals. If the lines for the fitted model and smoothed partial residuals are similar, that lends support to the specified functional form of the model. The partial residuals are computed by adding the residual for each observation to the line representing the fitted effect. It appears as if the education effect is modelled reasonably, but the income and women effects appear to be nonlinear.
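At the command line, the same idea can be sketched with the car package's crPlots function, which (per footnote 31) draws the traditional component-plus-residual plots that correspond to these displays for an additive model; the object names here are my own:

```r
library(car)   # crPlots(); loading car also makes the Prestige data available

d  <- na.omit(Prestige)   # four occupations have missing type
m2 <- lm(prestige ~ education + income + women + type, data = d)

# Component-plus-residual (partial-residual) plots for the numeric predictors
crPlots(m2, terms = ~ . - type)

# A partial residual is the fitted effect plus the observation's residual;
# for income, for example:
pres.income <- coef(m2)["income"] * d$income + residuals(m2)
```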
LinearModel.3 includes interactions between type and each of education and income. Figure 7.17 shows effect plots with partial residuals for the high-order terms in this model.
FIGURE 7.16: Effect displays with partial residuals for LinearModel.2 (prestige ∼ education + income + women + type) fit to the Prestige data.
FIGURE 7.17: Effect displays with partial residuals for LinearModel.3 (prestige ∼ type*education + type*income) fit to the Prestige data.
Because dividing the data by type leaves relatively few points in each panel of the plots, I set the span of the smoother to a large value, 0.9.
The apparent nonlinearity in the relationship between prestige and income is accounted for by the interaction between income and type: The right-hand display of Figure 7.17 shows that the income slope is smaller for professional and managerial occupations (i.e., type = "prof") than for blue-collar ("bc") or white-collar ("wc") occupations, and professional occupations tend to have higher incomes. The display at the left, for the education-by-type interaction, suggests that the education slope is steeper for white-collar occupations than for the other types of occupations. The smooths of the partial residuals indicate that these relationships are linear within the levels of type.
The confidence envelopes in the effect displays with partial residuals in Figures 7.16 and 7.17 also make a useful pedagogical point about precision of estimation of the regression surface: Where data are sparse—or, in the extreme, absent—the regression surface is imprecisely estimated.
LinearModel.6, fit earlier in the chapter with the formula
prestige ∼ poly(women, degree=2, raw=TRUE) + ns(education, df=5) + ns(income, df=5)
uses a quadratic in women along with regression splines for income and education, which should capture the unmodelled nonlinearity observed in Figure 7.16; the model doesn't include the factor type, however. I make LinearModel.6 the active model and repeat the effect plots, which are shown in Figure 7.18. Here, the fitted model and smoothed residuals agree well with each other.
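Fitting LinearModel.6 from the command line might look like the following sketch (the splines package, which ships with base R, supplies ns; the Prestige data are in the carData package):

```r
library(splines)   # ns(): natural regression splines
library(carData)   # the Prestige data

# LinearModel.6: quadratic in women, 5-df natural splines for education and income
lm.6 <- lm(prestige ~ poly(women, degree = 2, raw = TRUE) +
             ns(education, df = 5) + ns(income, df = 5),
           data = Prestige)
summary(lm.6)   # 13 coefficients: intercept + 2 (quadratic) + 5 + 5 (splines)
```

Note that, as in the text, the factor type is deliberately omitted from this model.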
FIGURE 7.18: Effect displays with partial residuals for LinearModel.6 (prestige ∼ ns(education, df=5) + ns(income, df=5) + poly(women, degree=2, raw=TRUE)) fit to the Prestige data.
7.7 Confidence Intervals and Hypothesis Tests for Coefficients
The Models menu includes several menu items for constructing confidence intervals and performing hypothesis tests for regression coefficients. As I explained, tests for individual coefficients in linear and generalized linear models appear in the model summaries. This section describes how to perform more elaborate tests, for example, for a related subset of coefficients.
7.7.1 Confidence Intervals
To illustrate, I'll make Duncan's occupational prestige regression (RegModel.1 in Figure 7.2 on page 131) the active statistical model in the R Commander, which automatically makes Duncan the active data set.
Selecting Models > Confidence intervals from the R Commander menus leads to the simple dialog box at the top of Figure 7.19. Retaining the default 0.95 level of confidence and clicking OK produces the output at the bottom of the figure.
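A command-line sketch of the same computation, assuming the carData package supplies the Duncan data (the object name reg.1 is mine):

```r
library(carData)   # the Duncan data

# RegModel.1: Duncan's occupational prestige regression
reg.1 <- lm(prestige ~ education + income, data = Duncan)

confint(reg.1, level = 0.95)   # 95% confidence intervals for all coefficients
```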
7.7.2 Analysis of Variance and Analysis of Deviance Tables*
You can compute an analysis of variance (ANOVA) table for a linear model or an analogous analysis of deviance table for a generalized linear model via Models > Hypothesis tests > ANOVA table. I'll illustrate with Cowles and Davis's logistic regression model (GLM.7), selecting it as the active model in the session. The ANOVA Table dialog at the top of Figure 7.20 offers three "types" of tests, conventionally named Types I, II, and III:
• In addition to the intercept, there are four terms in the Cowles and Davis model: sex, neuroticism, extraversion, and the neuroticism:extraversion interaction. Type I tests are sequential, and thus test (in a short-hand terminology) sex ignoring everything else; neuroticism after sex but ignoring extraversion and the neuroticism:extraversion interaction; extraversion after sex and neuroticism ignoring the interaction; and the neuroticism:extraversion interaction after all other terms. Sequential tests are rarely sensible.
• Type II tests are formulated in conformity with the principle of marginality: sex after all other terms, including the neuroticism:extraversion interaction; neuroticism after sex and extraversion but ignoring the neuroticism:extraversion interaction to which the neuroticism "main effect" is marginal; similarly, extraversion after sex and neuroticism but ignoring the neuroticism:extraversion interaction; and neuroticism:extraversion after all the other terms. More generally, each term is tested ignoring terms to which it is marginal (i.e., ignoring its higher-order relatives). This is generally a sensible approach and is the default in the dialog.
Type II tests for Cowles and Davis's logistic regression are shown at the bottom of Figure 7.20. For a generalized linear model like Cowles and Davis's logistic regression, the R Commander computes an analysis of deviance table with likelihood ratio tests, which entails refitting the model (in some instances twice) for each test. In this example, each likelihood ratio chi-square test has one degree of freedom because each term in the model is represented by a single coefficient. Because the neuroticism:extraversion interaction is highly statistically significant, I won't interpret the tests for the neuroticism and extraversion main effects, which assume that the interaction is nil. The sex effect is also statistically significant.
FIGURE 7.19: Confidence Intervals dialog and resulting output for Duncan's occupational prestige regression (prestige ∼ education + income).
• Type III tests are for each term after all of the others. This isn't sensible for Cowles and Davis's model, although it might be rendered sensible, for example, by centering each of neuroticism and extraversion at their means prior to fitting the model. Even where Type III tests can correspond to sensible hypotheses, such as in analysis of variance models, they require careful formulation in R.
I suggest that you avoid Type III tests unless you know what you’re doing.
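For readers working at the command line, the three types of tests can be sketched as follows, assuming the car package is installed; base R's anova function gives the sequential (Type I) analysis of deviance, while car's Anova function gives Type II or III tests:

```r
library(car)   # Anova(): Type II and III tests; also loads carData

glm.7 <- glm(volunteer ~ sex + neuroticism*extraversion,
             family = binomial, data = Cowles)

anova(glm.7, test = "LRT")   # Type I (sequential) likelihood ratio tests
Anova(glm.7, type = "II")    # Type II likelihood ratio tests (the default)
```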
FIGURE 7.20: ANOVA Table dialog and resulting Type II tests for Cowles and Davis's logistic regression (volunteer ∼ sex + neuroticism*extraversion).
FIGURE 7.21: Summary of a regression model fit to Duncan's occupational prestige data forcing the education and income coefficients to be equal (prestige ∼ I(education + income)).
7.7.3 Tests Comparing Models*
The R Commander allows you to compute a likelihood ratio F test or chi-square test for two regression models, one of which is nested within the other.29 To illustrate, I return to the Duncan data set and fit a version of Duncan's regression that sets the coefficients of education and income equal to each other, specifying the linear-model formula prestige ∼ I(income + education).30 This at least arguably makes sense in that both explanatory variables are percentages—respectively of high school graduates and of high-income earners. The resulting model, LinearModel.8, is summarized in Figure 7.21.
Picking Models > Hypothesis tests > Compare two models leads to the dialog at the top of Figure 7.22. I select the more general RegModel.1 as the first model and the more specific, constrained, LinearModel.8 as the second model, but the order of the selections is immaterial—the same F test is produced in both cases. Clicking OK in the dialog box results in the output at the bottom of Figure 7.22. The hypothesis of equal coefficients for education and income is plausible, p = 0.79—after all, the two coefficients are quite similar in the original regression (Figure 7.2 on page 131), b_education = 0.55 and b_income = 0.60.
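A command-line sketch of the same model comparison (base R's anova computes the F test; the Duncan data come from the carData package):

```r
library(carData)   # the Duncan data

reg.1 <- lm(prestige ~ education + income, data = Duncan)    # unconstrained
lm.8  <- lm(prestige ~ I(income + education), data = Duncan) # equal slopes

# F test of the nested models, i.e., of H0: beta_education = beta_income
anova(lm.8, reg.1)
```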
FIGURE 7.22: Compare Models dialog and resulting output for Duncan's occupational prestige regression, testing the equality of the education and income regression coefficients.
7.7.4 Testing Linear Hypotheses*
The menu selection Models > Hypothesis tests > Linear hypothesis allows you to formulate and test general linear hypotheses about the coefficients in a regression model. To illustrate, I'll again use Duncan's occupational prestige regression, RegModel.1. Figures 7.23 and 7.24 show the Test Linear Hypothesis dialog set up to test two different hypotheses, along with the corresponding output:
1. In Figure 7.23, H0: 1 × β_education − 1 × β_income = 0 (i.e., H0: β_education = β_income). This is the same hypothesis that I tested by the model-comparison approach immediately above, and of course it produces the same F test.
2. In Figure 7.24, H0: 1 × β_education = 0, 1 × β_income = 0, which is equivalent to, and thus produces the same F test as, the omnibus null hypothesis in the linear model summary output, H0: β_education = β_income = 0 (see Figure 7.2 on page 131). Because the linear hypothesis consists of two equations, the F statistic for the hypothesis has two df in the numerator.
There may be up to as many equations in a linear hypothesis as the number of coefficients in the model, with the number of equations controlled by the slider at the top of the dialog. The equations must be linearly independent of one another—that is, they may not be redundant. Initially, all of the cells in each row are 0, including the cell representing the right-hand side of the hypothesis, which is usually left at 0. The Test Linear Hypothesis dialog for a linear model provides for an optional "sandwich" estimator of the coefficient covariance matrix, which may be used to adjust statistical inference for autocorrelated or heteroscedastic errors (nonconstant error variance).
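The same two hypotheses can be sketched at the command line with the car package's linearHypothesis function, which accepts hypotheses written as equations:

```r
library(car)   # linearHypothesis(); also loads the carData package

reg.1 <- lm(prestige ~ education + income, data = Duncan)

# Hypothesis 1: beta_education - beta_income = 0
linearHypothesis(reg.1, "education - income = 0")

# Hypothesis 2: two equations, hence an F statistic with 2 numerator df
linearHypothesis(reg.1, c("education = 0", "income = 0"))

# Optional sandwich coefficient-covariance estimator, e.g., for
# heteroscedasticity-consistent inference
linearHypothesis(reg.1, "education - income = 0", white.adjust = "hc3")
```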
FIGURE 7.23: Testing a linear hypothesis for Duncan's occupational prestige regression: H0: β_education = β_income.
FIGURE 7.24: Testing a linear hypothesis for Duncan's occupational prestige regression: H0: β_education = β_income = 0.
7.8 Regression Model Diagnostics*
Regression diagnostics are methods for determining whether a regression model that’s been fit to data adequately summarizes the data. For example, is a relationship that’s assumed to be linear actually linear? Do one or a small number of influential cases unduly affect the results?
Many standard methods of regression diagnostics are implemented in the R Commander Models > Numerical diagnostics and Models > Graphs menus—indeed, too many to cover in detail in this already long chapter. Luckily, most of the diagnostics dialogs are entirely straightforward, and some of the diagnostics menu items produce results directly, without invoking a dialog box. I'll illustrate with Duncan's occupational prestige regression (RegModel.1 in Figure 7.2 on page 131). As usual, I assume that the statistical methods covered here are familiar. Regression diagnostics are taken up in many regression texts; see, in particular, Fox (2016), Weisberg (2014), Cook and Weisberg (1982), or (for a briefer treatment) Fox (1991).
The numerical diagnostics available in the R Commander include generalized variance-inflation factors (Fox and Monette, 1992) for diagnosing collinearity in linear and generalized linear models; the Breusch–Pagan test for nonconstant error variance in a linear model (Breusch and Pagan, 1979), independently proposed by Cook and Weisberg (1983); the Durbin–Watson test for autocorrelated errors in linear time-series regression (Durbin and Watson, 1950, 1951); the RESET test for nonlinearity in a linear model (Ramsey, 1969); and a Bonferroni outlier test based on the studentized residuals from a linear or generalized linear model (see, e.g., Fox, 2016, Chapter 11).
I'll illustrate with Models > Numerical diagnostics > Breusch-Pagan test for heteroscedasticity, leading to the dialog at the top of Figure 7.25. The default is to test for error variance that increases (or decreases) with the level of the response (through the Fitted values), but the dialog is flexible and accommodates dependence of the error variance on the explanatory variables or on a linear predictor based on any variables in the data set. I leave the dialog at its defaults, producing the output at the bottom of Figure 7.25. There is, therefore, no evidence that the variance of the errors in Duncan's regression depends on the level of the response.
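A rough command-line equivalent of this menu item, on the assumption (mine) that the R Commander's dialog is backed by the bptest function from the lmtest package:

```r
library(lmtest)    # bptest(): Breusch-Pagan test
library(carData)   # the Duncan data

reg.1 <- lm(prestige ~ education + income, data = Duncan)

# Test whether the error variance changes with the level of the response,
# i.e., with the fitted values (the dialog's default)
bptest(prestige ~ education + income,
       varformula = ~ fitted.values(reg.1),
       studentize = FALSE, data = Duncan)
```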
There are also many graphical diagnostics available through the R Commander: basic diagnostic plots produced by the R plot function applied to a linear or generalized linear model; residual quantile-comparison plots, for example to diagnose non-normal errors in a linear model; component-plus-residual (partial-residual) plots for nonlinearity in additive linear or generalized linear models;31 added-variable plots for diagnosing unusual and influential data in linear and generalized linear models; and an "influence plot"—a diagnostic graph that simultaneously displays studentized residuals, hat-values (leverages), and Cook's distances.
I'll selectively demonstrate these graphical diagnostics by applying an influence plot, added-variable plots, and component-plus-residual plots to Duncan's regression (RegModel.1), making it the active model. I invite the reader to explore the other diagnostics as well.
Selecting Models > Graphs > Influence plot produces the dialog box at the top of Figure 7.26. I've left all of the selections in the dialog at their default values, including automatic identification of unusual points:32 Two cases are selected from each of the most extreme studentized residuals, hat-values, and Cook's distances, potentially identifying up to six points (although this is unlikely to happen because influential points combine high leverage with a large residual). The resulting graph appears at the bottom of the figure, and, as it turns out, four relatively unusual points are identified: The occupation RR.engineer (railroad engineer) is a high-leverage point but has a small studentized residual; reporter has a relatively large (negative) studentized residual but small leverage; (railroad) conductor and, particularly, minister have comparatively large studentized residuals and moderately high leverage. The areas of the circles are proportional to Cook's influence measure, so minister, combining a large residual with fairly high leverage, has the most influence on the regression coefficients.
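The quantities in an influence plot are all available from base R accessor functions; here is a sketch for Duncan's regression (car's influencePlot function draws the graph itself; the computations below show what it displays):

```r
library(carData)   # the Duncan data

reg.1 <- lm(prestige ~ education + income, data = Duncan)

rs <- rstudent(reg.1)        # studentized residuals (vertical axis)
hv <- hatvalues(reg.1)       # hat-values / leverages (horizontal axis)
cd <- cooks.distance(reg.1)  # Cook's distances (areas of the circles)

# The four occupations singled out in the text:
round(cbind(rstudent = rs, hatvalue = hv, CookD = cd)[
  c("RR.engineer", "reporter", "conductor", "minister"), ], 3)
```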
FIGURE 7.25: Breusch-Pagan Test dialog and resulting output for Duncan's occupational prestige regression (prestige ∼ education + income).
FIGURE 7.26: Influence Plot dialog and resulting graph for Duncan's occupational prestige regression (prestige ∼ education + income).
Models > Graphs > Added-variable plots leads to the dialog box at the top of Figure 7.27; as before, the dialog allows the user to select a method for identifying noteworthy points in the resulting graphs. Once again, I retain the default automatic point identification but now increase the number of points to be identified in each graph from the default two to three.33 Clicking OK produces the graphs at the bottom of Figure 7.27. The slope of the least-squares line in each added-variable plot is the coefficient for the corresponding explanatory variable in the multiple regression, and the plot shows how the cases influence the coefficient—in effect, transforming the multiple regression into a series of simple regressions, each with other explanatory variables controlled.
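This property can be verified directly: regress the response and the focal predictor each on the remaining predictors, and the slope of the simple regression of the residuals reproduces the multiple-regression coefficient. A base-R sketch for income in Duncan's regression (car's avPlots function draws the plots themselves):

```r
library(carData)   # the Duncan data

reg.1 <- lm(prestige ~ education + income, data = Duncan)

# Added-variable plot for income, constructed by hand:
e.y <- residuals(lm(prestige ~ education, data = Duncan))  # y adjusted for education
e.x <- residuals(lm(income ~ education, data = Duncan))    # income adjusted for education

# The simple-regression slope of e.y on e.x equals the multiple-regression
# coefficient of income
coef(lm(e.y ~ e.x))[2]
```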
FIGURE 7.28: Component+Residual Plots dialog and resulting graphs for Duncan's occupational prestige regression (prestige ∼ education + income).
Finally, in the Models menu, Add observation statistics to data allows you to add fitted values (i.e., ŷ), residuals, studentized residuals, hat-values, Cook's distances, and observation indices (1, 2, …, n) to the active data set. These quantities, which (with the exception of observation indices) are named for the model to which they belong (e.g., residuals.RegModel.1), may then be used, for example, to create customized diagnostic graphs, such as an index plot of Cook's distances versus observation indices.
7.9 Model Selection*
The R Commander Models menu includes modest facilities for comparing regression models and for automatic model selection. The menu items Models > Akaike Information Criterion (AIC) and Models > Bayesian Information Criterion (BIC) print the AIC or BIC model selection statistics for the current statistical model. Models > Stepwise model selection performs stepwise regression for a linear or generalized linear model, while Models > Subset model selection performs all-subsets regression for a linear model. Although I've never been terribly enthusiastic about automatic model selection methods, I believe that these methods do have a legitimate role, if used carefully, primarily in pure prediction applications.
I'll illustrate model selection with the Ericksen data set in the car package. The data, described by Ericksen et al. (1989), concern the 1980 U.S. Census undercount, and pertain to 16 large cities in the United States, the remaining parts of the states to which these cities belong (e.g., New York State outside of New York City), and the other states. There are, therefore, 66 cases in all. In addition to the estimated percentage undercount in each area, the data set contains a variety of characteristics of the areas. A linear least-squares regression of undercount on the other variables in the data set reveals that some of the predictors are substantially collinear.
I fit this initial linear model with the formula undercount ∼ . and obtained variance-inflation factors for the regression coefficients via Models > Numerical diagnostics > Variance-inflation factors. The relevant output is in Figure 7.29.
Choosing Models > Subset model selection produces the dialog box at the top of Figure 7.30. All of the options in this dialog remain at their defaults, including using the BIC for model selection. Pressing the OK button produces the graph at the bottom of Figure 7.30, plotting the "best" model of each size k = 1, …, 9, according to the BIC. The predictors included in each model are represented by filled-in squares, and smaller (i.e., larger-in-magnitude negative) values of the BIC represent "better" models. Notice that the regression intercept is included in all of the models. The best model overall, according to the BIC, includes the four predictors minority, crime, language, and conventional. I invite the reader to try stepwise model selection as an alternative.
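At the command line, these menu items correspond (per footnote 36) to regsubsets in the leaps package and stepAIC in MASS; a sketch, with object names of my own:

```r
library(leaps)     # regsubsets(): all-subsets regression
library(MASS)      # stepAIC(): stepwise selection
library(carData)   # the Ericksen data

m.full <- lm(undercount ~ ., data = Ericksen)   # the initial, collinear model

# All-subsets selection, displayed against the BIC (the dialog's default)
sel <- regsubsets(undercount ~ ., data = Ericksen, nbest = 1, nvmax = 9)
plot(sel, scale = "bic")

# Stepwise selection; k = log(n) makes the penalty equivalent to the BIC
m.step <- stepAIC(m.full, k = log(nrow(Ericksen)), trace = FALSE)
formula(m.step)
```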
FIGURE 7.29: Regression output and variance-inflation factors for Ericksen et al.'s Census undercount data, fitting the linear model undercount ∼ ..
FIGURE 7.30: Subset Model Selection dialog and resulting graph for Ericksen et al.'s Census undercount data.
1 As explained in Section 7.2, Equation 7.1 also serves for more general linear models, where (some of) the xs may not be numeric explanatory variables but rather dummy regressors representing factors, interaction regressors, polynomial regressors, and so on.
2 Recall that you must use the double equals sign (==) to test for equality; see Table 4.4 (page 71).
3 A vector is a one-dimensional array, here of numbers.
4 The sequence operator : creates an integer (whole-number) sequence, so 1:25 generates the integers 1 through 25. The c function combines its arguments into a vector, so -c(6, 16) creates a two-element vector containing the numbers −6 and −16.
5 See Section 4.4.2, and in particular Table 4.4 (page 71), for information on R expressions.
6 The arguments of an R function are the values given in parentheses when the function is called; if there is more than one argument, they are separated by commas.
7 As formulated by Nelder (1977), the principle of marginality is deeper and more general than this characterization, but thinking of the principle in these simplified terms will do for our purposes.
8 See later in this section for an explanation of how factors are handled in linear-model formulas.
9 See Table 4.2 (on page 61) for an explanation of the variables in the Prestige data set.
10 The second toolbar may be used to enter regression-spline and polynomial terms into the model. I'll describe this feature in Section 7.3. If you're unfamiliar with regression splines or polynomials, simply ignore this toolbar.
11 The Linear Model dialog also includes a box for subsetting the data, and a drop-down variable list for selecting a weight variable for weighted-least-squares (as opposed to ordinary-least-squares) regression. Neither subsetting nor a weight variable is used in this example.
12 The "T" in the names of the dummy-variable coefficients refers to "treatment contrasts"—a synonym for 0/1 dummy regressors—discussed further in Section 7.2.4.
13 That the coefficient of women in the additive model is small doesn't imply that women and type don't interact. My true motive here is to simplify the example.
14 Recall that a number like -3.354e+01 is expressed in scientific notation, and is written conventionally as −3.354 × 10¹ = −33.54.
15 The function contr.Treatment is a modified version of the standard R function contr.treatment; contr.Treatment generates slightly easier to read names for the dummy variables—for example, type[T.wc] rather than typewc. Similarly, contr.Sum and contr.Helmert, discussed below, are modifications of the standard R functions contr.sum and contr.helmert.
16 Multi-way analysis of variance in the R Commander, discussed in Section 6.1.3, uses contr.Sum for the factors in the ANOVA model.
17 Reader: Can you see how the coefficients for type are related to each other across the two parametrizations of the model?
18 The regressors of an orthogonal polynomial are uncorrelated, while those of a raw polynomial are just powers of the variable—for example, women and women². The fit of raw and orthogonal polynomials to the data is identical: They are just alternative parametrizations of the same regression. Raw polynomials may be preferred for simplicity of interpretation of the individual regression coefficients, but orthogonal polynomials tend to produce more numerically stable computations.
19 Also see the discussion of partial residuals in effect plots in Section 7.6.
20 To make this example a little simpler, I've omitted occupational type from the regression.
21 In this revised model, however, where the partial relationships of prestige to education and income are modeled more adequately, the quadratic coefficient for women is statistically significant, with p = 0.01.
22 I'm grateful to Caroline Davis of York University for making the data available.
23 You're not constrained to select focal explanatory variables that correspond to high-order terms in the model, or even to terms in the model. For example, for Cowles and Davis's logistic regression, you could select all three explanatory variables, sex, neuroticism, and extraversion, even though the three-way interaction among these variables isn't in the model. In this case, the effect plot would be drawn for combinations of the values of the three explanatory variables.
24 This strategy is used in general for effect plots of generalized linear models in the R Commander: The vertical axis is drawn on the scale of the linear predictor—the scale on which the model is linear—but labelled on the generally more interpretable scale of the response.
25 Smoothing scatterplots is discussed in Section 5.4.
26 Recall, however, that the explanatory variable women isn't included in this model.
27 For a generalized linear model, the Confidence Intervals dialog provides an option to base confidence intervals on either the likelihood ratio statistic or on the Wald statistic. The former requires more computation, but it is the default because confidence intervals based on the likelihood ratio statistic tend to be more accurate.
28 For Type III tests to address sensible hypotheses, the contrasts used for factors in an ANOVA model must be orthogonal in the basis of the design. The functions contr.Sum, contr.Helmert, and contr.poly produce contrasts with this property, but the R Commander default dummy-coded contr.Treatment does not. For this reason, the R Commander Multi-Way Analysis of Variance dialog (Section 6.1.3) uses contr.Sum to fit ANOVA models (see, in particular, Figures 6.9 and 6.10, pages 118–119), and so the resulting models can legitimately employ Type III tests.
29 F tests are computed for linear models and for generalized linear models (such as quasi-Poisson models) for which there's an estimated dispersion parameter; chi-square tests are computed for generalized linear models (such as binomial models) for which the dispersion parameter is fixed. The same is true of analysis of variance and analysis of deviance tables.
30 Recall that the identity (or inhibit) function I is needed here so that + is interpreted as addition.
31 In Section 7.6, I showed how to add partial residuals to effect displays. For an additive model, that approach produces traditional component-plus-residual plots, but it is more flexible in that it can be applied as well to models with interactions.
32 The default Automatic point identification has the advantage of working in the R Markdown document produced by the R Commander. As explained in Section 5.4.4, graphs that require direct interaction are not included in the R Markdown document.
33 If you want to experiment with automatic identification of differing numbers of points, press the Apply button rather than OK.
34 I'll leave it as an exercise for the reader to remove the occupations minister (case 6) and conductor (case 16) from the data and refit the regression—most conveniently via the Subset expression box in the Linear Model or Linear Regression dialog; you can use the subset -c(6, 16).
35 Ericksen et al. (1989) performed a more sophisticated weighted-least-squares regression.
36 All-subsets regression in the R Commander is performed by the regsubsets function in the leaps package (Lumley and Miller, 2009), while stepwise regression is performed by the stepAIC function in the MASS package (Venables and Ripley, 2002). Although the distinction is not relevant to this example, where the full model is additive and all terms have one degree of freedom, stepAIC respects the structure of the model—for example, keeping dummy variables for a factor together and considering only models that obey the principle of marginality—while regsubsets does not.
RCH 8303, Quantitative Data Analysis 1
Course Learning Outcomes for Unit VII
Upon completion of this unit, students should be able to:
1. Perform statistical tests using software tools.
1.1 Perform simple linear regression using appropriate data file and menu options.
2. Explain results of statistical tests.
2.1 Describe the selection process of the variables in the data file.
2.2 Discuss the differences between alternative hypotheses.
2.3 Elaborate on options available for missing or incomplete data.
2.4 Describe the assumptions for simple linear regression.
2.5 Contrast the differences between association and prediction.
2.6 Describe homoscedasticity.
2.7 Describe dummy-coding and when this would be used in regression.
3. Judge whether null hypotheses should be rejected or maintained.
3.1 Explain the differences between the null and alternative hypotheses, and perform option
3.2 Explain the difference between R and R².
Learning activities: Chapter 7, pp. 129–144; Unit VII Assignment 1; Unit VII Assignment 2 (addressing outcomes 2.1–2.5).
Required Unit Resources
Chapter 7: Fitting Linear and Generalized Linear Models, pp. 129–144
Unit VII Plan
The Unit VII assignment will be in two parts. Part 1 requires you to complete one module of the CITI Program EOSA that relates directly to the readings in this unit. The module has a final quiz that must be completed and successfully passed, demonstrating your knowledge of basic statistics. For Part 2, you will review how to conduct a simple linear regression and determine whether the test is statistically significant or not.
There is one topic for the Unit VII CITI EOSA course.
Simple Linear Regression (ID 17634): This module describes and explains differences among association, prediction, and causality. The module describes the assumptions of linear regression and what to do if the data violate one or more of the assumptions. The module also demonstrates how to enter continuous, dichotomous, and categorical predictors into a regression model.
UNIT VII STUDY GUIDE: Simple Linear Regression
Unit VII Lesson
Unit VII turns to a different form of testing. Units IV, V, and VI conducted tests that compared means, and in some cases a relationship, or even causation, could be determined. The focus of Unit VII is regression, a methodology that allows the researcher to use one or more predictor (independent) variables to explain variability in an outcome (dependent) variable. For example, could the researcher explain the variability in the outcome variable cancer using the predictor variable smoking? Another way of looking at this could be, "Can smoking be a predictor of cancer?" A researcher could gather data on whether smoking could or would predict cancer in a sample of smokers.
R and R Commander make it very easy to conduct simple statistical tests.
As noted in Unit III, once data are collected, a researcher needs to be able to describe, summarize, and, potentially, detect patterns in the recorded data, using displays such as histograms. After reviewing the data, the researcher must decide whether the assumptions of the particular test have been met. If they have, the test can proceed. Tutorials for this lesson are provided in the Dissertation Center; reviewing the Testing for Normality tutorial will be very helpful.
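As a quick console-level sketch of a normality check, the Shapiro-Wilk test and a histogram are common starting points. The variable below is simulated stand-in data, not the course data set:

```r
# Simulated sample of weights (an assumption for illustration only).
set.seed(42)
weights <- rnorm(30, mean = 170, sd = 25)

shapiro.test(weights)   # Shapiro-Wilk test: p > .05 means no evidence against normality
hist(weights)           # visual check of the distribution's shape
```

Note that for regression, the normality assumption applies to the residuals of the fitted model, so the same check is usually repeated on the residuals after fitting.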
Before conducting any statistical test, though, the researcher must first meet the assumptions of the particular test. The Simple Linear Regression (ID 17634) module describes and explains each of the assumptions for simple linear regression.
For an example of simple linear regression, make sure that when you access R, you also load R Commander. Type library(Rcmdr), or see Unit I for a refresher on how to gain access to R Commander. Once R and R Commander have been loaded, the next step is to load the data set wtandruntimes1 that will be used in this lesson (Figure 1).
Figure 1
Data Set Wtandruntimes1 Successfully Uploaded
Viewing the data set allows a user to examine the categorical information and numeric values (Figure 2).
Figure 2
Visual Representation of Wtandruntimes1 Data Set
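The same import-and-inspect step can be scripted at the R console. This is a minimal sketch: the file name wtandruntimes1.csv is an assumption, and the simulated fallback only mimics the two variables shown in the data set.

```r
# library(Rcmdr)   # run interactively to open the R Commander GUI

# If the course file is available, import it (the file name/path is an assumption):
# wtandruntimes1 <- read.csv("wtandruntimes1.csv")

# Otherwise, a simulated stand-in with the same two variables lets you follow along:
set.seed(1)
wtandruntimes1 <- data.frame(Weight = round(rnorm(20, mean = 170, sd = 25)))
wtandruntimes1$runtimes <- round(30 + 0.15 * wtandruntimes1$Weight + rnorm(20, sd = 3), 1)

str(wtandruntimes1)    # variable names and types
head(wtandruntimes1)   # first few rows
```

In R Commander, the same inspection is available through the View data set button once the data set is the active one.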
In this test, our research question and hypotheses could be written as:
RQ: Does a person’s weight predict their runtime?
H0: A person’s weight does not predict their runtime.
HA: A person’s weight predicts their runtime.
The first step is to view a scatterplot. In a scatterplot, one variable is plotted on the x-axis and the other
variable is plotted on the y-axis. Select Graphs and Scatterplot (Figure 3).
Figure 3
Scatterplot Selection Menu
Once the menu item is selected, place the Weight variable on the x-axis and the runtimes variable on the y-axis (Figure 4).
Figure 4
Scatterplot Variable Selection Menu
Press “OK,” and view the scatterplot in the output window (Figure 5).
Note that the scatterplot illustrates a positive (upward) line, with six data points falling off the regression line.
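The scatterplot can also be drawn from the console with base R. The data frame below is a simulated stand-in for wtandruntimes1 (an assumption, since the course file is not reproduced here), so the points will not match the figure exactly:

```r
# Simulated stand-in data (an assumption about the course file's structure).
set.seed(1)
wtandruntimes1 <- data.frame(Weight = round(rnorm(20, mean = 170, sd = 25)))
wtandruntimes1$runtimes <- round(30 + 0.15 * wtandruntimes1$Weight + rnorm(20, sd = 3), 1)

# Weight on the x-axis, runtimes on the y-axis.
plot(runtimes ~ Weight, data = wtandruntimes1,
     xlab = "Weight", ylab = "runtimes", main = "Runtimes by Weight")
abline(lm(runtimes ~ Weight, data = wtandruntimes1))   # add the least-squares line
```

An upward-sloping line, as in the course figure, is consistent with a positive relationship between the two variables.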
Our next step is to run the linear regression to determine whether our Weight predictor (independent) variable can explain variability in the runtimes outcome (dependent) variable (Figure 6).
Figure 6
Linear Regression Selection Menu
Once linear regression is selected, you must place the Weight and runtimes variables into the appropriate Explanatory or Response positions. Fox (2017) explains the terms used for the dependent and independent variables on page 129. In our case, our dependent (response) variable is runtimes, and our independent (explanatory) variable is Weight (Figure 7).
Figure 7
Variable Selection Menu
Once “OK” is pressed, the results of the simple linear regression are displayed in the output window (Figure 8).
Figure 8
Simple Linear Regression Test Output Display
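At the console, the whole fit reduces to one call to lm(). This sketch reuses simulated stand-in data (an assumption), so its coefficients, F statistic, and R² will differ from the course output:

```r
# Simulated stand-in for wtandruntimes1 (an assumption for illustration only).
set.seed(1)
wtandruntimes1 <- data.frame(Weight = round(rnorm(20, mean = 170, sd = 25)))
wtandruntimes1$runtimes <- round(30 + 0.15 * wtandruntimes1$Weight + rnorm(20, sd = 3), 1)

# Response (dependent) variable on the left of ~, explanatory (independent) on the right.
RegModel.1 <- lm(runtimes ~ Weight, data = wtandruntimes1)
summary(RegModel.1)   # slope, intercept, F(1, 18), p-value, and R-squared

# The standard diagnostic plots (residuals vs. fitted, normal Q-Q,
# scale-location, residuals vs. leverage) used to check the assumptions.
par(mfrow = c(2, 2))
plot(RegModel.1)
```

With n = 20 observations and one predictor, the F test has 1 and 18 degrees of freedom, matching the F(1, 18) form reported in the write-up below.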
Note that from the output, the model is significant (p < .001) and that 72.7% of the variability in runtimes can be attributed to a person’s weight.
Next, you must examine the regression diagnostics to address the assumptions of the test (Figure 9).
Figure 9
Regression Diagnostics Menu Option
Once “OK” is selected, the diagnostic graphs appear in the output (Figure 10).
Figure 10
Regression Diagnostic Output
Pages 163–170 of the textbook discuss various options applicable to your specific test.
The results of this test could be written as follows: A simple regression analysis was performed to examine whether a person’s weight predicts their runtime. The results of the test were significant, F(1, 18) = 47.91, p < .001, R² = .727. This can be interpreted as 73% of the variability in runtime being attributable to a person’s weight.
In conclusion, the simple regression test discussed in this unit has asked the question: Can I predict which variable has the most effect on the dependent variable? Or, put another way: Can runtimes be predicted with weight scores? Our final unit, Unit VIII, will expand on regression. However, instead of only one predictor (independent) variable, Unit VIII focuses on multiple regression, in which the researcher can have many independent variables. In Unit VIII, we will ask the following question: Can I predict which variable(s) has the most effect on the dependent variable? Or, put another way: Can runtimes be predicted with multiple variables?
Reference
Fox, J. (2017). Using the R Commander: A point-and-click interface for R. CRC Press.
Learning Activities (Nongraded)
Nongraded Learning Activities are provided to aid students in their course of study. You do not have to submit them.
If you have questions, contact your instructor for further guidance and information. When studying APA formatting, pay particular attention to the sections that pertain to formatting for research and statistics.