4 Multiple Linear Regression

This chapter focuses on how to perform a multiple linear regression: regression with more than 1 independent variable. In addition, it will discuss how to obtain standardized coefficients.

As always, we begin our R script by loading relevant libraries and by loading our data. These libraries are already installed on university computers, but must be loaded prior to use.

#Packages
library(rio)             #loading data
library(tidyverse)       #data manipulation and plotting
library(broom)           #for obtaining summaries of regression models
library(parameters)      #for calculating standardized coefficients

##Import data
demdata <- import("demdata.rds") |> 
  as_tibble()

1: You do not always need as_tibble() as shown here. We do this here because it is easier to show larger datasets with them. See Statistics I, 2.1.

4.1 Performing a Multiple Linear Regression

In this example we will regress a measure of the level of electoral democracy in a country (v2x_polyarchy) on three predictor variables (2 continuous and 1 binary):

cpi: CPI stands for “corruption perception index” and is a measure of the extent of corruption in the country’s public sector; higher values on this variable indicate less corruption
v2caviol: This is a measure concerning the extent of political violence by non-state actors with higher values associated with higher levels of political violence
TYPEDEMO1984: A binary variable indicating whether the country was a democracy or an autocracy in 1984.

Recall that we include binary/categorical variables in regression models by first converting them into a factor variable.

#Convert binary variable into a factor
demdata <- demdata |> 
  mutate(TYPEDEMO1984_factor = factorize(TYPEDEMO1984))

1: factorize() can be used here because this variable has value labels. Otherwise, we would need to use factor() and supply the labels ourselves. See Section A.2 .

We fit a multiple linear regression using the same command that we used to perform a bivariate regression: lm(). We add multiple predictor variables to the model via the use of a + sign as in this example:

#Run and store the model 
model_multiple <- lm(v2x_polyarchy ~ cpi + v2caviol + TYPEDEMO1984_factor, 
                     data=demdata)

model_multiple <-: Here we tell R to create a new data object called model_multiple that will store the results from our regression. You would change this to a name of your choosing in your examples.
lm(v2x_polyarchy ~: Here we tell R that we want to perform a linear regression (lm() = the command for a linear model) and that the dependent variable is named v2x_polyarchy. This variable is placed to the left of the tilde (~).
cpi + v2caviol + TYPEDEMO1984_factor,: Here we tell R what independent variables we want to include in the model. We separate each variable with a + sign. Changing the order of the independent variables (e.g., TYPEDEMO1984 + v2caviol + cpi) would produce the same model results.
data = demdata): Finally, we tell R the name of the object where our data is stored. This information comes after a ‘,’ after the final independent variable.

We can obtain a summary of our coefficients with the summary() command:

summary(model_multiple)


Call:
lm(formula = v2x_polyarchy ~ cpi + v2caviol + TYPEDEMO1984_factor, 
    data = demdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56402 -0.09376  0.01442  0.12926  0.34206 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     0.187394   0.042634   4.395 2.19e-05 ***
cpi                             0.006365   0.001059   6.012 1.55e-08 ***
v2caviol                       -0.008724   0.012258  -0.712    0.478    
TYPEDEMO1984_factorDemocracies  0.152698   0.034915   4.373 2.39e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1807 on 138 degrees of freedom
  (37 observations deleted due to missingness)
Multiple R-squared:  0.5068,    Adjusted R-squared:  0.4961 
F-statistic: 47.26 on 3 and 138 DF,  p-value: < 2.2e-16

Interpretation

How we interpret the coefficients is quite similar to how we did with bivariate models, but we now have to incorporate the fact that there are multiple variables in the model into our understanding.

The “(Intercept)” value tells us what value we should expect to observe, on average, when all of the included independent variables = 0. If we could observe countries with a value of 0 on the cpi variable AND a value of 0 on v2caviol AND which were coded as an autocracy in 1984, then we’d expect to observe an average electoral democracy score of 0.19.

The coefficients for the independent variables can continue to be interpreted as telling us about the slope of a line (continuous variables) or the difference between categories (factor variables). However, they now tell us about the effect of the independent variable while “holding the effect of the other (predictor) variables constant”. For instance:

v2caviol: Electoral democracy scores are expected to decrease by -0.01 scale points with each one unit increase on the political violence scale, holding the effects of prior regime status and corruption constant.
TYPEDEMO1984_factor: If we were to compare countries with the same level of political violence and corruption, then we’d expect the average 2020 electoral democracy score to be 0.15 scale points higher in countries coded as “Democracies” in 1984 than those coded as “Autocracies”.

4.2 Standardized Coefficients

Researchers sometimes report standardized coefficients rather than the default unstandardized coefficients. We can use the standardize_parameters() function from the parameters package to obtain these results.

multiple_std <- standardize_parameters(model_multiple, 
                       method = "refit")

Here is how to read the syntax above:

multiple_std <-: This assigns the results of our command to a new data object called multiple_std.
standardize_parameters(: This is the name of the command.
model_multiple,: This is the name of the linear regression model object that we have previously saved and which we want to standardize.
method = 'refit'): This specifies what type of standardization we want. The refit option is the default option. It will, in the background, standardize the DV and IVs in the model and then refit the regression model using the standardized versions of the variables.

standardize_parameters() produces a dataframe with the following columns:

glimpse(multiple_std)

Rows: 4
Columns: 5
$ Parameter       <chr> "(Intercept)", "cpi", "v2caviol", "TYPEDEMO1984_factor…
$ Std_Coefficient <dbl> -0.23661987, 0.49272847, -0.05393177, 0.60000039
$ CI              <dbl> 0.95, 0.95, 0.95, 0.95
$ CI_low          <dbl> -0.3957424, 0.3306681, -0.2037814, 0.3287296
$ CI_high         <dbl> -0.07749731, 0.65478881, 0.09591783, 0.87127121

Output Explanation

Parameter: This provides the name of the terms/variables in the model
Std_Coefficient: The value of the standardized coefficient associated with each variable
CI: The level of the confidence interval for the standardized coefficient
CI_low and CI_high: The lower and upper bounds of the confidence interval. These values will be combined into one cell when we call the results below.

Let us see the results and how they compare to our original results. We will use tidy() to simplify the output of the original model.

#Original Results
tidy(model_multiple)

# A tibble: 4 × 5
  term                           estimate std.error statistic      p.value
  <chr>                             <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)                     0.187     0.0426      4.40  0.0000219   
2 cpi                             0.00636   0.00106     6.01  0.0000000155
3 v2caviol                       -0.00872   0.0123     -0.712 0.478       
4 TYPEDEMO1984_factorDemocracies  0.153     0.0349      4.37  0.0000239

#Standardized
multiple_std

# Standardization method: refit

Parameter                         | Std. Coef. |         95% CI
---------------------------------------------------------------
(Intercept)                       |      -0.24 | [-0.40, -0.08]
cpi                               |       0.49 | [ 0.33,  0.65]
v2caviol                          |      -0.05 | [-0.20,  0.10]
TYPEDEMO1984 factor [Democracies] |       0.60 | [ 0.33,  0.87]

The standardized coefficients for the continuous variables in this example indicate how many standard deviations the DV is expected to change when the continuous predictor changes by 1 standard deviation.¹

standardize_parameters() produces a standardized difference score for factor predictor variables that is equivalent to dividing the difference in the expected mean value of the DV between the two groups being compared by the standard deviation of the dependent variable.²

Interpretation

Democracy scores are expected to decrease by -0.05 standard deviations given a one standard deviation increase on the political violence scale (holding the effect of corruption and prior regime status constant).

Countries that were democratic in 1984 are expected to be 0.6 standard deviations more democratic in the year 2020, on average, than countries that were autocratic in 1984 (holding constant the effect of corruption and political violence).

Warning!

You may have noticed that we did not use summary() or tidy() to look at the standardized coefficients. Those commands are not needed when we are looking at the stored output of standardize_parameters() because that output is already a dataframe.

Using summary() would produce summary statistics for each column in the output as seen here:

summary(multiple_std)

  Parameter         Std_Coefficient         CI           CI_low        
 Length:4           Min.   :-0.2366   Min.   :0.95   Min.   :-0.39574  
 Class :character   1st Qu.:-0.0996   1st Qu.:0.95   1st Qu.:-0.25177  
 Mode  :character   Median : 0.2194   Median :0.95   Median : 0.06247  
                    Mean   : 0.2005   Mean   :0.95   Mean   : 0.01497  
                    3rd Qu.: 0.5195   3rd Qu.:0.95   3rd Qu.: 0.32921  
                    Max.   : 0.6000   Max.   :0.95   Max.   : 0.33067  
    CI_high        
 Min.   :-0.07750  
 1st Qu.: 0.05256  
 Median : 0.37535  
 Mean   : 0.38612  
 3rd Qu.: 0.70891  
 Max.   : 0.87127

Using tidy() on it would produce an error because tidy() is meant for use with objects created from statistical models:

tidy(multiple_std)

Warning in tidy.data.frame(multiple_std): Data frame tidiers are deprecated and
will be removed in an upcoming release of broom.

Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA

Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
na.rm): NAs introduced by coercion

Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
argument is not numeric or logical: returning NA

Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA

Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]):
argument is not numeric or logical: returning NA

Error in x - stats::median(x, na.rm = na.rm): non-numeric argument to binary operator

This is true for the “refit” version of standardization used here. We could ask this command to only standardize the predictor variables while leaving the DV on its original scale by including the option “include_response = F” in our command. The standardized continuous predictor would be interpreted as telling us how much the mean of Y (on its original scale) is expected to change given a 1 standard deviation change in X. This would be sensible in cases where the scale of our dependent variable is quite easy to understand and where standardizing the DV complicates interpretations. For instance, if we were predicting the percentage of votes cast for a political party, it is probably easier to interpret “a 1 standard deviation change in X is associated with a gain of 2% more votes for the party” than “a 1 standard deviation change in X is associated with a gain of 0.3 standard deviations in votes”.↩︎
This standardized difference is not directly comparable to the standardized coefficients for the continuous variables in an important sense. The standardized difference for a factor variable tells us what happens when X changes by its full range, i.e., when X goes from 0 to 1. The standardized coefficients for the continuous variables tell us what happens when X changes by 1 standard deviation…but that is only part of the range of the variable. We can obtain more directly comparative standardized coefficients by including the option two_sd = TRUE in our command. This scales the continuous variables by two standard deviations rather than one and thus gives a better approximation of what a full change in the continuous X leads to.↩︎