5  Predicted & Residual Values

Regression models make predictions about the value of the DV we expect to see given the value of the independent variable(s) in the model. We can use R to investigate predicted values in closer detail to better understand the results of our model.

As always, we begin our R script by loading relevant libraries and by loading our data. These libraries are already installed on university computers, but must be loaded prior to use. We will also run and store the linear regressions that will be used in the subsequent examples.

library(rio)              #loading data
library(tidyverse)        #data manipulation and plotting
library(marginaleffects) #calculating predicted values

# Load data and some data management
demdata <- import("demdata.rds") |> 
  as_tibble()

demdata <- demdata |> 
    mutate(TYPEDEMO1984_factor = factorize(TYPEDEMO1984), 
           Typeregime2006_factor = factorize(Typeregime2006))

# Models for examples
model_continuous <- lm(v2x_polyarchy ~ gini_2019, data = demdata)

model_binary <- lm(v2x_polyarchy ~ TYPEDEMO1984_factor, data=demdata)

model_categorical <- lm(v2x_polyarchy ~ Typeregime2006_factor, data=demdata)

model_multiple <- lm(v2x_polyarchy ~ cpi + v2caviol + TYPEDEMO1984_factor, data=demdata)

5.1 Predicted Values and Residuals for Individual Observations

Linear regression models make predictions about the value of the DV we should expect to observe for each observation used in fitting the model. These unit-level predictions will likely differ from the actual observed value of the DV by some amount; we call the difference between observed and predicted value “residuals” or “prediction errors”.
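The idea that a residual is simply the observed value minus the predicted value can be illustrated with base R's predict() and residuals() functions. This is only a sketch of the concept using the built-in mtcars dataset as a stand-in for our own data:

```r
# Illustrative sketch using the built-in mtcars data (a stand-in for demdata)
m <- lm(mpg ~ wt, data = mtcars)

observed  <- mtcars$mpg   # actual values of the DV
predicted <- predict(m)   # the model's prediction for each observation

# residual = actual - predicted; this matches what residuals() reports
all.equal(unname(observed - predicted), unname(residuals(m)))
```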

We can use the predictions() command from the marginaleffects package to see what our model predicts, and how wrong those predictions are, for each observation used in fitting the model.1

model_binary_predictions <- predictions(model_binary, newdata = demdata) |> 
  as_tibble() #as_tibble() used for ease of display; see Warning below

Here is how to read this syntax:

model_binary_predictions

We are assigning the output of our command to a new object called model_binary_predictions. You would name this whatever you wish in your examples.

predictions(model_binary,

The name of the command is predictions. We place within the parentheses the name of the object where we have stored our model results. You should change this name to whatever you have named the model you want to make predictions from.

newdata = demdata)

Here we tell the command where the data originally come from. This has a specific purpose: it tells the command that the output it creates should include both the predicted values for each observation in the original dataset and all of the other variables in the original dataset for these observations. This can be useful for investigating the characteristics of observations that have particularly sizable residual values. Omitting newdata = demdata, on the other hand, would produce an object that contains predicted values for each observation used in fitting the model, as well as their observed values on the variables used in the model, but not the other variables from the original dataset. You would change demdata to the name of the data object used when fitting your regression model.

Here is what the output looks like:

model_binary_predictions
# A tibble: 179 × 52
   rowid estimate std.error statistic    p.value s.value conf.low conf.high
   <int>    <dbl>     <dbl>     <dbl>      <dbl>   <dbl>    <dbl>     <dbl>
 1     1    0.690    0.0287      24.1  3.16e-128    424.    0.634     0.746
 2     2    0.418    0.0233      17.9  1.16e- 71    236.    0.372     0.463
 3     3    0.690    0.0287      24.1  3.16e-128    424.    0.634     0.746
 4     4    0.690    0.0287      24.1  3.16e-128    424.    0.634     0.746
 5     5    0.418    0.0233      17.9  1.16e- 71    236.    0.372     0.463
 6     6    0.418    0.0233      17.9  1.16e- 71    236.    0.372     0.463
 7     7    0.690    0.0287      24.1  3.16e-128    424.    0.634     0.746
 8     8    0.418    0.0233      17.9  1.16e- 71    236.    0.372     0.463
 9     9   NA       NA           NA   NA             NA    NA        NA    
10    10    0.418    0.0233      17.9  1.16e- 71    236.    0.372     0.463
# ℹ 169 more rows
# ℹ 44 more variables: country_name <chr>, year <dbl>, v2x_polyarchy <dbl>,
#   v2x_libdem <dbl>, v2x_egaldem <dbl>, v2cacamps <dbl>, v2caviol <dbl>,
#   e_peaveduc <dbl>, cpi <dbl>, e_regiongeo <dbl>, e_regionpol_6C <dbl>,
#   v2elcomvot <dbl>, compulsory_voting <dbl>, bicameral <dbl>, dem_diff <dbl>,
#   dem_increase <dbl>, dem_decrease <dbl>, TypeSoc2005 <dbl>,
#   TypeEcon2006 <dbl>, HDI2005 <dbl>, GDP2006 <dbl>, TYPEDEMO1984 <dbl>, …
Output Explanation
  • estimate: This is the predicted value of the DV for the observation based on the regression model
  • std.error, statistic, p.value, conf.low, conf.high: These report the standard error, test statistic, p-value, and the lower and upper bounds of the 95% confidence interval for the estimate. They tell us about the uncertainty around the prediction. The s.value column provides another way of thinking about uncertainty, but one we won’t cover; it is non-examinable.2
  • The ensuing columns in the data frame are the variables from the original dataset; they are noted at the bottom of the output (“44 more variables: country_name <chr>, …”).

We can use this saved data object to calculate the residual value for each observation: the difference between the actual value of the DV and the value our model predicts for that observation. We do this by creating a new variable in which we subtract “estimate” from our DV.

model_binary_predictions <- model_binary_predictions |> 
  mutate(residual_value = v2x_polyarchy - estimate) #residual = actual - predicted

We could then use this newly created variable to investigate the results of our model and consider, for instance, which observations have particularly large residual values. This can be useful when checking the assumptions of our model, which we will learn how to do in Chapter 7.3

model_binary_predictions |> 
  select(country_name, v2x_polyarchy, estimate, residual_value) 
# A tibble: 179 × 4
   country_name v2x_polyarchy estimate residual_value
   <chr>                <dbl>    <dbl>          <dbl>
 1 Mexico               0.647    0.690        -0.0432
 2 Suriname             0.761    0.418         0.343 
 3 Sweden               0.908    0.690         0.218 
 4 Switzerland          0.894    0.690         0.204 
 5 Ghana                0.72     0.418         0.302 
 6 South Africa         0.703    0.418         0.285 
 7 Japan                0.832    0.690         0.142 
 8 Myanmar              0.436    0.418         0.0184
 9 Russia               0.262   NA            NA     
10 Albania              0.485    0.418         0.0674
# ℹ 169 more rows
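One common next step is to sort observations by the size of the absolute residual to find the cases a model predicts worst. With our own object we could do this via model_binary_predictions |> arrange(desc(abs(residual_value))); here is a self-contained base-R sketch of the same idea, again with mtcars standing in for our data:

```r
# Stand-in example: which observations does this model predict worst?
m <- lm(mpg ~ wt, data = mtcars)
resid_value <- mtcars$mpg - predict(m)  # residual = actual - predicted

# Order rows from largest to smallest absolute residual
worst_first <- rownames(mtcars)[order(abs(resid_value), decreasing = TRUE)]
head(worst_first, 3)  # the three worst-predicted cars
```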

5.2 Predicted Values at Specific Values of the IV (Bivariate Model)

We just saw how to obtain model predictions for each individual observation used in creating the model. We may also want to know what we should expect to see, on average, for observations with some specific value of an independent variable. For instance, what is the democracy score we should expect to observe among countries that were autocracies in 1984? Or, what is the democracy score we should expect to observe (based on our model) for countries with a very low level of inequality (say, 25 on our scale)? We can use the predictions() function for this as well.

As a first example, we will use predictions() to calculate the average predicted value among countries that were either an autocracy in 1984 or a democracy in 1984 based on our bivariate model (model_binary).

predictions(model_binary, 
            by = "TYPEDEMO1984_factor") |> 
  as_tibble() #see the Warning box below for why we include as_tibble()
# A tibble: 2 × 9
  TYPEDEMO1984_factor estimate std.error statistic   p.value s.value conf.low
  <fct>                  <dbl>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>
1 Autocracies            0.418    0.0233      17.9 1.16e- 71    236.    0.372
2 Democracies            0.690    0.0287      24.1 3.16e-128    424.    0.634
# ℹ 2 more variables: conf.high <dbl>, df <dbl>
predictions(model_binary,

We first specify the name of the command and then the model from which we want to make predictions.

by = "TYPEDEMO1984_factor")

This tells the command that we want to make predictions “by level of” the variable listed in the parentheses. In this case, we get a prediction for each category of this variable. We can note two things here. First, we use the by = option with factor variables and a different method for continuous variables (see below). Second, we do not specify newdata = here because we want output with average predictions for both categories rather than output that contains data on all individual observations.
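A helpful way to build intuition for what by = is doing: in a bivariate model with a single factor predictor, the predicted value for each category is simply the group mean of the DV. A base-R sketch (mtcars stands in for our data, and the binary variable am plays the role of TYPEDEMO1984_factor):

```r
# Stand-in example: predictions by category equal the group means of the DV
m <- lm(mpg ~ factor(am), data = mtcars)

# Predicted value for each category of the factor
preds <- predict(m, newdata = data.frame(am = c(0, 1)))

# These equal the simple group means of the DV
group_means <- tapply(mtcars$mpg, mtcars$am, mean)
all.equal(unname(preds), as.numeric(group_means))
```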

We can also do this in situations where the predictor variable is continuous in nature. For instance, we might predict the v2x_polyarchy score with a measure of economic inequality (gini_2019) and then wonder: what is the democracy score we expect to observe for a country with a rather low level of inequality (25) versus one with a rather high level of inequality (45)?

predictions(model_continuous, 
            newdata = datagrid(gini_2019 = c(25, 45))) |> 
  as_tibble() #see the Warning box below for why we include as_tibble()
# A tibble: 2 × 11
  rowid estimate std.error statistic  p.value s.value conf.low conf.high
  <int>    <dbl>     <dbl>     <dbl>    <dbl>   <dbl>    <dbl>     <dbl>
1     1    0.764    0.0451      16.9 3.19e-64   211.     0.675     0.852
2     2    0.527    0.0493      10.7 1.10e-26    86.2    0.430     0.623
# ℹ 3 more variables: v2x_polyarchy <dbl>, gini_2019 <dbl>, df <dbl>
newdata = datagrid(gini_2019 = c(25, 45))

This is how we obtain predictions for specific values of a continuous variable. newdata = datagrid( can remain the same in your examples. You would then change gini_2019 to the name of the continuous variable of interest to you. The contents of c() (here: c(25, 45)) dictate what values of the specified variable predictions() will make a prediction for; in this case, the values of 25 and 45. gini_2019 is a numeric variable, so we can provide the numeric values of interest to us without quotation marks. We can also use newdata = datagrid() with factor variables, although this is slightly more effort than simply using by = because it would require us to provide the name of each category in quotation marks.

We can naturally expand this so that we look at predictions for other values of this variable by including additional options in c():

predictions(model_continuous, 
            newdata = datagrid(gini_2019 = c(25, 30, 35, 40, 45))) |> 
  as_tibble() #see the Warning box below for why we include as_tibble()
# A tibble: 5 × 11
  rowid estimate std.error statistic   p.value s.value conf.low conf.high
  <int>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1     1    0.764    0.0451      16.9 3.19e- 64   211.     0.675     0.852
2     2    0.705    0.0316      22.3 2.20e-110   364.     0.643     0.766
3     3    0.645    0.0267      24.2 6.30e-129   426.     0.593     0.698
4     4    0.586    0.0345      17.0 1.04e- 64   213.     0.518     0.654
5     5    0.527    0.0493      10.7 1.10e- 26    86.2    0.430     0.623
# ℹ 3 more variables: v2x_polyarchy <dbl>, gini_2019 <dbl>, df <dbl>
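Typing out long runs of evenly spaced values gets tedious. The seq() function builds the same vector, so the call above could equivalently be written as newdata = datagrid(gini_2019 = seq(25, 45, by = 5)). A quick check that the two are identical:

```r
# seq() produces the same evenly spaced values as typing them out
seq(25, 45, by = 5)
# [1] 25 30 35 40 45
all(seq(25, 45, by = 5) == c(25, 30, 35, 40, 45))
# [1] TRUE
```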
Warning!

We ended our predictions() commands above with another line that specified as_tibble(). This last step is not necessary for the command to work. Here is an example from above but without that final line:

predictions(model_binary, by = "TYPEDEMO1984_factor")

 TYPEDEMO1984_factor Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
         Autocracies    0.418     0.0233 17.9   <0.001 235.6 0.372  0.463
         Democracies    0.690     0.0287 24.1   <0.001 423.5 0.634  0.746

Type: response

The key difference here is in how the output is displayed in R: the default output of predictions() in the Console shows different names for the columns (e.g., Estimate rather than estimate, 2.5% rather than conf.low) to make things look a bit neater (you can see the actual names of the columns in the dataframe that predictions() produces in the final row of the output). We used the as_tibble() function above so that you could see the underlying data and variable names. We will use the output of this command in later chapters to produce plots of our data. The main warning here is to make sure you use the correct variable names (e.g., estimate rather than Estimate).

5.3 Predicted Values (Multiple Linear Models)

We can naturally use these commands to obtain the residuals and predicted values from a model with multiple predictors. The process for finding the residuals of a multiple linear regression model is the same as above so we do not show it again. The process for calculating average predicted values is also highly similar, but with one important difference when the predictor is a factor variable.

Warning!

Newer versions of the marginaleffects package can generate an error when calculating predicted values based on a model that includes a factor variable created using the factorize() command from the rio package. The specific issue arises when the variable has a label for a category but there are no actual observations in the dataset falling under that category (e.g., -9 = “Missing”, but there are 0 observations with a score of -9/“Missing” in the datafile). In this situation, predictions() will try to calculate a predicted value for that category but generate an error because it does not have data to work with. This issue, and how to deal with it, is described in more detail in Section A.4.

5.3.1 Predictions for a Continuous Predictor Variable

Here are the results from our multiple linear regression model:

tidy(model_multiple)
# A tibble: 4 × 5
  term                           estimate std.error statistic      p.value
  <chr>                             <dbl>     <dbl>     <dbl>        <dbl>
1 (Intercept)                     0.187     0.0426      4.40  0.0000219   
2 cpi                             0.00636   0.00106     6.01  0.0000000155
3 v2caviol                       -0.00872   0.0123     -0.712 0.478       
4 TYPEDEMO1984_factorDemocracies  0.153     0.0349      4.37  0.0000239   

cpi is a measure of perceived corruption in a country. It can theoretically range from 0-100 (with higher values indicating less perceived corruption), although it only ranges between 12 and 88 among countries in our model.

predictions(model_multiple) |> #predictions for observations in the model; rows with missing values on model variables are dropped
  select(cpi) |>               #selects just the cpi variable
  summary()                    #find its summary statistics
      cpi       
 Min.   :12.00  
 1st Qu.:28.00  
 Median :39.50  
 Mean   :43.37  
 3rd Qu.:56.75  
 Max.   :88.00  

We might ask ourselves this question: what is the level of democracy we should expect to observe based on our model for countries with different levels of cpi but representative values on the other predictor variables? The coefficient for cpi indicates that the predicted value of democracy should increase with each one unit increase in cpi…but do democracy scores change by a little, or a lot, when we move across the range of the corruption variable? Here is how we would use predictions() to obtain the expected democracy score when cpi takes on values between 20 and 80 in 10-point increments in order to answer that question:

preds1 <- predictions(model_multiple, 
            newdata = datagrid(cpi = c(20,30,40,50,60,70,80))) |> 
  as_tibble()
preds1 <-

We save our results as a new data object so that we can use it again later. We name it preds1 here, while you would give it a name of your choosing.

predictions(model_multiple,

We then specify the name of our command and the name of the model we want to make predictions from.

newdata = datagrid(cpi = c(20,30,40,50,60,70,80))

We then specify what IV, and what specific values of this IV, we want to make predictions for. We use the newdata = datagrid() option because “cpi” is a continuous variable. We do not need to put the values in quotation marks because it is a numerical variable.

Here are the predictions.

preds1
# A tibble: 7 × 13
  rowid estimate std.error statistic   p.value s.value conf.low conf.high
  <dbl>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1     1    0.318    0.0278      11.4 2.41e- 30    98.4    0.264     0.373
2     2    0.382    0.0217      17.6 3.30e- 69   227.     0.339     0.424
3     3    0.445    0.0199      22.4 2.59e-111   367.     0.406     0.484
4     4    0.509    0.0233      21.9 6.10e-106   350.     0.463     0.555
5     5    0.573    0.0302      18.9 4.92e- 80   263.     0.513     0.632
6     6    0.636    0.0389      16.4 2.78e- 60   198.     0.560     0.713
7     7    0.700    0.0483      14.5 1.17e- 47   156.     0.605     0.795
# ℹ 5 more variables: TYPEDEMO1984_factor <fct>, v2caviol <dbl>,
#   v2x_polyarchy <dbl>, cpi <dbl>, df <dbl>
Output Explanation
  • estimate: The predicted value
  • std.error through conf.high: Uncertainty estimates relating to the prediction (e.g., its standard error, the p-value, and confidence intervals).
  • “5 more variables”: This row tells us what other columns are in the dataframe produced by predictions() (with the number 5 here likely to be different in examples where the model has a different number of predictor variables than in this example). The names refer to the variables in our model and would be different in your own examples. The columns for the IVs (here: v2caviol, TYPEDEMO1984_factor, and cpi) will list the values of these variables used in producing the prediction.

In the example above, predictions() has automatically held our two control variables (v2caviol and TYPEDEMO1984_factor) “constant” at the same value when calculating each predicted value. predictions() will hold continuous controls constant at their mean value, and factor variables at their mode, when newdata = datagrid() is used in this way. We can see this by looking at the columns for our independent variables in the output created by predictions():

preds1 |> 
  select(estimate, cpi, v2caviol, TYPEDEMO1984_factor)
# A tibble: 7 × 4
  estimate   cpi v2caviol TYPEDEMO1984_factor
     <dbl> <dbl>    <dbl> <fct>              
1    0.318    20   -0.394 Autocracies        
2    0.382    30   -0.394 Autocracies        
3    0.445    40   -0.394 Autocracies        
4    0.509    50   -0.394 Autocracies        
5    0.573    60   -0.394 Autocracies        
6    0.636    70   -0.394 Autocracies        
7    0.700    80   -0.394 Autocracies        
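The arithmetic behind “holding constant” can be verified by hand: plug the chosen value of the focal variable, the mean of each continuous control, and the modal category of each factor into the fitted equation. Here is a base-R sketch of that check with a stand-in model (mtcars; wt plays the role of the focal variable and hp is the control held at its mean):

```r
# Stand-in example: manual prediction vs. predict() with the control at its mean
m <- lm(mpg ~ wt + hp, data = mtcars)

# Prediction at wt = 3 with hp held constant at its mean, done two ways
by_hand <- coef(m)["(Intercept)"] + coef(m)["wt"] * 3 +
  coef(m)["hp"] * mean(mtcars$hp)
via_predict <- predict(m, newdata = data.frame(wt = 3, hp = mean(mtcars$hp)))

all.equal(unname(by_hand), unname(via_predict))
```

This is the same calculation predictions() performs for each row of the output above, with v2caviol fixed at its mean (-0.394) and TYPEDEMO1984_factor at its mode (Autocracies).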

5.3.2 Predictions for a Factor Predictor Variable

We can use the same procedure as above to obtain predicted values for each category of a binary or categorical factor variable. Recall that we do this by using by = rather than newdata = datagrid().4 However, we do need to take one additional step here that we did not have to perform in the previous example.

preds2 <- predictions(model_multiple, by= "TYPEDEMO1984_factor", 
                      newdata = "mean") |> 
  as_tibble()
by = "TYPEDEMO1984_factor"

This tells the command that we want predicted values for each category of our factor variable.

newdata = "mean")

This tells the command that we want to hold the other variables in the model constant at their mean (if a continuous variable) or mode (if a factor variable). This is done automatically in the earlier example, but must be specified here due to the use of the by = option.

Here are the predictions:

preds2
# A tibble: 2 × 9
  TYPEDEMO1984_factor estimate std.error statistic   p.value s.value conf.low
  <fct>                  <dbl>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>
1 Autocracies            0.465    0.0204      22.8 5.85e-115    379.    0.425
2 Democracies            0.617    0.0261      23.6 3.19e-123    407.    0.566
# ℹ 2 more variables: conf.high <dbl>, df <dbl>

5.3.3 Predictions for specific combinations of the predictor variables

The first set of predictions we made above were predictions for each observation used in fitting the model based on the specific values of the independent variables associated with the observation. The second set of predictions we made focused on the average value of the DV our model tells us we should expect to see if one of the IVs equals a specific value and the other IVs are held constant at their mean or mode. We can also use predictions() to calculate predicted values for specific hypothetical cases. In this example, for instance, we use predictions() to estimate the democracy score in a country that was a democracy in 1984, has a corruption score equal to the maximum value that we observe in our data (88), and has the minimum level of political violence observed in the data (-3.429). We do this by including all predictors in the newdata = datagrid() portion of the syntax. If we left one of the variables out, then it would be held constant at its mean or mode depending on what type of variable it is.

predictions(model_multiple, 
            newdata = datagrid(cpi = c(88), 
                               v2caviol = c(-3.429), 
                               TYPEDEMO1984_factor = c("Democracies"))) |> 
  as_tibble()
# A tibble: 1 × 13
  rowid estimate std.error statistic   p.value s.value conf.low conf.high
  <dbl>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
1     1    0.930    0.0392      23.7 1.89e-124    411.    0.853      1.01
# ℹ 5 more variables: v2x_polyarchy <dbl>, cpi <dbl>, v2caviol <dbl>,
#   TYPEDEMO1984_factor <fct>, df <dbl>
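As a sanity check, this prediction can be reproduced by hand from the (rounded) coefficients reported by tidy(model_multiple) above: the intercept, plus the cpi coefficient times 88, plus the v2caviol coefficient times -3.429, plus the coefficient for Democracies:

```r
# Manual prediction from the rounded coefficients in the tidy() output above
0.187 + 0.00636 * 88 + (-0.00872) * (-3.429) + 0.153
# approximately 0.930, matching the estimate from predictions()
```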

  1. We will introduce the augment() function from the broom package in a subsequent chapter as an alternative way of observing the residuals from a model. We use the predictions() command here because it is easier to create a dataframe that contains the model’s predictions, residuals, and all of the data from the original dataset with this command than it is with augment().↩︎

  2. The s-value is an attempt to translate the p-value into a measure that some feel is easier to interpret. Specifically, it answers the question: how many consecutive “heads” tosses would provide the same amount of evidence (or “surprise”) against the null hypothesis that the coin is fair? As an example, a p-value of 0.05 would have a corresponding s-value of 4.3 or so. We might then say that a p-value of 0.05 is about as surprising as flipping a fair coin four times and seeing it land on heads all four times. Would you be comfortable saying that the coin is weighted rather than fair based on that string of coin flips? In the context of the output produced by predictions() (and by the slopes() command that we will see in later chapters), higher s-values indicate that we should be increasingly surprised to see our results if the value of the thing we’re estimating is actually 0. This statistic is not all that useful for our predicted values but could be more useful for understanding how surprising a coefficient or “marginal effect” estimate happens to be. If you like, you can read a deep dive on what p-values are, some of the complications researchers run into when interpreting them, and a discussion of what s-values are and how they may help matters in this blog post. The s-value is not examinable, however.↩︎
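     For the curious, the conversion this footnote describes is simply s = -log2(p), i.e., the number of fair-coin heads in a row that would be equally surprising:

```r
# s-value for p = 0.05: about 4.3 heads in a row on a fair coin
-log2(0.05)
# [1] 4.321928
```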

  3. Russia has an NA value for estimate and residual_value because it has a missing value on the TYPEDEMO1984 variable and hence was not included in the regression model.↩︎

  4. We could technically use newdata = datagrid(), but we’d then need to type out the factor levels (e.g., newdata = datagrid(TYPEDEMO1984_factor = c("Autocracies", "Democracies"))). The by = option is thus a bit easier to use.↩︎