11 Predicted Probabilities
Our last chapter focused on interpreting logistic regression models via marginal effects: how the probability that Y = 1 changes, on average, when an independent variable increases by one unit. In this chapter we'll show you how to use the predictions() command from the marginaleffects package to obtain these underlying probabilities. We will discuss how to use this command to investigate three types of predicted probability:
- The predicted probability for each observation used in the model
- Average predicted probabilities for a specific predictor variable at different values of that variable
- Average predicted probabilities when holding the predictors at a specific combination of values
See Chapter 5 for the use of this command with linear regression models.
Here are the packages that we will use along with our data:
#Packages
library(sjPlot) #checking variable names, values, and labels
library(rio) #loading data
library(tidyverse) #data manipulation and plotting
library(marginaleffects) #calculating marginal effects & predictions
#Data
ESS9NL <- import("ESS9e03, Netherlands.sav")
We will examine the same model that we investigated in the last chapter - one where we predicted whether a survey respondent said that they had voted in the most recent election based on their gender, age, trust in politicians, and ideology. Here are the data management steps we took to prepare our data for analysis, as well as our model; these steps are described in more detail in the preceding chapters:
#Data Preparation
ESS9NL <- ESS9NL |>
#Factorize our IVs
mutate(gndr_F = factorize(gndr),
vote_F = factorize(vote)) |>
#Remove Not Eligible to Vote Category from vote
mutate(vote_F = na_if(vote_F, "Not eligible to vote")) |>
#Relevel our variables like we did last time
mutate(vote_F = relevel(vote_F, "No"),
gndr_F = relevel(gndr_F, "Female")) |>
#Drop unused categories of the factor variables
mutate(vote_F = droplevels(vote_F),
gndr_F = droplevels(gndr_F))
#Our model
Vote_model_mp <- glm(vote_F ~ gndr_F + agea + trstplt + lrscale,
data = ESS9NL, family = "binomial")
#Check the output
summary(Vote_model_mp)
Call:
glm(formula = vote_F ~ gndr_F + agea + trstplt + lrscale, family = "binomial",
data = ESS9NL)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.284194 0.380455 -0.747 0.455
gndr_FMale 0.043281 0.154201 0.281 0.779
agea 0.018349 0.004503 4.075 4.61e-05 ***
trstplt 0.195020 0.038706 5.039 4.69e-07 ***
lrscale 0.029257 0.039306 0.744 0.457
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1173.9 on 1424 degrees of freedom
Residual deviance: 1135.3 on 1420 degrees of freedom
(248 observations deleted due to missingness)
AIC: 1145.3
Number of Fisher Scoring iterations: 4
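Before turning to probabilities, recall that these coefficients are on the log-odds scale. A quick optional check (a sketch on our part, not a step in the workflow above) is to exponentiate them into odds ratios:

```r
#Coefficients are log-odds; exponentiating turns them into odds ratios.
#Values above 1 indicate higher odds of voting as the predictor increases.
exp(coef(Vote_model_mp))
#For instance, exp(0.195) is roughly 1.22 for trstplt: each additional
#point of trust multiplies the odds of voting by about 1.22.
```

Odds ratios are still not probabilities, which is why the rest of this chapter relies on predictions() instead.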
11.1 Predictions for Individual Observations
Our logistic model makes a prediction for each observation in the model: the probability that the dependent variable = 1 (here: that the person voted), based on the parameters of the model (the coefficients) and the observation's combination of values on the independent variables. We can use the predictions() function to obtain these estimates.
#Store the results as a new object
Vote_pred <- predictions(Vote_model_mp,
conf_level = 0.95,
newdata = ESS9NL)
#We can use tibble() to get a nice tabular overview
tibble(Vote_pred)
# A tibble: 1,673 × 581
rowid estimate p.value s.value conf.low conf.high name essround edition
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1 0.835 6.13e-35 114. 0.796 0.867 ESS9e03 9 3
2 2 0.910 1.71e-56 185. 0.884 0.931 ESS9e03 9 3
3 3 0.904 4.26e-45 147. 0.874 0.928 ESS9e03 9 3
4 4 NA NA NA NA NA ESS9e03 9 3
5 5 0.864 9.71e-39 126. 0.828 0.894 ESS9e03 9 3
6 6 0.912 7.81e-53 173. 0.884 0.933 ESS9e03 9 3
7 7 0.800 1.32e-12 39.5 0.731 0.854 ESS9e03 9 3
8 8 0.914 9.43e-31 99.7 0.877 0.941 ESS9e03 9 3
9 9 0.877 1.88e-47 155. 0.845 0.903 ESS9e03 9 3
10 10 0.944 2.64e-39 128. 0.917 0.963 ESS9e03 9 3
# ℹ 1,663 more rows
# ℹ 572 more variables: proddate <chr>, idno <dbl>, cntry <chr>, nwspol <dbl>,
# netusoft <dbl>, netustm <dbl>, ppltrst <dbl>, pplfair <dbl>, pplhlp <dbl>,
# polintr <dbl>, psppsgva <dbl>, actrolga <dbl>, psppipla <dbl>,
# cptppola <dbl>, trstprl <dbl>, trstlgl <dbl>, trstplc <dbl>, trstplt <dbl>,
# trstprt <dbl>, trstep <dbl>, trstun <dbl>, vote <dbl>, prtvtcat <dbl>,
# prtvtdbe <dbl>, prtvtdbg <dbl>, prtvtgch <dbl>, prtvtbcy <dbl>, …
Here is how to read this syntax:
- Vote_pred <-: Store the results in a new data object. The results should always be saved to a new dataset, which we give a name of our choosing.
- predictions(Vote_model_mp,: Apply the function predictions() to the model specified in brackets.
- conf_level = 0.95,: This is the default confidence interval. It can be safely omitted from the syntax if you want the 95% confidence interval. Alternatively, if you wanted to generate some other confidence interval (e.g., the 90% or 99%), you would include this and change the numeric value (e.g., conf_level = 0.99).
- newdata = ESS9NL): The inclusion of this bit of syntax makes predictions() copy over all other variables from the original dataset. This part can thus be omitted from the syntax if including the other variables is not needed.
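To see where these probabilities come from, it can help to reproduce one by hand. The sketch below (an illustration, not part of the required workflow; the profile values are our own choice) builds the linear predictor from the coefficients and applies the inverse logit, using base R's plogis():

```r
#Illustration: a predicted probability is the inverse logit of the
#linear predictor. Profile values (a 50-year-old man with middling
#trust and ideology scores) are chosen purely for demonstration.
b <- coef(Vote_model_mp)
xb <- b["(Intercept)"] + b["gndr_FMale"] * 1 + b["agea"] * 50 +
  b["trstplt"] * 5 + b["lrscale"] * 5
plogis(xb) #equivalent to exp(xb) / (1 + exp(xb))
```

This is exactly the calculation predictions() performs for every row of the data.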
11.2 Average Predicted Values for Specific IVs
We can also use the predictions() function to obtain the average probability that the dependent variable = 1 at specific values of an independent variable based on our model. For instance, we might want to see the expected probability of a person voting if they are 20 years old vs. if they are 60 years old to help communicate the meaning of our results. These types of predictions may then be presented in a figure of predicted probabilities as shown in Section 14.5.
We discuss in Section 10.1 a problem that can arise when using commands from the marginaleffects package with a factor variable created via factorize(). The specific problem arises when factorize() produces a factor variable with a category (or categories) containing zero observations. The same issue can arise when using predictions() (especially in the most up-to-date version of the package). This problem is likely quite rare in practice, but is nevertheless an annoying one. The solution is the same as discussed there: either use factor() to create the factor variable or droplevels() to remove the problem-creating categories (as we did when cleaning our data at the beginning of this chapter). See Section 10.1 and Section A.4 for further background on the problem and its solutions.
11.2.1 Continuous Predictor Variable
Here, we use predictions() to show us the predicted probability of voting based on the trust in politicians variable (trstplt). This variable is scaled from 0 to 10 in 1 point increments (missing value categories are shown here but coded as NA in our dataset):
ESS9NL |>
select(trstplt) |>
view_df()

| ID | Name | Label | Values | Value Labels |
|----|------|-------|--------|--------------|
| 1 | trstplt | Trust in politicians | 0-10; 77; 88; 99 | 0 = No trust at all; 10 = Complete trust; 77 = Refusal; 88 = Don't know; 99 = No answer |
table(ESS9NL$trstplt)
0 1 2 3 4 5 6 7 8 9 10
43 41 68 90 173 292 457 375 87 16 8
We will ask predictions() to calculate predicted values from 0 to 10 in 2-point increments in this example. An alternative would be to calculate probabilities for all increments of the variable, although this would produce an overload of output if the variable could take on many different values. An alternative in those scenarios would be to use the values at the minimum, 1st quartile, median, 3rd quartile, and maximum of the variable (with these values obtainable via the same process shown in Section 5.3.1).
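If you go the quartile route, base R's quantile() returns exactly those five values in one call (shown here for trstplt; na.rm = TRUE drops the missing values):

```r
#Minimum, 1st quartile, median, 3rd quartile, and maximum in one call
quantile(ESS9NL$trstplt, na.rm = TRUE)
#These values could then be supplied to datagrid(), e.g.:
#datagrid(trstplt = quantile(ESS9NL$trstplt, na.rm = TRUE))
```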
#Store the predictions as a new object
Pred_conts <- predictions(Vote_model_mp,
newdata = datagrid(trstplt = seq(from = 0, to = 10, by = 2)))
Here is how to read this syntax:
- newdata = datagrid(: All predictors in the model will be held at the mean/mode, except for those specified between brackets.
- trstplt = seq(from = 0, to = 10, by = 2)): We ask for predictions at a specific sequence (seq) of numbers: from defines the minimum, to the maximum, and by the increment. We could alternatively have written these numbers out (e.g., trstplt = c(0,2,4,6,8,10)) - this may be more or less labor for us depending on the scale of the variable.
Let’s take a look at the predictions:
tibble(Pred_conts)
- tibble() is used only to give you a glimpse of the underlying data that the command creates.
# A tibble: 6 × 12
rowid estimate p.value s.value conf.low conf.high agea gndr_F lrscale vote_F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <fct> <int> <fct>
1 1 0.699 9.50e- 5 13.4 0.603 0.780 51 Male 5 Yes
2 2 0.774 1.27e-15 49.5 0.717 0.822 51 Male 5 Yes
3 3 0.835 4.20e-46 151. 0.802 0.863 51 Male 5 Yes
4 4 0.882 4.23e-64 211. 0.855 0.904 51 Male 5 Yes
5 5 0.917 9.54e-48 156. 0.889 0.939 51 Male 5 Yes
6 6 0.942 4.17e-34 111. 0.912 0.962 51 Male 5 Yes
# ℹ 2 more variables: trstplt <dbl>, df <dbl>
The layout of this output is the same as we saw when making predictions from a linear regression model (see Chapter 5):
- The estimate column provides the predicted probability. We can multiply this value by 100 to express the probability as a percent. For example, the predicted probability of voting is 69.89% for someone with a score of 0 on the trstplt scale.
- The p.value through conf.high columns provide uncertainty estimates.
- We can then see columns for the other independent variables in the model (gndr_F, agea, lrscale). These columns tell us the value at which these variables have been held constant when making predictions from the model. The predictions() command will automatically hold continuous variables constant at their mean, and factor variables at their mode, when newdata = datagrid() is used in the manner shown above.
- The final two columns are trstplt, which shows the value of the trust in politicians measure used in making the prediction, and a column (not shown) for our DV (vote_F) that indicates which category is being predicted.
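Output like this is often easier to digest as a figure. Here is a minimal ggplot2 sketch (Section 14.5 covers plotting predictions in more detail; the axis labels are our own):

```r
#Plot the predicted probabilities with their confidence intervals
Pred_conts |>
  ggplot(aes(x = trstplt, y = estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  labs(x = "Trust in politicians",
       y = "Predicted probability of voting")
```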
11.2.2 Factor Predictor Variable
The code for categorical variables is slightly different as we use the by = statement. Here, we calculate the average predicted probability to vote for men and women, with all other predictors held at representative values (the mean for continuous or the mode for factor variables).
#Obtain predictions and store as new object
Pred_cat <- predictions(Vote_model_mp,
by = "gndr_F",
newdata = "mean")
#Call the object in a nice tabular view
tibble(Pred_cat)
# A tibble: 2 × 9
gndr_F estimate std.error statistic p.value s.value conf.low conf.high df
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 0.855 0.0137 62.3 0 Inf 0.828 0.882 Inf
2 Male 0.860 0.0132 65.1 0 Inf 0.834 0.886 Inf
- by = "gndr_F": This tells the command that we want predicted values for each category of our factor variable.
- newdata = "mean": This tells the command to hold the other variables in the model constant at their mean (if a continuous variable) or mode (if a factor variable). This must be specified here due to the use of the by = option.
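As a rough cross-check (our own suggestion, not a step from the workflow above), you should get essentially the same two estimates by listing the factor's categories inside datagrid(), since that approach also holds the remaining predictors at their mean/mode:

```r
#Alternative route: specify the factor's categories in datagrid()
Pred_cat2 <- predictions(Vote_model_mp,
newdata = datagrid(gndr_F = c("Female", "Male")))
tibble(Pred_cat2)
```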
11.3 Predictions for Specific Combinations of the Predictor Variables
Finally, we can estimate what the predicted probability would be for an observation with specific values for each of the predictors in the model.
Here, we estimate the probability for a man (gndr_F), aged 33 (agea), with a score of 2 for trust in politicians (trstplt), and a score of 8 for left-right position (lrscale). To do so we need to specify the values for all variables between brackets after newdata = datagrid().
#Calculate and store
Pred_specific <- predictions(Vote_model_mp,
newdata = datagrid(gndr_F = c("Male"),
agea = c(33),
trstplt = c(2),
lrscale = c(8)))
Pred_specific
- gndr_F = c("Male"): We need to use parentheses with this variable because it is a (labelled) factor variable.
gndr_F agea trstplt lrscale Estimate Pr(>|z|) S 2.5 % 97.5 %
Male 33 2 8 0.729 <0.001 20.2 0.645 0.799
Type: invlink(link)
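Note that datagrid() crosses whatever values you give it, so the same call can return several profiles at once. A quick sketch (the low vs. high trust values here are our own illustrative choice):

```r
#Two profiles in one call: the same respondent at low vs. high trust
predictions(Vote_model_mp,
newdata = datagrid(gndr_F = c("Male"),
agea = c(33),
trstplt = c(2, 8),
lrscale = c(8)))
```

The output will contain one row per value of trstplt, making it easy to compare the two predicted probabilities directly.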