11 Predicted Probabilities
Our last chapter focused on interpreting logistic regression models via marginal effects: how the probability that Y = 1 changes, on average, when an independent variable increases by one unit. In this chapter we'll show you how to use the predictions() command from the marginaleffects package to obtain these underlying probabilities. We will discuss how to use this command to investigate three types of predicted probability:
- The predicted probability for each observation used in the model
- Average predicted probabilities for a specific predictor variable at different values of that variable
- Average predicted probabilities when holding the predictors at a specific combination of values
See Chapter 5 for the use of this command with linear regression models.
Here are the packages that we will use along with our data:
#Packages
library(sjPlot) #checking variable names, values, and labels
library(rio) #loading data
library(tidyverse) #data manipulation and plotting
library(marginaleffects) #calculating marginal effects & predictions
#Data
ESS9NL <- import("ESS9e03, Netherlands.sav")
We will examine the same model that we investigated in the last chapter - one where we predicted whether a survey respondent said that they had voted in the most recent election based on their gender, age, trust in politicians, and ideology. Here are the data management steps we took to prepare our data for analysis, as well as our model; these steps are described in more detail in the preceding chapters:
#Data Preparation
ESS9NL <- ESS9NL |>
#Factorize our IVs
mutate(gndr_F = factorize(gndr),
vote_F = factorize(vote)) |>
#Remove Not Eligible to Vote Category from vote
mutate(vote_F = na_if(vote_F, "Not eligible to vote")) |>
#Relevel our variables like we did last time
mutate(vote_F = relevel(vote_F, "No"),
gndr_F = relevel(gndr_F, "Female")) |>
#Drop unused categories of the factor variables
mutate(vote_F = droplevels(vote_F),
gndr_F = droplevels(gndr_F))
#Our model
Vote_model_mp <- glm(vote_F ~ gndr_F + agea + trstplt + lrscale,
data = ESS9NL, family = "binomial")
#Check the output
summary(Vote_model_mp)
Call:
glm(formula = vote_F ~ gndr_F + agea + trstplt + lrscale, family = "binomial",
data = ESS9NL)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.284194 0.380455 -0.747 0.455
gndr_FMale 0.043281 0.154201 0.281 0.779
agea 0.018349 0.004503 4.075 4.61e-05 ***
trstplt 0.195020 0.038706 5.039 4.69e-07 ***
lrscale 0.029257 0.039306 0.744 0.457
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1173.9 on 1424 degrees of freedom
Residual deviance: 1135.3 on 1420 degrees of freedom
(248 observations deleted due to missingness)
AIC: 1145.3
Number of Fisher Scoring iterations: 4
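Before turning to probabilities, recall that these coefficients are on the log-odds scale. A quick optional check (a sketch on our part, not a step in the workflow above) is to exponentiate them into odds ratios:

```r
#Coefficients are log-odds; exponentiating turns them into odds ratios.
#Values above 1 indicate higher odds of voting as the predictor increases.
exp(coef(Vote_model_mp))
#For instance, exp(0.195) is roughly 1.22 for trstplt: each additional
#point of trust multiplies the odds of voting by about 1.22.
```

Odds ratios are still not probabilities, which is why the rest of this chapter relies on predictions() instead.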
11.1 Predictions for Individual Observations
Our logistic model makes a prediction for each observation in the model: the probability that the dependent variable = 1 (here: that the person voted), based on the parameters of the model (the coefficients) and the observation's combination of values on the independent variables. We can use the predictions() function to obtain these estimates.
#Store the results as a new object
Vote_pred <- predictions(Vote_model_mp,
conf_level = 0.95,
newdata = ESS9NL)
#We can use tibble() to get a nice tabular overview
tibble(Vote_pred)
# A tibble: 1,673 × 581
rowid estimate p.value s.value conf.low conf.high name essround edition
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1 0.835 6.13e-35 114. 0.796 0.867 ESS9e03 9 3
2 2 0.910 1.71e-56 185. 0.884 0.931 ESS9e03 9 3
3 3 0.904 4.26e-45 147. 0.874 0.928 ESS9e03 9 3
4 4 NA NA NA NA NA ESS9e03 9 3
5 5 0.864 9.71e-39 126. 0.828 0.894 ESS9e03 9 3
6 6 0.912 7.81e-53 173. 0.884 0.933 ESS9e03 9 3
7 7 0.800 1.32e-12 39.5 0.731 0.854 ESS9e03 9 3
8 8 0.914 9.43e-31 99.7 0.877 0.941 ESS9e03 9 3
9 9 0.877 1.88e-47 155. 0.845 0.903 ESS9e03 9 3
10 10 0.944 2.64e-39 128. 0.917 0.963 ESS9e03 9 3
# ℹ 1,663 more rows
# ℹ 572 more variables: proddate <chr>, idno <dbl>, cntry <chr>, nwspol <dbl>,
# netusoft <dbl>, netustm <dbl>, ppltrst <dbl>, pplfair <dbl>, pplhlp <dbl>,
# polintr <dbl>, psppsgva <dbl>, actrolga <dbl>, psppipla <dbl>,
# cptppola <dbl>, trstprl <dbl>, trstlgl <dbl>, trstplc <dbl>, trstplt <dbl>,
# trstprt <dbl>, trstep <dbl>, trstun <dbl>, vote <dbl>, prtvtcat <dbl>,
# prtvtdbe <dbl>, prtvtdbg <dbl>, prtvtgch <dbl>, prtvtbcy <dbl>, …
Here is how to read this syntax:
- Vote_pred <-: Store the results in a new data object. The results should always be saved to a new dataset, which we give a name of our choosing.
- predictions(Vote_model_mp,: Apply the function predictions() to the model specified in brackets.
- conf_level = 0.95,: This is the default confidence interval. It can be safely omitted from the syntax if you want the 95% confidence interval. Alternatively, if you wanted to generate some other confidence interval (e.g., the 90% or 99%), you would include this and change the numeric value (e.g., conf_level = 0.99).
- newdata = ESS9NL): The inclusion of this bit of syntax makes predictions() copy over all other variables from the original dataset. This part can thus be omitted from the syntax if including the other variables is not needed.
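To see where these probabilities come from, it can help to reproduce one by hand. The sketch below (an illustration, not part of the required workflow; the profile values are our own choice) builds the linear predictor from the coefficients and applies the inverse logit, using base R's plogis():

```r
#Illustration: a predicted probability is the inverse logit of the
#linear predictor. Profile values (a 50-year-old man with middling
#trust and ideology scores) are chosen purely for demonstration.
b <- coef(Vote_model_mp)
xb <- b["(Intercept)"] + b["gndr_FMale"] * 1 + b["agea"] * 50 +
  b["trstplt"] * 5 + b["lrscale"] * 5
plogis(xb) #equivalent to exp(xb) / (1 + exp(xb))
```

This is exactly the calculation predictions() performs for every row of the data.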
11.2 Average Predicted Values for Specific IVs
We can also use the predictions() function to obtain the average probability that the dependent variable = 1 at specific values of an independent variable based on our model. For instance, we might want to see the expected probability of a person voting if they are 20 years old vs. if they are 60 years old to help communicate the meaning of our results. These types of predictions may then be presented in a figure of predicted probabilities as shown in Section 14.5.
We discuss in Section 10.1 a problem that can arise when using commands from the marginaleffects package with a factor variable created via factorize(). The specific problem arises when factorize() produces a factor variable with a category (or categories) containing zero observations. The same issue can arise when using predictions() (especially in the most up-to-date version of the package). This problem is likely quite rare in practice, but is nevertheless an annoying one. The solution is the same as discussed there: either use factor() to create the factor variable or droplevels() to remove the problem-creating categories (as we did when cleaning our data at the beginning of this chapter). See Section 10.1 and Section A.4 for further background on the problem and its solutions.
11.2.1 Continuous Predictor Variable
Here, we use predictions() to show us the predicted probability of voting based on the trust in politicians variable (trstplt). This variable is scaled from 0 to 10 in 1 point increments (missing value categories are shown here but coded as NA in our dataset):
ESS9NL |>
select(trstplt) |>
view_df()

| ID | Name | Label | Values | Value Labels |
|----|------|-------|--------|--------------|
| 1 | trstplt | Trust in politicians | 0-10; 77; 88; 99 | 0 = No trust at all; 10 = Complete trust; 77 = Refusal; 88 = Don't know; 99 = No answer |
table(ESS9NL$trstplt)
0 1 2 3 4 5 6 7 8 9 10
43 41 68 90 173 292 457 375 87 16 8
We will ask predictions() to calculate predicted values from 0 to 10 in 2-point increments in this example. An alternative would be to calculate probabilities for all increments of the variable, although this would produce an overload of output if the variable could take on many different values. An alternative in those scenarios would be to use the values at the minimum, 1st quartile, median, 3rd quartile, and maximum of the variable (with these values obtainable via the same process shown in Section 5.3.1).
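If you go the quartile route, base R's quantile() returns exactly those five values in one call (shown here for trstplt; na.rm = TRUE drops the missing values):

```r
#Minimum, 1st quartile, median, 3rd quartile, and maximum in one call
quantile(ESS9NL$trstplt, na.rm = TRUE)
#These values could then be supplied to datagrid(), e.g.:
#datagrid(trstplt = quantile(ESS9NL$trstplt, na.rm = TRUE))
```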
#Store the predictions as a new object
Pred_conts <- predictions(Vote_model_mp,
newdata = datagrid(trstplt = seq(from = 0, to = 10, by = 2)))
Here is how to read this syntax:
- newdata = datagrid(: All predictors in the model will be held at the mean/mode, except for those specified between brackets.
- trstplt = seq(from = 0, to = 10, by = 2)): We ask for predictions at a specific sequence (seq) of numbers: from defines the minimum, to the maximum, and by the increment. We could alternatively have written these numbers out (e.g., trstplt = c(0,2,4,6,8,10)) - this may be more or less labor for us depending on the scale of the variable.
Let’s take a look at the predictions:
tibble(Pred_conts)
- tibble() is used only to give you a glimpse of the underlying data that the command creates.
# A tibble: 6 × 12
rowid estimate p.value s.value conf.low conf.high agea gndr_F lrscale vote_F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <fct> <int> <fct>
1 1 0.699 9.50e- 5 13.4 0.603 0.780 51 Male 5 Yes
2 2 0.774 1.27e-15 49.5 0.717 0.822 51 Male 5 Yes
3 3 0.835 4.20e-46 151. 0.802 0.863 51 Male 5 Yes
4 4 0.882 4.23e-64 211. 0.855 0.904 51 Male 5 Yes
5 5 0.917 9.54e-48 156. 0.889 0.939 51 Male 5 Yes
6 6 0.942 4.17e-34 111. 0.912 0.962 51 Male 5 Yes
# ℹ 2 more variables: trstplt <dbl>, df <dbl>
The layout of this output is the same as we saw when making predictions from a linear regression model (see Chapter 5):
- The estimate column provides the predicted probability. We can multiply this value by 100 to express the probability as a percent. For example, the predicted probability of voting is 69.89% for someone with a score of 0 on the trstplt scale.
- The p.value through conf.high columns provide uncertainty estimates.
- We can then see columns for the other independent variables in the model (gndr_F, agea, lrscale). These columns tell us the value at which these variables have been held constant when making predictions from the model. The predictions() command will automatically hold continuous variables constant at their mean, and factor variables at their mode, when newdata = datagrid() is used in the manner shown above.
- The final two columns are trstplt, which shows the value of the trust in politicians measure used in making the prediction, and a column (not shown) for our DV (vote_F) that indicates which category is being predicted.
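Output like this is often easier to digest as a figure. Here is a minimal ggplot2 sketch (Section 14.5 covers plotting predictions in more detail; the axis labels are our own):

```r
#Plot the predicted probabilities with their confidence intervals
Pred_conts |>
  ggplot(aes(x = trstplt, y = estimate)) +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high)) +
  labs(x = "Trust in politicians",
       y = "Predicted probability of voting")
```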
11.2.2 Factor Predictor Variable
The code for categorical variables is slightly different as we use the by = statement. Here, we calculate the average predicted probability to vote for men and women, with all other predictors held at representative values (the mean for continuous or the mode for factor variables).
#Obtain predictions and store as new object
Pred_cat <- predictions(Vote_model_mp,
by = "gndr_F",
newdata = "mean")
#Call the object in a nice tabular view
tibble(Pred_cat)
# A tibble: 2 × 9
gndr_F estimate std.error statistic p.value s.value conf.low conf.high df
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female 0.855 0.0137 62.3 0 Inf 0.828 0.882 Inf
2 Male 0.860 0.0132 65.1 0 Inf 0.834 0.886 Inf
- by = "gndr_F": This tells the command that we want predicted values for each category of our factor variable.
- newdata = "mean": This tells the command to hold the other variables in the model constant at their mean (if a continuous variable) or mode (if a factor variable). This must be specified here due to the use of the by = option.
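As a rough cross-check (our own suggestion, not a step from the workflow above), you should get essentially the same two estimates by listing the factor's categories inside datagrid(), since that approach also holds the remaining predictors at their mean/mode:

```r
#Alternative route: specify the factor's categories in datagrid()
Pred_cat2 <- predictions(Vote_model_mp,
newdata = datagrid(gndr_F = c("Female", "Male")))
tibble(Pred_cat2)
```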
11.3 Predictions for Specific Combinations of the Predictor Variables
Finally, we can estimate what the predicted probability would be for an observation with specific values for each of the predictors in the model.
Here, we estimate the probability for a man (gndr_F), aged 33 (agea), with a score of 2 for trust in politicians (trstplt), and a score of 8 for left-right position (lrscale). To do so we need to specify the values for all variables between brackets after newdata = datagrid().
#Calculate and store
Pred_specific <- predictions(Vote_model_mp,
newdata = datagrid(gndr_F = c("Male"),
agea = c(33),
trstplt = c(2),
lrscale = c(8)))
Pred_specific
- gndr_F = c("Male"): We need to use parentheses with this variable because it is a (labelled) factor variable.
gndr_F agea trstplt lrscale Estimate Pr(>|z|) S 2.5 % 97.5 %
Male 33 2 8 0.729 <0.001 20.2 0.645 0.799
Type: invlink(link)
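Note that datagrid() crosses whatever values you give it, so the same call can return several profiles at once. A quick sketch (the low vs. high trust values here are our own illustrative choice):

```r
#Two profiles in one call: the same respondent at low vs. high trust
predictions(Vote_model_mp,
newdata = datagrid(gndr_F = c("Male"),
agea = c(33),
trstplt = c(2, 8),
lrscale = c(8)))
```

The output will contain one row per value of trstplt, making it easy to compare the two predicted probabilities directly.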