9 Logistic Regression & Odds Ratios

Logistic regression models are used to model binary outcome variables. We will use survey data from the Netherlands, collected as part of Round 9 of the European Social Survey, for our examples. The dataset is available in SPSS format (.sav) from the ESS website.

We begin, as always, by loading the libraries that we will use in our analyses as well as our data. Missing values are already indicated as 'NA' in this dataset. Per Section 1.1, we can use view_df() from the sjPlot library to inspect the variables included in the dataset.

#Packages
library(sjPlot)    #checking variable names, values, and labels
library(broom)     #for obtaining summaries of regression models
library(rio)       #loading data
library(tidyverse) #data manipulation and plotting

#Data
ESS9NL <- import("ESS9e03, Netherlands.sav")

#view_df example on subset of dataset
ESS9NL |>
  select(polintr, ppltrst) |>
  sjPlot::view_df()
| ID | Name | Label | Values | Value Labels |
|----|------|-------|--------|--------------|
| 1 | polintr | How interested in politics | 1, 2, 3, 4, 7, 8, 9 | Very interested; Quite interested; Hardly interested; Not at all interested; Refusal; Don't know; No answer |
| 2 | ppltrst | Most people can be trusted or you can't be too careful | 0-10, 77, 88, 99 | 0 = You can't be too careful; 10 = Most people can be trusted; Refusal; Don't know; No answer |
The view_df() output indicates that there are numeric codes associated with missing value categories (e.g., respondents who said “Don’t Know” would be given a score of 8 on polintr). However, those values are already converted to missing (NA) in our dataset as the tabulation below shows. If they were not, then we would need to take some additional data management steps and convert those values to missing values; see Section 4.2 of the Statistics I manual for a refresher on how to do this.
table(ESS9NL$polintr)
1 2 3 4
213 832 478 149
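If the reserved codes had still been present, the conversion step might look something like the sketch below. This is purely hypothetical for our dataset, where the conversion has already been done (it assumes the tidyverse is loaded, as above):

```r
#Hypothetical cleaning step: convert the reserved codes on polintr to NA
#(not needed for our dataset, where this has already been done)
ESS9NL <- ESS9NL |>
  mutate(polintr = ifelse(polintr %in% c(7, 8, 9), NA, polintr))
```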
Always double check assignment instructions and other sources of information about a dataset (e.g., its codebook or the dataset itself via something like view_df() or attributes()) to make sure you understand the data management steps you need to accomplish before an analysis.
9.1 Performing a Logistic Regression
9.1.1 Data Preparation
Our example will investigate the relationship between self-reported gender (gndr) and voter turnout (i.e., did the person report voting in an election or not; vote). Our goal here is to predict whether a person reported voting in the most recent election.
First, let us take a look at the variables so we can figure out if we need to take any preliminary data management steps:
#Variable attributes
ESS9NL |>
  select(gndr, vote) |>
  view_df()

| ID | Name | Label | Values | Value Labels |
|----|------|-------|--------|--------------|
| 1 | gndr | Gender | 1, 2, 9 | Male; Female; No answer |
| 2 | vote | Voted last national election | 1, 2, 3, 7, 8, 9 | Yes; No; Not eligible to vote; Refusal; Don't know; No answer |
#Tabulation
table(ESS9NL$gndr)
1 2
833 840
table(ESS9NL$vote)
1 2 3
1291 247 130
Our predictor variable has only two substantive categories, so we just need to factorize it before using it in the model. Our DV has three categories with observations in them (Yes, No, Not eligible), so we need to make it into a binary factor variable prior to analysis. We can do this by converting the "Not eligible" category to NA. Here is one way that we can accomplish these ends:
#Factorize variables
ESS9NL <- ESS9NL |>
  mutate(gndr_F = factorize(gndr),
         vote_F = factorize(vote))

#Drop the not eligible category
ESS9NL <- ESS9NL |>
  mutate(vote_F = na_if(vote_F, "Not eligible to vote"))

1. factorize() uses the lowest numbered category as the "reference" or "baseline" category: in these examples, male respondents (gndr = 1) and those who say they voted in the last election (vote = 1).
2. We are not creating a new variable when recoding here (we overwrite vote_F). This is not usually a good idea - a mistake would mean that we need to reload our data and re-run our data cleaning steps in order to fix it. It is generally a much better idea to create new variables when recoding/factorizing. We do otherwise here out of pure hubris, tempting fate on the assumption that our understanding of the data is flawless.
Let’s check our work with regard to the vote variable:

levels(ESS9NL$vote_F)

[1] "Yes"                  "No"                   "Not eligible to vote"
[4] "Refusal"              "Don't know"           "No answer"
table(ESS9NL$vote_F)
Yes No Not eligible to vote
1291 247 0
Refusal Don't know No answer
0 0 0
The vote variable is now a factor variable where the first (or base) level of the factor is "Yes", because factorize() uses the first numerical category as the reference group. This is a problem for us: we want to predict whether a person is in the "Yes, voted" category, and the regression command we will use below predicts whether a person is in the higher level of a factor. In other words, if we leave this variable alone, our model would predict whether a person did not vote. We thus need to relevel the variable to flip the order of the categories (see Section 2.1.1).1
#Relevel the variable
ESS9NL <- ESS9NL |>
  mutate(vote_F = relevel(vote_F, "No"))

#Let's check our work
levels(ESS9NL$vote_F)

[1] "No"                   "Yes"                  "Not eligible to vote"
[4] "Refusal"              "Don't know"           "No answer"

1. We use the relevel() command on the vote_F variable. We do not create a new variable here but overwrite the original one; you could also choose to make a new variable, which is usually a better idea when you are recoding/factorizing a variable. The category provided in quotation marks will become the reference category. It is important to note that this variable was factorized first, so we use the label "No" and not the numeric value for "No" originally stored in the dataset, which was 2.
Let’s check our work for the gndr variable:
table(ESS9NL$gndr_F)
Male Female No answer
833 840 0
levels(ESS9NL$gndr_F)

[1] "Male"      "Female"    "No answer"
“Male” has been used as the reference category. The two categories have a roughly equal number of observations, so the automatic behavior of factorize() is not an issue here. There is also a third level, "No answer", with 0 observations. This is fine for now - R will exclude this category from the analysis below.
We will change the reference group to “Female” as a further example of the relevel() syntax:
ESS9NL <- ESS9NL |>
mutate(gndr_F = relevel(gndr_F, "Female"))
#check your work!
levels(ESS9NL$gndr_F)

[1] "Female"    "Male"      "No answer"
The DV in a logistic regression should be a (factorized) binary variable. Make sure you are creating the factor variable such that the higher level of the variable is the category you are trying to predict. Otherwise, your interpretations might end up being wrong by accident.
9.1.2 Performing a Logistic Regression
Performing logistic regression in R is very similar to linear regression. However, instead of the lm() function, we rely on the glm() function, which stands for generalized linear model.
#Run the model
Vote_model <- glm(vote_F ~ gndr_F,
                  data = ESS9NL, family = "binomial")

1. Vote_model <-: We assign the results of our estimation to a new object.
2. glm(vote_F ~ gndr_F,: We perform glm with vote_F as our dependent variable, predicted (~) by our only independent variable here: gndr_F. If we want to add more variables, we connect them with a '+' sign.
3. data = ESS9NL,: We specify the dataset to be used.
4. family = "binomial"): We specify the family for the generalized linear model. For logistic regression this is "binomial". This part of the code remains unchanged. See the Common Errors appendix (Section A.3) for an error that could arise if you did not specify the model's family.
Let’s take a look at the output using the built in summary() command:
summary(Vote_model)
Call:
glm(formula = vote_F ~ gndr_F, family = "binomial", data = ESS9NL)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.56485 0.09535 16.412 <2e-16 ***
gndr_FMale 0.18359 0.13925 1.318 0.187
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1355.5 on 1537 degrees of freedom
Residual deviance: 1353.7 on 1536 degrees of freedom
(135 observations deleted due to missingness)
AIC: 1357.7
Number of Fisher Scoring iterations: 4
The structure of this output is highly similar to what we saw with the output of an lm() model.
- Call: The model being fit
- Deviance Residuals: In older versions of R, a block summarizing the model's residuals appears here; it is absent from the output above.
- Coefficients: This provides the coefficients from the model (Estimate) as well as the coefficient’s standard error (Std. Error), a test-statistic (z-value; the Z-Statistic as given by \(\frac{\textrm{Coefficient}}{\textrm{Std. Error}}\)), and the p-value for the test statistic (Pr(>|z|)). Symbols pertaining to statistical significance may be provided to the right of the p-value with a line indicating how to interpret these symbols provided at the bottom of the Coefficients output (“Signif. Codes:”).
- (Dispersion parameter…): This can be ignored.
- Area that begins with Null deviance: This area relates to the fit of the model, which will be discussed in a subsequent chapter.
- Number of Fisher Scoring Iterations: This can be ignored.
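As a quick arithmetic check on the Coefficients table, the z value column is simply the estimate divided by its standard error. Using the intercept row from the summary above:

```r
#z value = Estimate / Std. Error (intercept row of the summary above)
1.56485 / 0.09535   #approximately 16.41
```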
We can add multiple predictors to the model in a way similar to the linear regression syntax, by adding them with a + symbol. Here we add age (agea), trust in politicians (trstplt), and left-right ideology (lrscale). We did not need to take any data management steps with these variables because they are continuous variables and missing data is already coded as NA.
#Run the model
Vote_model_mp <- glm(vote_F ~ gndr_F + agea + trstplt + lrscale,
data = ESS9NL, family = "binomial")
#Check the output
summary(Vote_model_mp)
Call:
glm(formula = vote_F ~ gndr_F + agea + trstplt + lrscale, family = "binomial",
data = ESS9NL)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.284194 0.380455 -0.747 0.455
gndr_FMale 0.043281 0.154201 0.281 0.779
agea 0.018349 0.004503 4.075 4.61e-05 ***
trstplt 0.195020 0.038706 5.039 4.69e-07 ***
lrscale 0.029257 0.039306 0.744 0.457
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1173.9 on 1424 degrees of freedom
Residual deviance: 1135.3 on 1420 degrees of freedom
(248 observations deleted due to missingness)
AIC: 1145.3
Number of Fisher Scoring iterations: 4
Logistic regression coefficients are on the log of the odds scale. The coefficient for gndr_FMale in the Vote_model_mp model tells us the difference in the log of the odds of voting between male and female respondents, while the agea coefficient tells us how the log of the odds of voting changes with each one unit increase in respondent age.
You can use the direction (positive, negative) and statistical significance of a logistic regression coefficient to talk about the general relationship between the predictor variable and the DV. However, it is not really possible to directly and clearly communicate what a logistic coefficient tells us about the magnitude of the relationship between an IV and the DV because of this log of the odds scaling. You should instead focus on odds ratios (see below), average marginal effects (Chapter 10), or predicted probabilities (Chapter 11) to give more specific meaning to your discussion.
In this example:
- Voter turnout is more likely among men than women, but the difference is not statistically significant (p = 0.78) so we cannot rule out the possibility of no difference in voter turnout between the two groups.
- Voter turnout becomes more likely with age (i.e., older respondents are more likely to vote than younger ones) and this association is statistically significant (p < 0.001).
- Voter turnout becomes more likely as trust in politicians increases and this relationship is statistically significant (p < 0.001).
- Voter turnout grows more likely as we move from left to right on the ideology scale, but the effect is not statistically significant (p = 0.46) so we cannot rule out the possibility that a one unit change in ideology actually does not lead to any change in the chances of turning out to vote.
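Because the coefficients are on the log of the odds scale, exponentiating them translates the results to the odds scale covered in the next section. A quick sketch using base R:

```r
#Exponentiate the logistic coefficients to obtain odds ratios
#e.g., exp(0.018349) for agea is roughly 1.02
exp(coef(Vote_model_mp))
```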
9.2 Odds Ratios
Odds ratios are one way that we can translate logistic regression coefficients into something easier to interpret and communicate.
We can use the tidy() command from the broom package to toggle between output expressed in the log of the odds scale (our logistic coefficients from above) and odds ratios.
#tidy with logistic regression and confidence intervals
tidy(Vote_model_mp, conf.int = TRUE)

# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.284 0.380 -0.747 0.455 -1.03 0.463
2 gndr_FMale 0.0433 0.154 0.281 0.779 -0.259 0.346
3 agea 0.0183 0.00450 4.07 0.0000461 0.00958 0.0272
4 trstplt 0.195 0.0387 5.04 0.000000469 0.119 0.271
5 lrscale 0.0293 0.0393 0.744 0.457 -0.0480 0.106
#tidy with odds ratios and confidence intervals
tidy(Vote_model_mp, conf.int = TRUE, exponentiate = TRUE)

# A tibble: 5 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.753 0.380 -0.747 0.455 0.357 1.59
2 gndr_FMale 1.04 0.154 0.281 0.779 0.772 1.41
3 agea 1.02 0.00450 4.07 0.0000461 1.01 1.03
4 trstplt 1.22 0.0387 5.04 0.000000469 1.13 1.31
5 lrscale 1.03 0.0393 0.744 0.457 0.953 1.11
Here is how to read the syntax for the latter command:

1. tidy(Vote_model_mp: We apply the tidy() function to the model specified in the parentheses.
2. conf.int = TRUE: We ask for the confidence intervals for the logistic regression coefficients or odds ratios. We can write 'FALSE' or leave out this statement if confidence intervals are not needed.
3. exponentiate = TRUE): We ask for the exponentiated logistic regression coefficients, which are the odds ratios. We can shorten this to exp = TRUE and obtain the same results. We can write 'FALSE' or leave out this statement if we want the logistic regression coefficients.
There are three important things to keep in mind when interpreting odds ratios.
First, odds ratios tell us about the relative odds of seeing Y = 1 (e.g., seeing a person report turning out to vote). This is different than the coefficients from a logistic model which tell us about the log of those odds.
Second, we interpret odds ratios in relation to the number 1 rather than 0. Positive effects for X are seen when the odds ratio is greater than 1 (that is: an odds ratio > 1 tells you that it becomes more likely to see Y = 1 when the independent variable increases by one unit). Negative effects for X are seen when the odds ratio is smaller than 1 (i.e., Y = 1 becomes less likely to be observed when X increases by one unit). A confidence interval for an odds ratio that includes 1 in its range (as occurs for gndrMale and lrscale above) indicates a statistically insignificant relationship, while an odds ratio that does not include 1 in its range (as occurs for agea and trstplt) indicates a statistically significant relationship.
Third, we interpret odds ratios using multiplication language. For instance, our model indicates that the odds of voting are 1.04 times greater among male respondents than female respondents when holding the effects of age, ideology, and trust constant (although this difference is not statistically significant). Or: the odds of turning out to vote increase by 1.02 times for each one year increase in age (holding constant the other predictor variables).
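Since odds ratios are multiplicative, the implied ratio for a larger change in X is the one-unit odds ratio raised to that power: exp(k × b) = exp(b)^k. As an illustration with the agea coefficient from the model above:

```r
#Implied odds ratio for a 10-year increase in age
exp(10 * 0.018349)   #approximately 1.20
```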
We could alternatively use factor() and specify the order of the levels from the start. For instance, we could have done this: mutate(vote_binary = factor(vote, levels = c(2, 1), labels = c("Did not vote", "Voted"))). This would also avoid an issue we discuss in the next chapter and in the Common Errors appendix.↩︎