2 Bivariate Regression with Binary & Categorical Predictors

Our last chapter concluded by showing how to perform a linear regression with a continuous (interval/ratio) independent variable. In this chapter we’ll see how to use a linear regression model with a binary or categorical independent variable.

As always, we begin our R script by loading relevant libraries and by loading our data. Note that these libraries are already installed on university computers but must be loaded prior to use.

#Packages
library(rio)          #loading data
library(tidyverse)    #data manipulation and plotting

#Import our data
demdata <- import("demdata.rds") |> 
  as_tibble()

1: You do not always need as_tibble() as shown here. We use it because tibbles make larger datasets easier to display. See Statistics I, 2.1.

2.1 Data Preparation: Converting to factor variable

One advantage of linear regression models is that we can predict a dependent variable with different types of predictor variables, including binary and categorical independent variables.

In order to use binary and categorical variables as predictors in a regression model we need to include them as dichotomous or “dummy” variables. One dummy variable is used when the variable is binary while multiple dummies are included when the variable is categorical.¹ R automatically creates dummies for factor variables, so we will always convert our binary/categorical variables to factor variables before including them in a regression.²
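To make the idea of automatically created dummies concrete, here is a minimal sketch that uses base R's model.matrix() to display the dummy columns R would build for a factor. The toy data frame and its region variable are hypothetical and are not part of demdata.

```r
# Hypothetical toy data (not from demdata): one factor with four categories
toy <- data.frame(
  region = factor(c("North", "West", "South", "East", "North", "South"))
)

# model.matrix() returns the design matrix R builds for a formula
dummies <- model.matrix(~ region, data = toy)

# Four categories produce an intercept column plus three (k - 1) dummy columns;
# the first level ("East", alphabetically) is the reference group and gets no column
colnames(dummies)
dummies
```

Each row has a 1 in at most one dummy column. Rows that belong to the reference group have 0 in every dummy column.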

We can see how to do this in the following example which focuses on the variable TYPEDEMO1984. This binary variable records whether a country was considered an autocracy or a democracy in the year 1984. Autocratic countries in 1984 have a score of 1 while democratic countries have a score of 2.

#Information about what type of variable this is: 
class(demdata$TYPEDEMO1984)

[1] "numeric"

#Simple tabulation
table(demdata$TYPEDEMO1984)


 1  2 
86 57

We will first convert this variable into a factor variable before including it in our regression model. We can do this either with the built-in factor() command (see Statistics I, 1.6.3) or with the factorize() function from the rio package. factorize() is a quick way of creating a factor variable when the variable has value labels associated with it in the dataset. factor() must be used when the variable does not have value labels, or when it is a character variable, because factorize() will not produce the right type of outcome in those situations. See Section A.2 for more on when to use which command and for an example that uses factor() to create a factor variable.
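As a rough illustration of the factor() route, the sketch below converts an unlabelled numeric variable by supplying the levels and labels by hand. The vector regime_numeric is a hypothetical stand-in and is not a variable in demdata.

```r
# Hypothetical unlabelled numeric variable (1 = autocracy, 2 = democracy)
regime_numeric <- c(1, 2, 2, 1, NA, 2)

# levels = the values stored in the data; labels = the names we want to attach
regime_factor <- factor(regime_numeric,
                        levels = c(1, 2),
                        labels = c("Autocracies", "Democracies"))

# Double check your work!
levels(regime_factor)
table(regime_factor, useNA = "ifany")
```

The order given in levels determines which category comes first and therefore which one acts as the reference group.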

You can investigate whether a variable has value labels associated with it in two ways. First, you can use the view_df() function from the sjPlot library to obtain an overall view of the variables in the dataset as shown in Section 1.1. Second, you can use the built-in attributes() command to investigate a specific variable as below. We are checking whether the variable has any attributes at all and, specifically, whether there is any information in the “$labels” area.

attributes(demdata$TYPEDEMO1984)

$label
[1] "Type of democracy, 1984"

$format.stata
[1] "%10.0g"

$labels
Autocracies Democracies 
          1           2

Our variable does indeed have value labels. Countries with a value of 1 were “Autocracies” in 1984, while countries with a value of 2 were “Democracies”. We can thus proceed to create our factor variable using factorize() and then check our work to make sure the results are what we expected.
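If you prefer a quick TRUE/FALSE answer over reading the full attributes() output, a small helper along the following lines could be used. This is only a sketch: has_value_labels() is a hypothetical function written for this illustration, and the two vectors are toy data rather than variables from demdata.

```r
# Hypothetical toy variables: one with a "labels" attribute, one without
labelled_var <- c(1, 2, 2, 1)
attr(labelled_var, "labels") <- c(Autocracies = 1, Democracies = 2)
unlabelled_var <- c(1, 2, 2, 1)

# Hypothetical helper: TRUE if the variable carries a "labels" attribute
# exact = TRUE stops R from partially matching other attribute names
has_value_labels <- function(x) {
  !is.null(attr(x, "labels", exact = TRUE))
}

has_value_labels(labelled_var)
has_value_labels(unlabelled_var)
```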

#Step 1: Convert to factor variable
demdata <- demdata |> 
  mutate(TYPEDEMO1984_factor = factorize(TYPEDEMO1984)) 

#Double check your work! 
levels(demdata$TYPEDEMO1984_factor) # to check levels of factor variable

[1] "Autocracies" "Democracies"

table(demdata$TYPEDEMO1984_factor)  # simple tabulation


Autocracies Democracies 
         86          57

Here is how to read the factorize() syntax:

factorize(: This is the name of the function
TYPEDEMO1984: We then provide the name of the variable that we want to convert into a factor. The lowest numbered level on the variable will be used as the first level and, consequently, as the reference group when including this variable in our regression model. Here, autocracies will be treated as if they have a value of 0, and democracies as if they have a value of 1, in the regression model.
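One way to see this 0/1 treatment directly is the base R contrasts() function, which shows how the levels of a factor will be coded in a regression. The sketch below uses a small hypothetical factor rather than the demdata variable.

```r
# Hypothetical toy factor with the same two categories
regime <- factor(c("Autocracies", "Democracies", "Democracies", "Autocracies"))

# Rows are the factor levels; the column is the dummy that enters the model
# The reference group (first level) is coded 0, the other category 1
contrasts(regime)
```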

The same procedure is used for categorical variables with more than two categories. For instance, the variable Typeregime2006 provides information as to whether a country was considered a liberal democracy (=1), an electoral democracy (=2), or an autocracy (=3) in the year 2006. This variable also has value labels, so we can use factorize() to convert it into a factor variable:

#convert variable to a factor variable
demdata <- demdata |> 
  mutate(Typeregime2006_factor = factorize(Typeregime2006))

#Double check your work!
levels(demdata$Typeregime2006_factor)

[1] "Liberal democracy"   "Electoral democracy" "Autocracy"

table(demdata$Typeregime2006_factor)


  Liberal democracy Electoral democracy           Autocracy 
                 71                  53                  41

Warning!

We strongly recommend creating new variables when recoding or factorizing an existing variable in a dataset (e.g., mutate(TYPEDEMO1984_factor = factorize(TYPEDEMO1984))). Creating a new variable when recoding/factorizing makes it easier to find and correct any mistakes we may inadvertently make when performing data management operations.

2.1.1 Relevelling

factorize() will use the lowest numeric value as the “reference” group when making a factor variable. factor() will do the same unless you tell it otherwise by explicitly specifying factor levels; see the example in Section A.2. We can change which category of a factor variable is used as the reference group via the relevel() function. The example below does this for the Typeregime2006 categorical variable by changing the reference group from “Liberal democracy” to “Electoral democracy”.

demdata <- demdata |> 
  mutate(Typeregime2006_factor_relevel = relevel(Typeregime2006_factor, "Electoral democracy"))

relevel(: The name of the function
Typeregime2006_factor,: This tells R that we want to work with the variable named Typeregime2006_factor, the factor version we created above.
"Electoral democracy"): We then provide the name of the category we want as the reference group. We put “Electoral democracy” in quotation marks because the variable we are relevelling is already a factor variable, so we must use the label for the category, spelled and capitalized exactly as it appears in levels(), rather than its underlying numeric value.

It is always a good idea to check your efforts at data cleaning before performing analysis so that you can catch mistakes early on:

#Checking our work
levels(demdata$Typeregime2006_factor)

[1] "Liberal democracy"   "Electoral democracy" "Autocracy"

levels(demdata$Typeregime2006_factor_relevel)

[1] "Electoral democracy" "Liberal democracy"   "Autocracy"
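To preview why the reference group matters, the following sketch fits the same model twice on a small hypothetical dataset, once with the original factor and once with a relevelled copy. The data frame toy and its variables are made up for illustration and are not taken from demdata.

```r
# Hypothetical toy data: a score and a three-category regime variable
toy <- data.frame(
  score  = c(0.8, 0.7, 0.9, 0.4, 0.5, 0.2, 0.3),
  regime = factor(c("Liberal", "Liberal", "Liberal",
                    "Electoral", "Electoral",
                    "Autocracy", "Autocracy"),
                  levels = c("Liberal", "Electoral", "Autocracy"))
)

# New variable with "Electoral" as the reference group
toy$regime_relevel <- relevel(toy$regime, ref = "Electoral")

m_original <- lm(score ~ regime, data = toy)
m_relevel  <- lm(score ~ regime_relevel, data = toy)

# The intercept and coefficients change because the comparison group changes
coef(m_original)
coef(m_relevel)

# The predictions are the same: relevelling does not change the model's fit
all.equal(fitted(m_original), fitted(m_relevel))
```

Relevelling changes which comparisons the coefficients report, not the underlying data or the predicted values.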

2.2 Including in a Model and Interpreting Coefficients

We include binary/categorical variables in a regression model in the same way that we did for a continuous variable. For instance:

# Using the binary variable as a predictor: 
model_binary <- lm(v2x_polyarchy ~ TYPEDEMO1984_factor, data=demdata)
summary(model_binary)


Call:
lm(formula = v2x_polyarchy ~ TYPEDEMO1984_factor, data = demdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51025 -0.15007 -0.00857  0.17309  0.48543 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     0.41757    0.02333   17.90  < 2e-16 ***
TYPEDEMO1984_factorDemocracies  0.27268    0.03695    7.38 1.25e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2163 on 141 degrees of freedom
  (36 observations deleted due to missingness)
Multiple R-squared:  0.2786,    Adjusted R-squared:  0.2735 
F-statistic: 54.47 on 1 and 141 DF,  p-value: 1.247e-11

# Using the categorical variable as a predictor: 
model_categorical <- lm(v2x_polyarchy ~ Typeregime2006_factor, data=demdata)
summary(model_categorical)


Call:
lm(formula = v2x_polyarchy ~ Typeregime2006_factor, data = demdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40104 -0.09898  0.00196  0.10773  0.47773 

Coefficients:
                                         Estimate Std. Error t value Pr(>|t|)
(Intercept)                               0.75404    0.01734   43.48   <2e-16
Typeregime2006_factorElectoral democracy -0.32106    0.02653  -12.10   <2e-16
Typeregime2006_factorAutocracy           -0.50577    0.02866  -17.64   <2e-16
                                            
(Intercept)                              ***
Typeregime2006_factorElectoral democracy ***
Typeregime2006_factorAutocracy           ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1461 on 162 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared:  0.6789,    Adjusted R-squared:  0.6749 
F-statistic: 171.2 on 2 and 162 DF,  p-value: < 2.2e-16

Output Explanation

The output from a model with a binary/categorical variable is the same as when the predictor is continuous with one exception: R will format the variable names differently in the Coefficients area depending on whether the variable is a factor variable or not. If the variable is not a factor variable, then R will simply show the variable name (e.g., “gini_2019”). If the variable is a factor, then the display is formatted as follows: “variableCategory”. For instance: “TYPEDEMO1984_factorDemocracies” or “Typeregime2006_factorAutocracy”.

There are some subtle differences in how we interpret the coefficients of a model that (only) includes a factor variable as a predictor variable:

Interpretation

The Estimate column provides the coefficient values from our regression model.

The “(Intercept)” row gives you the coefficient for the Intercept: the average value of the DV that we expect to observe, based on this model, when all of the included IVs equal 0. If the only predictor variable in the model is a factor variable, then the (Intercept) value equals the mean of the dependent variable among observations in the first level of the factor variable, which we often call the “reference group”.

Here, for instance, is the mean value of v2x_polyarchy among countries with a value of “Autocracies” on the TYPEDEMO1984_factor variable; it is identical to the (Intercept) value above.

demdata |> 
  filter(TYPEDEMO1984_factor == "Autocracies") |>
  summarize(mean_democracy = mean(v2x_polyarchy, na.rm=T)) |> 
  as.data.frame()

1: filter() keeps only the observations with the value “Autocracies” on our factor variable; all other observations are removed
2: as.data.frame() is used to force R to show you all the digits of the mean value so you can compare it to the Intercept value reported in the summary output

  mean_democracy
1      0.4175698

The coefficient for a continuous predictor variable is interpreted as the slope of a line (e.g., how much does Y change, on average, with each one unit change in X?). The coefficient(s) for a factor variable, on the other hand, are best discussed as telling us the difference in the mean value of the DV between the category named in the output and the reference group. The coefficient for “TYPEDEMO1984_factorDemocracies”, for instance, is 0.2726758.³ We would interpret this as telling us that the average value of the dependent variable among countries with a value of “Democracies” on TYPEDEMO1984_factor is 0.2726758 scale points greater than the average value among countries with a value of “Autocracies” (our reference group) on that variable.

We can again see this by looking at the underlying average values:

# The averages
demdata |> 
  group_by(TYPEDEMO1984_factor) |> 
  summarize(mean_democracy = mean(v2x_polyarchy, na.rm = T)) |>
  as.data.frame()

  TYPEDEMO1984_factor mean_democracy
1         Autocracies      0.4175698
2         Democracies      0.6902456
3                <NA>      0.4638056

# Avg. in Democracies - Avg. in Autocracies
 0.6902456 - 0.4175698

[1] 0.2726758

The same goes for models with categorical factor variables. The “(Intercept)” value in model_categorical above tells us the average value of the DV among observations in the reference group of the factor variable (here: “Liberal democracy”), while the coefficients for the factor variable tell us the difference from this average score. For instance, the average 2020 democracy score among countries coded as an “Electoral democracy” in 2006 is 0.32 scale points lower than the average 2020 democracy score among the “Liberal democracy” reference group countries.

demdata |> 
  group_by(Typeregime2006_factor) |> 
  summarize(mean_democracy = mean(v2x_polyarchy, na.rm=T)) |> 
  as.data.frame()

  Typeregime2006_factor mean_democracy
1     Liberal democracy      0.7540423
2   Electoral democracy      0.4329811
3             Autocracy      0.2482683
4                  <NA>      0.3777143

# Avg. in Elec Democracy - Avg. in Lib Democracy
0.4329811 - 0.7540423

[1] -0.3210612

# Avg. in Autocracy - Avg. in Lib Democracy
0.2482683 - 0.7540423

[1] -0.505774
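The same logic can be run in reverse: adding each coefficient to the intercept should give back the mean of that category. The sketch below checks this on a small hypothetical dataset (the toy data frame is made up and is not demdata) using only base R.

```r
# Hypothetical toy data: a score and a three-category regime variable
toy <- data.frame(
  score  = c(0.8, 0.7, 0.9, 0.4, 0.5, 0.2, 0.3),
  regime = factor(c("Liberal", "Liberal", "Liberal",
                    "Electoral", "Electoral",
                    "Autocracy", "Autocracy"),
                  levels = c("Liberal", "Electoral", "Autocracy"))
)

m <- lm(score ~ regime, data = toy)
b <- coef(m)

# Reference group mean = intercept; other group means = intercept + coefficient
group_means_from_model <- c(
  Liberal   = unname(b["(Intercept)"]),
  Electoral = unname(b["(Intercept)"] + b["regimeElectoral"]),
  Autocracy = unname(b["(Intercept)"] + b["regimeAutocracy"])
)

# Observed group means, computed directly from the data
group_means_observed <- tapply(toy$score, toy$regime, mean)

group_means_from_model
group_means_observed
```

The two sets of means should agree up to rounding, which is another way of seeing that the model simply re-expresses the group averages.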

See Section 8.4 for how to report these values in our assignments and in formal papers.

¹ Specifically, we include k-1 dummies, where k = the number of categories. If the categorical variable has four categories (for instance: North, West, South, and East), then we include three dummy variables in the model. If it has two categories (i.e., a binary variable), then we include one dummy variable in the model.
² In some circumstances the variable may already be stored as a factor variable in our dataset, enabling us to skip this first step. However, we may need to change what category is used as the reference category, which we can do as shown in a subsequent sub-section of this document.
³ We would normally round this to 2 or 3 digits, but we show you the whole coefficient here so you can see how it compares to the difference in means.