7 Measures of central tendency and dispersion

7.1 Measures of central tendency

The most commonly used measures of central tendency are the mean, the median and the mode. To show how to calculate them in R, we will use the following three variables:

directions <- factor(c("East", "West", "East", "North", "North", "East", "West", "West", "West", "East", "North"))
temperature <- factor(c("low", "high", "medium", "high", "low", "medium", "high"), levels = c("low", "medium", "high"), ordered = TRUE)
exam_points <- c(2, 7, 3, 4, 2, 0)

directions #nominal

 [1] East  West  East  North North East  West  West  West  East  North
Levels: East North West

temperature #ordinal

[1] low    high   medium high   low    medium high  
Levels: low < medium < high

exam_points #interval/ratio

[1] 2 7 3 4 2 0

7.1.1 Mode

The mode is the value that occurs most often in a set of data. We can first look at the distribution of the values with the table() function:

table(directions)

directions
 East North  West 
    4     3     4

Based on this output, we can see that there are two modes (East and West), each occurring four times.

If we want to calculate the mode, we can use a package called DescTools. This is a collection of miscellaneous basic statistic functions for efficiently describing data. The package contains the Mode() (notice the upper case M) function.

We load the package as follows:

library(DescTools) # This loads DescTools

To calculate the mode, we write:

Mode(directions)

[1] East West
attr(,"freq")
[1] 4
Levels: East North West

The output shows that East and West are the two modes. The "freq" attribute indicates that each of them occurs 4 times.
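
If DescTools is not available, the modes can also be read off the frequency table with base R. The following is a minimal sketch, not the method that Mode() uses internally; the object names freq and modes are chosen here for illustration:

```r
directions <- factor(c("East", "West", "East", "North", "North", "East",
                       "West", "West", "West", "East", "North"))

# Count how often each value occurs
freq <- table(directions)

# Keep the names of all values whose count equals the highest count
modes <- names(freq)[as.vector(freq) == max(freq)]
modes
```

This returns the modes as a character vector, so ties such as East and West are both reported.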

7.1.2 Median

The median is the value that falls in the middle once the data are ordered from smallest to largest. To calculate it, we use the Median() function from the DescTools package (again, notice the upper case M). This function can also deal with ordered factors:

Median(temperature)

[1] medium
Levels: low < medium < high
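
Base R also has a median() function (lower case m) for numeric data. As a hedged sketch: with an even number of values, it returns the mean of the two middle values. For an ordered factor, one possible workaround (an assumption made here, not what Median() does internally) is to take the median of the underlying integer codes, which works cleanly when the number of observations is odd:

```r
exam_points <- c(2, 7, 3, 4, 2, 0)
temperature <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
                      levels = c("low", "medium", "high"), ordered = TRUE)

# Sorted values are 0 2 2 3 4 7, so the median is (2 + 3) / 2
median_points <- median(exam_points)
median_points   # 2.5

# Ordered factor: median of the integer codes (1 = low, 2 = medium, 3 = high)
middle_code <- median(as.integer(temperature))
median_temperature <- levels(temperature)[middle_code]
median_temperature   # "medium"
```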

7.1.3 Mean

The mean of a set of observations is calculated by adding up all the values and then dividing by the total number of values. Let us first calculate the sum of all values via sum().

sum(exam_points)

[1] 18

Next, we calculate the number of observations via length().

length(exam_points)

[1] 6

To calculate the mean we can simply combine these two:

sum(exam_points)/length(exam_points)

[1] 3

Alternatively, we can use the mean() function directly:

mean(exam_points, na.rm = TRUE)

[1] 3

na.rm = TRUE: This argument tells R to omit any missing values (NA) from the calculation. There are no missing values here, but most real data sets contain them.
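
To see what the argument does, consider a short sketch with a made-up vector (points_with_na is hypothetical and not part of the data used in this chapter):

```r
# Hypothetical vector with one missing value
points_with_na <- c(2, 7, NA, 4)

# Without na.rm = TRUE the result is NA
mean_default <- mean(points_with_na)
mean_default

# With na.rm = TRUE the missing value is dropped first: (2 + 7 + 4) / 3
mean_without_na <- mean(points_with_na, na.rm = TRUE)
mean_without_na
```

The same argument works in the same way for the other functions used in this chapter, such as min(), max(), range(), sd() and var().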

7.2 Measures of dispersion

To find the minimum and maximum value of a vector or column, we can use the min() and max() functions.

min()
max()

Applied to the ordinal and interval/ratio variables that we have created:

min(temperature, na.rm = TRUE)

[1] low
Levels: low < medium < high

max(exam_points, na.rm = TRUE)

[1] 7

7.2.1 Range and interquartile range

range() returns a vector containing the minimum and maximum of all the given arguments.

range(x, na.rm = TRUE)

Here, x stands for the vector or column of interest, and na.rm = TRUE again tells range() to omit any missing values from the calculation.

We can calculate the range of the ordinal and interval/ratio variables that we have created:

range(temperature, na.rm = TRUE)

[1] low  high
Levels: low < medium < high

range(exam_points, na.rm = TRUE)

[1] 0 7
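
range() returns two values rather than a single number. If the range is needed as a single figure (maximum minus minimum), one way to sketch this for the numeric variable is with diff(); the name range_width is chosen here for illustration:

```r
exam_points <- c(2, 7, 3, 4, 2, 0)

# range() gives c(0, 7); diff() subtracts the first value from the second
range_width <- diff(range(exam_points, na.rm = TRUE))
range_width   # 7
```

This only makes sense for numeric data; for an ordinal variable such as temperature, the difference between two categories is not defined.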

7.2.1.1 Interquartile range

To see the quartiles on which the interquartile range is based, we can use the summary() function:

summary(exam_points)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    2.50    3.00    3.75    7.00
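
From the output above, the interquartile range is the third quartile minus the first quartile: 3.75 - 2.00 = 1.75. As a brief sketch, base R's quantile() and IQR() functions return these values directly (the object names are chosen here for illustration):

```r
exam_points <- c(2, 7, 3, 4, 2, 0)

# First and third quartile (using R's default quantile method)
quartiles <- quantile(exam_points, probs = c(0.25, 0.75), na.rm = TRUE)
quartiles

# Interquartile range: third quartile minus first quartile
iqr_points <- IQR(exam_points, na.rm = TRUE)
iqr_points   # 1.75
```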

7.2.2 Standard deviation and variance

You can calculate the standard deviation in R using the sd() function. By default, this returns the sample standard deviation (which divides by n - 1). To obtain the population standard deviation instead, multiply the result by sqrt((n-1)/n), where n is the number of observations.

sd(exam_points, na.rm = TRUE)

[1] 2.366432
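
The adjustment for the population standard deviation can be sketched as follows. The names n, sample_sd, population_sd and check are chosen here for illustration, and the code assumes that the vector contains no missing values:

```r
exam_points <- c(2, 7, 3, 4, 2, 0)

n <- length(exam_points)
sample_sd <- sd(exam_points)                     # divides by n - 1
population_sd <- sample_sd * sqrt((n - 1) / n)   # rescaled to divide by n

# Check against the definition: square root of the mean squared deviation
check <- sqrt(mean((exam_points - mean(exam_points))^2))

population_sd
check
```

Both values are approximately 2.16, slightly smaller than the sample standard deviation of 2.366432.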

To calculate the variance, we can simply use the var() function:

var(exam_points, na.rm = TRUE)

[1] 5.6

7.3 Doing operations in data frames

As explained in ?sec-data-week1, you use the “$” symbol to access variables in data frames. Therefore, to apply any of the functions covered so far (mean(), table(), sd(), var(), etc.) to a variable in a data frame, you simply use this operator to select it. To calculate the mean and standard deviation of movie scores (‘Score’) in ‘deniro_data’, we would write:

library(rio)
deniro_data <- import(file = "deniro.csv")

mean(deniro_data$Score, na.rm = TRUE)

[1] 58.1954

sd(deniro_data$Score, na.rm = TRUE)

[1] 28.06754
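
If deniro.csv is not at hand, the same pattern can be tried on a small made-up data frame. The data frame films and its values below are hypothetical and only illustrate the $ operator:

```r
# Hypothetical data, not taken from deniro.csv
films <- data.frame(
  Title = c("Film A", "Film B", "Film C", "Film D"),
  Score = c(80, 45, NA, 61)
)

# Access the Score column with $ and pass it to the functions covered so far
mean_score <- mean(films$Score, na.rm = TRUE)   # (80 + 45 + 61) / 3 = 62
sd_score <- sd(films$Score, na.rm = TRUE)

mean_score
sd_score

# The same operator works for categorical columns
table(films$Title)
```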

7.4 Descriptive statistics for an entire data frame

There are many summary statistics available in R. An easy way to get a basic overview of a data frame is the describe() function from the ‘psych’ package. If you need to install the packages, you can do so using install.packages("DescTools") or install.packages("psych") (not necessary on University PCs).

To use the describe() function from the ‘psych’ package, we write:

library(psych)

Attaching package: 'psych'

The following objects are masked from 'package:DescTools':

    AUC, ICC, SD

describe(deniro_data, na.rm = TRUE, skew = FALSE, ranges = TRUE)

       vars  n    mean    sd median  min  max range   se
Year      1 87 1995.98 12.94   1997 1968 2016    48 1.39
Score     2 87   58.20 28.07     65    4  100    96 3.01
Title*    3 87   44.00 25.26     44    1   87    86 2.71

na.rm = TRUE: This argument tells R to omit missing values.
skew = FALSE: This argument tells R whether to calculate the skewness of each variable. I recommend leaving skewness out, but if you wish to include it, change the argument to skew = TRUE.
ranges = TRUE: This argument tells R to calculate the range. If you do not want this, you can set it to ranges = FALSE.

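As you can see, this produces a relatively straightforward summary table. Note that variables marked with an asterisk (here Title*) are categorical: describe() converts them to numeric codes before summarising them, so the statistics in that row should not be interpreted.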
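
The columns of this table can also be reproduced by hand, which shows what each one means. The following base R sketch uses a hypothetical numeric vector (scores) and is not how describe() is implemented. It assumes that se is the standard error of the mean, sd / sqrt(n), which is consistent with the output above (28.07 / sqrt(87) is about 3.01):

```r
# Hypothetical scores, not taken from deniro.csv
scores <- c(80, 45, NA, 61, 97)

valid <- scores[!is.na(scores)]   # drop missing values, like na.rm = TRUE
n <- length(valid)

overview <- data.frame(
  n      = n,
  mean   = mean(valid),
  sd     = sd(valid),
  median = median(valid),
  min    = min(valid),
  max    = max(valid),
  range  = max(valid) - min(valid),
  se     = sd(valid) / sqrt(n)    # standard error of the mean
)
overview
```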