The most commonly used measures of central tendency are the mean, the median and the mode. To show how to calculate them in R, we will use the following three variables:
directions #nominal
[1] East West East North North East West West West East North
Levels: East North West
temperature #ordinal
[1] low high medium high low medium high
Levels: low < medium < high
exam_points #interval/ratio
[1] 2 7 3 4 2 0
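The output above shows the contents of the three variables but not how they were built. One way to recreate them is sketched below, assuming they are stored as an unordered factor, an ordered factor and a numeric vector (the exact construction is an assumption, inferred from the printed levels):

```r
# Nominal variable: an unordered factor (levels are sorted alphabetically)
directions <- factor(c("East", "West", "East", "North", "North", "East",
                       "West", "West", "West", "East", "North"))

# Ordinal variable: an ordered factor with an explicit level order
temperature <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)

# Interval/ratio variable: a plain numeric vector
exam_points <- c(2, 7, 3, 4, 2, 0)
```

Setting ordered = TRUE is what makes R print the levels as low < medium < high and allows functions such as min() and max() to work on temperature.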
7.1.1 Mode
The mode is the value that has the highest number of occurrences in a set of data. We can first have a look at the distribution of the values with the table() function:
table(directions)
directions
 East North  West
    4     3     4
Based on this, we can see that there are two modes (East and West), each occurring four times.
If we want to calculate the mode directly, we can use a package called DescTools, a collection of miscellaneous basic statistics functions for efficiently describing data. The package contains the Mode() function (notice the upper-case M).
We load the package as follows:
library(DescTools) # This loads DescTools
To calculate the mode, we write:
Mode(directions)
[1] East West
attr(,"freq")
[1] 4
Levels: East North West
The results show that East and West are the two modes; the freq attribute tells us that each of them occurs 4 times.
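If DescTools is not available, the same answer can be read off the frequency table in base R. The following is a minimal sketch; the helper names counts and modes are chosen here for illustration:

```r
directions <- factor(c("East", "West", "East", "North", "North", "East",
                       "West", "West", "West", "East", "North"))

# Count how often each level occurs
counts <- table(directions)

# Keep the names of all levels whose count equals the highest count
modes <- names(counts)[counts == max(counts)]
modes  # "East" "West"
```

Comparing against max(counts) rather than using which.max() keeps every tied value, which matters here because there are two modes.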
7.1.2 Median
The median is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle. To calculate it, we use the Median() command from the DescTools package (again, notice the upper case M). This can deal with factors:
Median(temperature)
[1] medium
Levels: low < medium < high
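To see what "the value in the middle" means for an ordinal variable, the calculation can be sketched by hand in base R. This is only an illustration for an odd number of observations, not how Median() is implemented; the names codes and middle are made up for this example:

```r
temperature <- factor(c("low", "high", "medium", "high", "low", "medium", "high"),
                      levels = c("low", "medium", "high"),
                      ordered = TRUE)

# Replace each level by its position (low = 1, medium = 2, high = 3) and sort
codes <- sort(as.integer(temperature))

# With an odd number of observations, the middle one is at position (n + 1) / 2
middle <- codes[(length(codes) + 1) / 2]

# Translate the position back into a level name
levels(temperature)[middle]  # "medium"
```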
7.1.3 Mean
The mean of a set of observations is calculated by adding up all the values and then dividing by the total number of values. Let us first calculate the sum of all values via sum().
sum(exam_points)
[1] 18
Next, we calculate the number of observations via length().
length(exam_points)
[1] 6
To calculate the mean we can simply combine these two:
sum(exam_points)/length(exam_points)
[1] 3
or we can simply use the mean() function:
mean(exam_points, na.rm = TRUE)
[1] 3
We add the last part, na.rm = TRUE, to tell R to omit any missing values from the calculation. There are no missing values here, but you will find them in most real data sets.
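To see what na.rm changes, consider a small hypothetical vector with one missing value (the vector points_with_gap is made up for illustration and is not part of the data above):

```r
# Hypothetical vector with one missing value (NA)
points_with_gap <- c(2, 7, NA, 4)

# Without na.rm, the missing value makes the whole result NA
mean_default <- mean(points_with_gap)
mean_default  # NA

# With na.rm = TRUE, the NA is dropped and the mean uses the remaining 3 values
mean_clean <- mean(points_with_gap, na.rm = TRUE)
mean_clean  # 4.333333
```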
7.2 Measures of dispersion
To find the minimum and maximum value of a vector or column, we can use the min() and max() functions.
Applied to the ordinal and interval/ratio variables that we have created:
min(temperature, na.rm = TRUE)
[1] low
Levels: low < medium < high
max(exam_points, na.rm = TRUE)
[1] 7
7.2.1 Range and interquartile range
range() returns a vector containing the minimum and maximum of all the given arguments.
range(x, na.rm = TRUE)
We can calculate the range of the ordinal and interval/ratio variables that we have created:
range(temperature, na.rm = TRUE)
[1] low high
Levels: low < medium < high
range(exam_points, na.rm = TRUE)
[1] 0 7
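range() returns the two end points rather than a single number. For a numeric variable, the width of the range can be obtained by taking the difference between them, for example with diff(). A minimal sketch (the name spread is chosen for illustration):

```r
exam_points <- c(2, 7, 3, 4, 2, 0)

# range() gives c(min, max); diff() subtracts the first from the second
spread <- diff(range(exam_points, na.rm = TRUE))
spread  # 7
```

This only makes sense for numeric data; for an ordinal variable such as temperature, the distance between levels is not defined.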
7.2.1.1 Interquartile range
The interquartile range is the difference between the third and the first quartile. To see both quartiles, we can use the summary() function:
summary(exam_points)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    2.00    2.50    3.00    3.75    7.00
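The interquartile range itself is the distance between the 1st and 3rd quartile shown by summary(). As a short sketch, it can be computed by hand from quantile() or directly with the base R function IQR() (the variable names are chosen for illustration):

```r
exam_points <- c(2, 7, 3, 4, 2, 0)

# First and third quartile (25th and 75th percentile)
quartiles <- quantile(exam_points, probs = c(0.25, 0.75), na.rm = TRUE)

# Interquartile range by hand: third quartile minus first quartile
iqr_by_hand <- unname(quartiles[2] - quartiles[1])

# The same value using the built-in function
iqr_builtin <- IQR(exam_points, na.rm = TRUE)
iqr_builtin  # 1.75
```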
7.2.2 Standard deviation and variance
You can calculate the standard deviation in R using the sd() function. By default, this returns the sample standard deviation (dividing by n - 1). If you want the population standard deviation instead, multiply the result by sqrt((n-1)/n), where n is the number of observations.
sd(exam_points, na.rm = TRUE)
[1] 2.366432
To calculate the variance, we can simply use the var() function:
var(exam_points, na.rm = TRUE)
[1] 5.6
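The adjustment for the population standard deviation mentioned above, and the link between sd() and var(), can be sketched as follows. The variable names are chosen for illustration, and the example assumes there are no missing values, so that length() gives the correct n:

```r
exam_points <- c(2, 7, 3, 4, 2, 0)
n <- length(exam_points)

# Sample standard deviation (divides by n - 1)
sample_sd <- sd(exam_points)

# Population standard deviation (divides by n): rescale the sample version
population_sd <- sample_sd * sqrt((n - 1) / n)
population_sd  # about 2.16

# The variance is the square of the standard deviation
sample_var <- var(exam_points)
```

If the data contain missing values, n should count only the non-missing observations, for example via sum(!is.na(exam_points)).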
7.3 Doing operations in data frames
As explained in ?sec-data-week1, you use the “$” symbol to call variables in data frames. Therefore, to apply any of the functions covered so far (mean(), table(), sd(), var(), etc.) to a specific variable in a data frame, you simply use this operator. If we want to calculate the mean and standard deviation of movie scores (‘Score’) in ‘deniro_data’, we would write:
mean(deniro_data$Score, na.rm = TRUE)
sd(deniro_data$Score, na.rm = TRUE)
As before, we add na.rm = TRUE to tell R to omit any missing values from the calculation.
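To try the “$” operator without the course data, the same pattern can be run on a small stand-in data frame. The data frame movies_demo and its values are hypothetical and made up for illustration; they are not taken from ‘deniro_data’:

```r
# Hypothetical stand-in data frame with one missing score
movies_demo <- data.frame(
  Title = c("Film A", "Film B", "Film C", "Film D"),
  Score = c(80, 45, NA, 91)
)

# data_frame$column extracts a single variable as a vector
demo_mean <- mean(movies_demo$Score, na.rm = TRUE)
demo_sd   <- sd(movies_demo$Score, na.rm = TRUE)

demo_mean  # 72
```

Because Score contains an NA here, leaving out na.rm = TRUE would return NA for both statistics.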
7.4 Descriptive statistics for an entire data frame
There are many summary statistics available in R. An easy way to get a basic overview of a data frame is the describe() function from the ‘psych’ package. If you need to install the packages, you can do so using install.packages("DescTools") or install.packages("psych") (not necessary on University PCs).
To use the describe() function from the ‘psych’ package, we write:
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:DescTools':
    AUC, ICC, SD
describe(deniro_data, na.rm = TRUE, skew = FALSE, ranges = TRUE)
       vars  n    mean    sd median  min  max range   se
Year      1 87 1995.98 12.94   1997 1968 2016    48 1.39
Score     2 87   58.20 28.07     65    4  100    96 3.01
Title*    3 87   44.00 25.26     44    1   87    86 2.71
na.rm = TRUE
This tells R to omit missing values.
skew = FALSE
This tells R whether it should calculate the skewness of the variable. I recommend leaving skewness out, but if you wish to include it, change this to skew = TRUE.
ranges = TRUE
This tells R to calculate the range. If you do not want this, you can set it to ranges = FALSE.
As you can see, this produces a relatively straightforward summary table.
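To make the columns of this table less of a black box, a similar overview can be assembled in base R. The sketch below is not how describe() is implemented; the function describe_sketch and the data frame movies_demo are hypothetical names and values made up for illustration. It assumes that se is the standard error of the mean, sd / sqrt(n), which is consistent with the output above:

```r
# Hypothetical stand-in data frame with numeric columns only
movies_demo <- data.frame(
  Year  = c(1990, 1995, 2001, 2010),
  Score = c(80, 45, 62, 91)
)

# Compute the same kind of statistics for one numeric vector
describe_sketch <- function(x) {
  x <- x[!is.na(x)]  # drop missing values first
  c(n      = length(x),
    mean   = mean(x),
    sd     = sd(x),
    median = median(x),
    min    = min(x),
    max    = max(x),
    range  = max(x) - min(x),
    se     = sd(x) / sqrt(length(x)))
}

# Apply it to every column; t() puts one variable per row
overview <- t(sapply(movies_demo, describe_sketch))
round(overview, 2)
```

Unlike describe(), this sketch only works for numeric columns; describe() also handles text columns such as Title by converting them to numbers, which it flags with an asterisk.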