Descriptive statistics in R

5 min read · Aug 19, 2020

This article explains how to compute the main descriptive statistics in R and how to visualize them graphically. Descriptive statistics are brief coefficients that summarize a given data set, which can represent either an entire population or a sample of it. They are typically distinguished from inferential statistics: with descriptive statistics you are simply describing what the data shows. Two general types of descriptive statistics are used to describe data: measures of central tendency and measures of variability (spread). Measures of central tendency describe the central position of the data and include the mode, median, and mean. Measures of variability include the range, quartiles, absolute deviation, variance, and standard deviation.

I use the iris dataset throughout this article. This dataset is available by default in R. Let's load it first.

data <- iris  # copy the built-in iris dataset into a variable named data

Print the first rows of the dataset and look at its structure.

head(data) #head of dataset

str(data) #structure of the dataset

The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal and the species of 150 flowers.

Minimum and Maximum

The minimum and maximum can be found using the min() and max() functions.

min(data$Sepal.Length) #minimum

Output : 4.3

max(data$Sepal.Length) #maximum

Output : 7.9

Range

The range can be computed by subtracting the minimum from the maximum.

max(data$Sepal.Length) - min(data$Sepal.Length) #Range

Alternatively, you can write your own function to compute the range.

range2 <- function(x) {
 rng <- max(x) - min(x)  # difference between the largest and smallest value
 return(rng)
}
range2(data$Sepal.Length)

Output : 3.6
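
Base R also has a range() function, but it returns the minimum and maximum as a pair rather than their difference. A minimal sketch of how the two ideas fit together, using the built-in iris data directly (the variable names x, rng and width are only illustrative):

```r
x <- iris$Sepal.Length

# range() returns c(min, max), not the width of the interval
rng <- range(x)
print(rng)

# diff() on that pair gives max - min
width <- diff(rng)
print(width)
```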

Mean

The mean can be computed with the mean() function.

mean(data$Sepal.Length)

Output : 5.84333333333333

Note : if there is at least one missing value in your dataset, use mean(data$Sepal.Length, na.rm = TRUE) to compute the mean with the NA excluded.
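
To see why na.rm matters, here is a small sketch on a made-up vector (the values are hypothetical and not taken from iris, which has no missing values):

```r
# Hypothetical vector with one missing value, for illustration only
x <- c(5.1, 4.9, NA, 4.7)

m_with_na <- mean(x)                  # a single NA propagates to the result
m_without_na <- mean(x, na.rm = TRUE) # mean of the three remaining values

print(m_with_na)
print(m_without_na)
```

The same na.rm argument is accepted by median(), min(), max(), sd() and var().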

Median

The median can be computed thanks to the median() function.

median(data$Sepal.Length)

Output : 5.8

Mode

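R has no built-in function for the statistical mode of a variable (the base mode() function returns the storage type of an object instead). However, it is easy to write your own function.

getmode <- function(x) {
 unique_x <- unique(x)                    # distinct values
 freq_x <- tabulate(match(x, unique_x))   # how often each value occurs
 unique_x[which.max(freq_x)]              # first value with the highest count
}
getmode(data$Sepal.Length)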
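
A function built on which.max() reports only the first value that reaches the highest count, so ties are silently dropped. If that matters, one possible variant returns every tied value. This is only a sketch; get_modes is a hypothetical helper name, not a base R function:

```r
# Return all values that share the highest frequency (hypothetical helper)
get_modes <- function(x) {
  values <- unique(x)
  counts <- tabulate(match(x, values))
  values[counts == max(counts)]
}

tied <- get_modes(c(1, 2, 2, 3, 3))   # 2 and 3 both occur twice
print(tied)

sepal_modes <- get_modes(iris$Sepal.Length)
print(sepal_modes)
```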

First and third quartile

The first and third quartiles can be computed using the quantile() function and by setting the second argument to 0.25 or 0.75.

quantile(data$Sepal.Length, 0.25) #first Quartile

Output : 25%: 5.1

quantile(data$Sepal.Length, 0.75) #third Quartile

Output : 75%: 6.4

Interquartile Range

The interquartile range (the difference between the first and third quartile) can be computed with the IQR() function.

IQR(data$Sepal.Length)

Output : 1.3
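
As a quick check, quantile() accepts a vector of probabilities, so both quartiles can be requested in one call and compared with IQR(). A minimal sketch (the names q and iqr_manual are illustrative):

```r
x <- iris$Sepal.Length

# Several quantiles at once: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
print(q)

# The interquartile range is the third quartile minus the first
iqr_manual <- unname(q["75%"] - q["25%"])
print(all.equal(iqr_manual, IQR(x)))
```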

Standard deviation and variance

The standard deviation and the variance are computed with the sd() and var() functions.

sd(data$Sepal.Length) # standard deviation

Output : 0.828066127977863

var(data$Sepal.Length) # variance

Output : 0.685693512304251

Both sd() and var() treat the data as a sample, that is, they divide by n - 1 rather than by n.
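
The sample convention can be verified by hand. The sketch below recomputes the variance with both denominators; base R has no dedicated population-variance function, so pop_var is computed manually here as an illustration:

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: squared deviations divided by n - 1 (what var() computes)
sample_var <- sum((x - mean(x))^2) / (n - 1)

# Population variance: the same sum divided by n
pop_var <- sum((x - mean(x))^2) / n

print(all.equal(sample_var, var(x)))       # matches var()
print(all.equal(sqrt(sample_var), sd(x)))  # sd() is its square root
print(pop_var < sample_var)                # dividing by n gives a smaller value
```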

Summary

You can compute the minimum, 1st quartile, median, mean, 3rd quartile and the maximum for all numeric variables of a dataset at once using summary().

summary(data)

If you need more descriptive statistics, use stat.desc() from the {pastecs} package, which has to be installed and loaded first.

library(pastecs)
stat.desc(data)
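
If installing an extra package is not an option, a similar table can be assembled in base R. This is only a sketch of one possible approach; the choice of statistics and the names num_cols and stats are assumptions made for illustration:

```r
# Keep only the numeric columns (drops the Species factor)
num_cols <- iris[sapply(iris, is.numeric)]

# One column per variable, one row per statistic
stats <- sapply(num_cols, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), min = min(x), max = max(x))
})

print(round(stats, 2))
```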

Coefficient of variation

The coefficient of variation is the standard deviation divided by the mean.

sd(data$Sepal.Length) / mean(data$Sepal.Length)

Output : 0.14171125977944
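
Because the coefficient of variation is unitless, it is handy for comparing the spread of variables measured on different scales. A small sketch that wraps the formula in a helper (cv is a hypothetical name, not a base R function) and applies it to all four measurements:

```r
# Coefficient of variation: standard deviation relative to the mean
cv <- function(x) sd(x) / mean(x)

# Apply it to the four numeric columns of iris
cv_all <- sapply(iris[1:4], cv)

# Express as percentages for easier reading
print(round(100 * cv_all, 1))
```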

Correlation

A correlation measures the relationship between two variables, that is, how they are linked to each other.

cor(data$Sepal.Length, data$Petal.Length)

Output : 0.871753775886583
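
By default cor() computes the Pearson coefficient, which is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch checking that by hand, plus the rank-based Spearman alternative available through the method argument:

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length

# Pearson correlation from its definition
r_manual <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(r_manual, cor(x, y)))

# Spearman correlation works on ranks and is less sensitive to outliers
r_spearman <- cor(x, y, method = "spearman")
print(r_spearman)
```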

Correlation matrix

The correlation matrix presents the correlation coefficients in a slightly more readable way.

First we remove the Species column, because correlations can only be computed between numeric variables and Species is a factor.

library(dplyr)
df <- select(data, -Species)

Now we visualize the correlation matrix using the corrplot package.

library(corrplot)
corrplot(cor(df),
 method = "number",
 type = "upper" # show only upper side
)
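
The same numbers can be inspected without any extra package. The following sketch builds the matrix in base R only; it is an alternative to the dplyr and corrplot route, not a replacement for the plot (the names num and cm are illustrative):

```r
# Drop the Species factor using base R indexing
num <- iris[, names(iris) != "Species"]

# cor() on a data frame returns the full correlation matrix
cm <- cor(num)

print(round(cm, 2))
```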

Histogram

A histogram gives an idea of the distribution of a quantitative variable. To draw a histogram in R, use hist().

hist(data$Sepal.Length)
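
hist() takes optional arguments for the number of bins and the labels, and it also returns the binning it used. A small sketch with illustrative settings (the number of breaks, title and colour are arbitrary choices):

```r
x <- iris$Sepal.Length

# breaks is a suggestion for the number of bins; R may adjust it
h <- hist(x,
          breaks = 10,
          main = "Sepal length",
          xlab = "Sepal length (cm)",
          col = "lightblue")

# The returned object holds the bin edges and the count in each bin
print(h$breaks)
print(sum(h$counts))
```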

Boxplot

A boxplot graphically represents the distribution of a quantitative variable by displaying five common location summaries (minimum, first quartile, median, third quartile and maximum) together with any outliers, which are identified using the interquartile range (IQR).

The call below draws a boxplot of the sepal length.

boxplot(data$Sepal.Length)

The next boxplot compares the length of the sepal across the different species.

boxplot(data$Sepal.Length ~ data$Species)
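
The numbers behind a boxplot can be retrieved with boxplot.stats(), which returns the five values that are drawn and any points flagged as outliers. A minimal sketch for the sepal length; note that boxplot.stats() uses hinges, which can differ slightly from the quartiles returned by quantile() on other data:

```r
x <- iris$Sepal.Length

bs <- boxplot.stats(x)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# Points beyond 1.5 times the box height from the hinges
print(bs$out)
```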

Scatterplot

Scatterplots make it possible to check whether there is a relationship between two quantitative variables.

The following scatterplot shows the length of the sepal against the length of the petal.

plot(data$Sepal.Length, data$Petal.Length)

With the ggplot2 package, scatterplots become more informative when the points are differentiated according to a factor, in this case the species.

library(ggplot2)
ggplot(data) + aes(x = Sepal.Length, y = Petal.Length, colour = Species) + geom_point() + scale_color_hue()

Density plot

A density plot is similar to a histogram: it represents the distribution of a numeric variable, but as a smooth curve. The functions plot() and density() are used together to draw a density plot.
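
A minimal sketch of that combination for the sepal length, using the default kernel and bandwidth (the title and axis label are illustrative choices):

```r
x <- iris$Sepal.Length

# density() computes a kernel density estimate on a grid of points
d <- density(x)

# plot() has a method for density objects and draws the curve
plot(d, main = "Density of sepal length", xlab = "Sepal length (cm)")
```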