Descriptive statistics in R
This article explains how to compute the main descriptive statistics in R and how to visualize them graphically. Descriptive statistics are brief coefficients that summarize a given data set, which can represent either an entire population or a sample of it. They are typically distinguished from inferential statistics: with descriptive statistics you simply describe what the data show. There are two general types of descriptive statistics: measures of central tendency and measures of variability (spread). Measures of central tendency describe the central position of the data and include the mode, median, and mean. Measures of variability include the range, quartiles, absolute deviation, variance, and standard deviation.
I use the iris dataset for all the examples in this article. It ships with R, so nothing needs to be imported. Let's load the dataset and look at it.
data <- iris # load the iris dataset under the name data
Print the head of the dataset and inspect its structure.
head(data) #head of dataset
str(data) #structure of the dataset
The dataset contains 150 observations and 5 variables, representing the length and width of the sepal and petal and the species of 150 flowers.
Minimum and Maximum
The minimum and maximum can be found using the min() and max() functions.
min(data$Sepal.Length) #minimum
Output : 4.3
max(data$Sepal.Length) #maximum
Output : 7.9
Range
The range can be computed by subtracting the minimum from the maximum.
max(data$Sepal.Length) - min(data$Sepal.Length) # range
Alternatively, you can write your own function to compute the range.
range2 <- function(x) {
  rng <- max(x) - min(x) # difference between the largest and smallest value
  return(rng)
}
range2(data$Sepal.Length)
Output : 3.6
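Base R also has a built-in range() function. Note that it returns the minimum and the maximum rather than their difference, so it has to be combined with diff() to get a single number. A minimal sketch (the variable names x, rng and width are only illustrative):

```r
# range() returns c(min, max), not the width of the interval
x <- iris$Sepal.Length
rng <- range(x)     # minimum and maximum as a length-2 vector
width <- diff(rng)  # maximum minus minimum
print(rng)
print(width)
```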
Mean
The mean can be computed with the mean() function.
mean(data$Sepal.Length)
Output : 5.84333333333333
Note: if there is at least one missing value in your dataset, use
mean(data$Sepal.Length, na.rm = TRUE)
to compute the mean with the NA values excluded.
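The iris dataset has no missing values, so the effect of na.rm is easier to see on a small made-up vector. The values below are purely illustrative and not taken from iris:

```r
# A small made-up vector with one missing value (illustrative data only)
x <- c(2, 4, NA, 6)
mean_with_na <- mean(x)                   # NA, because one value is missing
mean_without_na <- mean(x, na.rm = TRUE)  # mean of 2, 4 and 6
print(mean_with_na)
print(mean_without_na)
```

The same na.rm argument is accepted by min(), max(), median(), sd() and var().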
Median
The median can be computed with the median() function.
median(data$Sepal.Length)
Output : 5.8
Mode
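Base R has no built-in function to find the statistical mode of a variable (the mode() function returns the storage mode of an object instead). However, we can easily write our own function.
getmode <- function(x) {
  unique_x <- unique(x)                   # distinct values
  freq_x <- tabulate(match(x, unique_x))  # how often each distinct value occurs
  unique_x[which.max(freq_x)]             # first value with the highest count
}
getmode(data$Sepal.Length)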
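A function built on which.max() returns only one value even when several values share the highest frequency. If ties matter, one possible variant is sketched below. The helper name get_modes is hypothetical, and it assumes the input is a plain numeric or character vector:

```r
# Hypothetical helper: returns all values tied for the highest frequency
get_modes <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # every value reaching the maximum
  # table() stores the values as character names, so convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

get_modes(c(1, 2, 2, 3, 3))   # two values are tied here
get_modes(iris$Sepal.Length)
```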
First and third quartile
The first and third quartiles can be computed using the quantile() function by setting the second argument to 0.25 or 0.75.
quantile(data$Sepal.Length, 0.25) #first Quartile
Output : 25%: 5.1
quantile(data$Sepal.Length, 0.75) #third Quartile
Output : 75%: 6.4
Interquartile Range
The interquartile range (the difference between the third and first quartiles) can be computed with the IQR() function.
IQR(data$Sepal.Length)
Output : 1.3
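To make the link between the quartiles and the interquartile range concrete, the sketch below requests both quartiles in a single quantile() call and subtracts them. The variable names are illustrative:

```r
x <- iris$Sepal.Length
q <- quantile(x, probs = c(0.25, 0.75))  # both quartiles in one call
iqr_manual <- unname(q[2] - q[1])        # third quartile minus first quartile
iqr_builtin <- IQR(x)                    # same quantity from the built-in function
print(q)
print(c(manual = iqr_manual, builtin = iqr_builtin))
```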
Standard deviation and variance
The standard deviation and the variance are computed with the sd() and var() functions.
sd(data$Sepal.Length) # standard deviation
Output : 0.828066127977863
var(data$Sepal.Length) # variance
Output : 0.685693512304251
The standard deviation and the variance are computed as if the data represent a sample, that is, with a denominator of n - 1 rather than n.
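The sample formula can be checked by hand. The sketch below recomputes the variance with a denominator of n - 1 and then rescales it to the population version, which divides by n. The variable names are illustrative, and base R has no dedicated function for the population variance:

```r
x <- iris$Sepal.Length
n <- length(x)
var_sample <- sum((x - mean(x))^2) / (n - 1)  # same formula as var()
var_population <- var_sample * (n - 1) / n    # divide by n instead of n - 1
sd_population <- sqrt(var_population)
print(c(sample = var_sample, population = var_population))
```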
Summary
You can compute the minimum, 1st quartile, median, mean, 3rd quartile and maximum for all numeric variables of a dataset at once using summary().
summary(data)
If you need more descriptive statistics, use stat.desc() from the {pastecs} package.
library(pastecs) # install once with install.packages("pastecs")
stat.desc(data)
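Summaries are often more useful per group than for the whole dataset. One way to do this in base R, shown here as a sketch rather than the only option, is aggregate() with a formula:

```r
# Mean sepal length for each species
group_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(group_means)
```

Replacing mean with another function such as median or sd gives the corresponding statistic per species.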
Coefficient of variation
The coefficient of variation is the standard deviation divided by the mean.
sd(data$Sepal.Length) / mean(data$Sepal.Length)
Output : 0.14171125977944
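Because the coefficient of variation has no unit, it can be used to compare the relative spread of variables measured on different scales. A short sketch that applies the same ratio to the four numeric columns of iris:

```r
# Coefficient of variation for each numeric column (columns 1 to 4 of iris)
cv <- sapply(iris[1:4], function(x) sd(x) / mean(x))
print(round(cv, 3))
```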
Correlation
A correlation measures the relationship between two variables, that is, how they are linked to each other.
cor(data$Sepal.Length, data$Petal.Length)
Output : 0.871753775886583
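By default cor() computes the Pearson correlation, which is the covariance divided by the product of the two standard deviations. The sketch below checks this by hand and also shows the method argument, which switches to a rank-based coefficient:

```r
x <- iris$Sepal.Length
y <- iris$Petal.Length
r_manual <- cov(x, y) / (sd(x) * sd(y))       # Pearson formula written out
r_builtin <- cor(x, y)                        # default method is "pearson"
r_spearman <- cor(x, y, method = "spearman")  # correlation of the ranks
print(c(manual = r_manual, builtin = r_builtin, spearman = r_spearman))
```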
Correlation matrix
The correlation matrix presents the correlation coefficients in a slightly more readable way.
First we remove the Species column, because it is a categorical variable and correlations can only be computed between numeric variables.
library(dplyr)
df <- select(data, -Species) # keep only the four numeric columns
Now we visualize the correlation matrix using the corrplot package.
library(corrplot)
corrplot(cor(df),
  method = "number",
  type = "upper" # show only upper side
)
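If installing extra packages is not an option, the same coefficients can be obtained as a plain table with base R only. This is a sketch of one alternative; it selects the numeric columns with is.numeric instead of naming Species explicitly:

```r
# Keep only the numeric columns, then compute and round the correlation matrix
numeric_cols <- iris[sapply(iris, is.numeric)]
cor_matrix <- round(cor(numeric_cols), 2)
print(cor_matrix)
```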
Histogram
A histogram gives an idea of the distribution of a quantitative variable. To draw a histogram in R, use hist().
hist(data$Sepal.Length)
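The number of bars can be adjusted with the breaks argument. When a single number is given, R treats it as a suggestion and may choose a nearby number of bins with round limits. The sketch below first inspects the bins without plotting and then draws the histogram. The title and axis label are illustrative choices:

```r
# Inspect the bins without drawing anything
h <- hist(iris$Sepal.Length, breaks = 12, plot = FALSE)
print(h$breaks)  # bin limits actually used
print(h$counts)  # number of observations in each bin

# Draw the histogram with the same suggested number of bins
hist(iris$Sepal.Length, breaks = 12,
     main = "Sepal length", xlab = "Sepal length")
```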
Boxplot
A boxplot graphically represents the distribution of a quantitative variable by displaying five common location summaries (minimum, first quartile, median, third quartile and maximum), together with any observation classified as an outlier using the interquartile range (IQR) criterion.
The following call draws a boxplot of the sepal length.
boxplot(data$Sepal.Length)
A boxplot can also be used to compare the length of the sepal across the different species.
boxplot(data$Sepal.Length ~ data$Species)
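The numbers behind a boxplot can be retrieved without drawing it, using boxplot.stats(). With its default settings, points lying more than 1.5 times the box length beyond the box edges are reported as outliers. A short sketch:

```r
stats <- boxplot.stats(iris$Sepal.Length)
print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)    # observations flagged as outliers (empty if there are none)
```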
Scatterplot
Scatterplots make it possible to check whether there is a relationship between two quantitative variables.
Scatterplot of the length of the sepal and the length of the petal:
plot(data$Sepal.Length, data$Petal.Length)
With the ggplot2 package, scatterplots can be made more informative by differentiating the points according to a factor, in this case the species.
library(ggplot2)
ggplot(data) + aes(x = Sepal.Length, y = Petal.Length, colour = Species) + geom_point() + scale_color_hue()
Density plot
A density plot, like a histogram, represents the distribution of a numeric variable. The functions plot() and density() are used together to draw a density plot.
plot(density(data$Sepal.Length))
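Since both plots describe the same distribution, it can help to overlay them. The histogram must then be drawn on the density scale (freq = FALSE) so that its bars and the curve share the same vertical axis. A minimal sketch, with an illustrative title and axis label:

```r
x <- iris$Sepal.Length
dens <- density(x)  # kernel density estimate with default settings

# Histogram on the density scale, with the density curve drawn on top
hist(x, freq = FALSE, main = "Sepal length", xlab = "Sepal length")
lines(dens)
```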
This is how descriptive statistics can be computed in R. I hope this article helps you compute them for your own data.
Source : https://www.statsandr.com/blog/descriptive-statistics-in-r/