iris <- iris
2 Basic Statistics
2.1 Basic stats
The following are simple examples of how to compute basic statistics in R. We start by importing the data. Let’s use the free dataset iris, which ships with R.
First, explore the structure of the dataset.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We have 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four are numeric, while the last is a factor with three levels: setosa, versicolor, and virginica. The dataset contains 150 observations.
Let’s take a look at the first 10 observations in the dataset.
head(iris, 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Now, let’s take a look at the first 5 observations of each species.
by(iris,iris["Species"],head,n=5)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
2.1.1 Mean and Median
Now, let’s find the means and the medians of the four numeric variables. The most direct way is to call mean() and median() on each column.
mean(iris$Sepal.Length)
[1] 5.843333
mean(iris$Sepal.Width)
[1] 3.057333
mean(iris$Petal.Length)
[1] 3.758
mean(iris$Petal.Width)
[1] 1.199333
median(iris$Sepal.Length)
[1] 5.8
median(iris$Sepal.Width)
[1] 3
median(iris$Petal.Length)
[1] 4.35
median(iris$Petal.Width)
[1] 1.3
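Calling each function column by column gets repetitive. As a minimal sketch, assuming the four numeric measurements sit in columns 1 to 4 (as the str() output above shows), sapply() can apply the same function to all of them at once. The names num_means and num_medians are only illustrative.

```r
# Columns 1:4 hold the numeric measurements; column 5 (Species) is a factor.
num_means <- sapply(iris[, 1:4], mean)
num_medians <- sapply(iris[, 1:4], median)

# Each result is a named numeric vector with one entry per column.
round(num_means, 3)
num_medians
```

The result is a named vector, so a single value can be picked out by name, for example num_means["Sepal.Length"].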
An easier way is summary(), which reports these statistics for every column at once.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Note that the output above also shows the number of observations of each species. To find the total number of observations in the dataset, you can use the following line. This count will be useful later.
length(iris$Species)
[1] 150
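Counting the length of one column works, but it depends on picking a column. A slightly more direct sketch uses the base R functions nrow(), ncol(), and dim(), which describe the data frame as a whole. The variable names are only illustrative.

```r
# Number of rows (observations) and columns (variables) in the data frame.
n_obs <- nrow(iris)
n_vars <- ncol(iris)

# dim() returns both at once: rows first, then columns.
iris_dim <- dim(iris)
n_obs
n_vars
iris_dim
```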
If you want the same thing by group, do as follows:
by(iris, iris$Species, summary)
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
Median :5.000 Median :3.400 Median :1.500 Median :0.200
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
Species
setosa :50
versicolor: 0
virginica : 0
------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
Median :6.500 Median :3.000 Median :5.550 Median :2.000
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
Species
setosa : 0
versicolor: 0
virginica :50
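The grouped summary is thorough but long. If only one statistic per group is needed, a compact alternative is aggregate() with a formula. This is a sketch, not the only way to do it; the name group_means is hypothetical.

```r
# Mean of every numeric variable within each species.
# The formula ". ~ Species" means "all other columns, grouped by Species".
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
group_means
```

The result is a regular data frame with one row per species, so it can be printed, filtered, or exported like any other table. Replacing FUN = mean with FUN = median gives the group medians.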
2.1.2 Minimum and maximum
The function summary already gives you the minimum and maximum of all variables. But sometimes you need only these values. You can use the following lines.
min(iris$Sepal.Length)
[1] 4.3
max(iris$Sepal.Length)
[1] 7.9
You can also use range() to get both extreme values of a variable at once.
range(iris$Sepal.Length)
[1] 4.3 7.9
If you need the distance between the extreme values, you can use:
max(iris$Sepal.Length) - min(iris$Sepal.Length)
[1] 3.6
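The same distance can be computed from the output of range(), which avoids scanning the column twice. A minimal sketch, with sepal_spread as an illustrative name:

```r
# range() returns c(min, max); diff() subtracts the first from the second.
sepal_spread <- diff(range(iris$Sepal.Length))
sepal_spread
```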
2.1.3 Standard deviation and variance
Finally, you can also compute the standard deviation and variance of one variable as follows.
sd(iris$Sepal.Length)
[1] 0.8280661
var(iris$Sepal.Length)
[1] 0.6856935
If you want the standard deviation of all numeric variables:
lapply(iris[, 1:4], sd)
$Sepal.Length
[1] 0.8280661
$Sepal.Width
[1] 0.4358663
$Petal.Length
[1] 1.765298
$Petal.Width
[1] 0.7622377
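To see what sd() and var() compute, the sketch below rebuilds both by hand for one column. It relies on the fact that R uses the sample formulas, which divide by n - 1 rather than n. The names x, n, manual_var, manual_sd, and sd_vec are only illustrative.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is the square root of the variance.
manual_sd <- sqrt(manual_var)

# Both checks return TRUE when the manual values match the built-in functions.
all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))

# sapply() returns a named vector instead of the list that lapply() gives.
sd_vec <- sapply(iris[, 1:4], sd)
sd_vec
```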
2.2 Correlation
The following lines produce a correlation table. First, create a new data frame that contains only the numeric variables, because cor() does not accept the factor Species. A correlation table is a standard element of an academic paper.
library(dplyr)
iris_num <- select(iris, -Species)
cor(iris_num)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
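If dplyr is not installed, the numeric columns can be selected with base R alone. The sketch below also rounds the coefficients to two decimals, which is a common choice for a table in a paper but only an assumption here. The names is_num and cor_base are hypothetical.

```r
# Logical vector marking which columns are numeric (Species is FALSE).
is_num <- sapply(iris, is.numeric)

# Correlation matrix of the numeric columns, rounded for presentation.
cor_base <- round(cor(iris[, is_num]), 2)
cor_base
```

Selecting by type rather than by name means the same two lines keep working if the data frame gains other non-numeric columns.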
2.3 Frequency table
Here is another important table you might use in your paper: the frequency of observations by group.
table(iris$Species)
setosa versicolor virginica
50 50 50
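Papers often report shares alongside raw counts. As a small sketch, prop.table() converts a table of counts into proportions; the names species_counts and species_share are only illustrative.

```r
# Absolute frequencies by species.
species_counts <- table(iris$Species)

# Relative frequencies: each count divided by the total.
species_share <- prop.table(species_counts)

# Expressed as percentages, rounded to one decimal.
round(100 * species_share, 1)
```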
2.4 T-test
Let’s now run a t-test of the difference in means. For that, we will use another dataset: mtcars. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. You can find the description of the variables in the dataset's help page (?mtcars).
You will find a binary variable, am, that records the transmission type: the car is either automatic (0) or manual (1).
When you have binary variables, it is always a good idea to test if the means of the variables are different between the two groups of the binary variable.
This is an important technique that you will use often in your academic research. In fact, in many articles, the authors explore and compare two groups. So, be ready to run such an analysis.
First, import the new dataset. Then, repeat the first steps and inspect this dataset (I will not inspect the dataset here, but you should do so for practice).
mtcars <- mtcars
Then, use the binary variable to check whether the other variables have similar means. In the following case, I am comparing the average mpg (miles per gallon) of automatic and manual cars.
t.test(mpg ~ am, data=mtcars)
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
We see that the average for automatic cars (group 0) is around 17.1 mpg, while the average for manual cars (group 1) is around 24.4 mpg.
These averages are statistically different, since the t-statistic is large in absolute value (-3.77) and the p-value is low (0.001).
So, in this sample, automatic cars consume more fuel, that is, they travel fewer miles per gallon, than manual cars.
This type of test will be very important in your article.
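When writing up the result, it helps to pull the numbers out of the test object instead of copying them from the console. The sketch below stores the result of t.test() and reads its components. It works on a copy called cars and adds a labelled factor, transmission, so the groups print as "automatic" and "manual" rather than 0 and 1; both names are hypothetical, and the labels follow the coding documented in ?mtcars.

```r
# Work on a copy so the original mtcars stays untouched.
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

# Welch two-sample t-test (the default, which does not assume equal variances).
res <- t.test(mpg ~ transmission, data = cars)

res$statistic  # t statistic
res$parameter  # degrees of freedom
res$p.value    # p-value
res$conf.int   # 95 percent confidence interval for the difference in means
res$estimate   # the two group means

# Difference in means, manual minus automatic.
diff_means <- unname(res$estimate[2] - res$estimate[1])
diff_means
```

Passing var.equal = TRUE to t.test() would instead run the classic Student t-test, which assumes both groups have the same variance. The Welch version is usually the safer default when that assumption has not been checked.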