2  Basic Statistics

2.1 Basic stats

The following are simple examples of how to compute basic statistics in R. We will start by loading the data. Let’s use iris, a dataset that ships with R.

iris <- iris 

First, explore the structure of the dataset with str(iris).

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

We have 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The first four are numeric, while the last is a factor with three levels: setosa, versicolor, and virginica. The dataset contains 150 observations.

Let’s take a look at the first 10 observations in the dataset.

head(iris, 10) 
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Now, let’s take a look at the first 5 observations of each species.

Species: setosa
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
Species: versicolor
   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
51          7.0         3.2          4.7         1.4 versicolor
52          6.4         3.2          4.5         1.5 versicolor
53          6.9         3.1          4.9         1.5 versicolor
54          5.5         2.3          4.0         1.3 versicolor
55          6.5         2.8          4.6         1.5 versicolor
Species: virginica
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
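
The call that produced the listing above is not shown. A minimal sketch that gives the same rows, assuming the built-in iris data and base R only, applies head() to each species subset with by(). The object name first_five is hypothetical, and the group header that by() prints may differ slightly from the one shown above.

```r
# Sketch: first 5 rows of each species, using base R only.
# `first_five` is a hypothetical name chosen for this example.
first_five <- by(iris, iris$Species, head, n = 5)

# Print all three groups
first_five

# Or pull out a single group by name
first_five[["versicolor"]]
```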

2.1.1 Mean and Median

Now, let’s find the means and the medians of the four numeric variables. The most intuitive way is to call mean() and median() on each column, one at a time.

[1] 5.843333
[1] 3.057333
[1] 3.758
[1] 1.199333
[1] 5.8
[1] 3
[1] 4.35
[1] 1.3
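
The eight values above (four means, then four medians) can be reproduced with calls along these lines. This is a sketch assuming the built-in iris data; the values in the comments are the ones printed above.

```r
# Means of the four numeric columns
mean(iris$Sepal.Length)    # 5.843333
mean(iris$Sepal.Width)     # 3.057333
mean(iris$Petal.Length)    # 3.758
mean(iris$Petal.Width)     # 1.199333

# Medians of the same columns
median(iris$Sepal.Length)  # 5.8
median(iris$Sepal.Width)   # 3
median(iris$Petal.Length)  # 4.35
median(iris$Petal.Width)   # 1.3
```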

But the easiest way is to call summary(iris), which reports these statistics for every variable at once.

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

Note that the summary above also shows the number of observations of each species. If you want to know how many observations the whole dataset has, you can count its rows, for example with nrow(iris). You will often need this number later.

[1] 150
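
The exact call behind this output is not shown. As a sketch, either of the following base R calls returns 150 for the built-in iris data; nrow() is probably the more common choice.

```r
# Number of rows (observations) in the data frame
nrow(iris)

# Equivalent here: the length of any single column
length(iris$Species)
```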

If you want the same summary for each group, do as follows:

by(iris, iris$Species, summary)
iris$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
iris$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
iris$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500                  

2.1.2 Minimum and maximum

The function summary already gives you the minimum and maximum of all variables. But sometimes you need only these values. In that case, you can call min() and max() on a single column.

[1] 4.3
[1] 7.9
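
The two values above match the minimum and maximum of Sepal.Length in the summary table, so a sketch of the calls, assuming that column was the one used, looks like this:

```r
# Smallest and largest sepal length in the built-in iris data
min(iris$Sepal.Length)  # 4.3
max(iris$Sepal.Length)  # 7.9
```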

You could also use range(), which returns both extreme values of a variable in a single call.

[1] 4.3 7.9
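
A sketch of the corresponding call, again assuming Sepal.Length is the variable of interest. The name sl_range is hypothetical.

```r
# range() returns a vector of length 2: c(minimum, maximum)
sl_range <- range(iris$Sepal.Length)
sl_range      # 4.3 7.9

# The two elements can be used separately
sl_range[1]   # minimum
sl_range[2]   # maximum
```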

If you need the distance between the extreme values, you can use:

max(iris$Sepal.Length) - min(iris$Sepal.Length)
[1] 3.6
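
As an alternative sketch, the same distance can be obtained by combining range() with diff(), which subtracts the first element from the second. The name sl_spread is hypothetical.

```r
# diff() of the two-element range gives max - min
sl_spread <- diff(range(iris$Sepal.Length))
sl_spread
```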

2.1.3 Standard deviation and variance

Finally, you can also compute the standard deviation and the variance of one variable with sd() and var().

[1] 0.8280661
[1] 0.6856935
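
The two values above are the standard deviation and variance of Sepal.Length (the first one reappears under that variable in the per-column output further down). A sketch of the calls, assuming the built-in iris data:

```r
# Sample standard deviation and variance (both use the n - 1 denominator)
sd(iris$Sepal.Length)   # 0.8280661
var(iris$Sepal.Length)  # 0.6856935

# The standard deviation is the square root of the variance
all.equal(sd(iris$Sepal.Length), sqrt(var(iris$Sepal.Length)))
```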

If you want the standard deviation of all numeric variables:

lapply(iris[, 1:4], sd)
$Sepal.Length
[1] 0.8280661

$Sepal.Width
[1] 0.4358663

$Petal.Length
[1] 1.765298

$Petal.Width
[1] 0.7622377
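
If a compact result is preferred, one option is sapply(), which simplifies the list into a named numeric vector. This is a sketch of an alternative, not the call used above; the name iris_sd is hypothetical.

```r
# sapply() returns one named value per column instead of a list
iris_sd <- sapply(iris[, 1:4], sd)
iris_sd

# Individual values can then be looked up by column name
iris_sd["Petal.Length"]
```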

2.2 Correlation

The following lines produce a correlation table. First, create a new data frame with only the numeric variables (select() comes from the dplyr package, so load it first), then pass that data frame to cor(). A correlation table is a standard part of an academic paper.

library(dplyr)
iris_num <- select(iris, -Species)
cor(iris_num)
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
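
If dplyr is not installed, the same table can be built with base R alone. The sketch below assumes the numeric variables are the first four columns of iris, as shown earlier; cor_table is a hypothetical name, and the rounding is only for readability.

```r
# Keep the four numeric columns by position, then correlate them
# (cor() computes Pearson correlations by default)
cor_table <- cor(iris[, 1:4])

# Round to two decimals for a more compact table
round(cor_table, 2)
```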

2.3 Frequency table

Here is another important table you might use in your paper: the frequency of observations by group. The function table() counts how many observations fall in each group.


    setosa versicolor  virginica 
        50         50         50 
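
The call behind the counts above is not shown. A sketch that produces the same table, assuming the built-in iris data, is given below; species_counts is a hypothetical name.

```r
# Count the observations in each level of Species
species_counts <- table(iris$Species)
species_counts

# Relative frequencies, if proportions are needed instead of counts
prop.table(species_counts)
```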

2.4 t-test

Let’s now run a t-test of the difference in means. For that, we will use another dataset: mtcars. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles. You can find the description of the variables in the R help page (?mtcars).

You will find that the variable am is binary: the cars are either automatic (0) or manual (1).

When you have a binary variable, it is often a good idea to test whether the means of the other variables differ between its two groups.

You will use this test a lot in your academic research. In fact, in many articles, the authors explore and compare two groups. So, be ready to run such an analysis.

First, load the new dataset. Then, repeat the first steps and inspect it (I will not inspect the dataset here, but you should, as practice).

mtcars <- mtcars
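
As a starting point for that practice, the sketch below repeats on mtcars the inspection steps used for iris earlier and adds a count of cars per transmission type. It assumes only the built-in mtcars data; am_counts is a hypothetical name.

```r
# Structure and first rows, as done for iris
str(mtcars)
head(mtcars, 10)

# Number of cars in each group of the binary variable am
# (0 = automatic, 1 = manual, according to ?mtcars)
am_counts <- table(mtcars$am)
am_counts
```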

Then, use the binary variable to check whether the other variables have similar means in the two groups. In the following case, I am comparing the average mpg (miles per gallon) of automatic cars with that of manual cars.

t.test(mpg ~ am, data=mtcars)

    Welch Two Sample t-test

data:  mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -11.280194  -3.209684
sample estimates:
mean in group 0 mean in group 1 
       17.14737        24.39231 

We see that the average mpg of automatic cars (group 0) is around 17.1, while the average of manual cars (group 1) is around 24.4.

These averages are statistically different: the t-statistic is large in absolute value (-3.77) and the p-value is low (0.001).

So, in this sample, automatic cars travel fewer miles per gallon, that is, they consume more gas than manual cars.

This type of test will be very important in your article.
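
When writing up the result, it can help to store the test in an object and extract the numbers you need instead of copying them from the console. The sketch below is one way to do that. The names mtcars_lab, transmission, and tt are hypothetical, and the labels follow the coding in ?mtcars (0 = automatic, 1 = manual).

```r
# Work on a copy so the original mtcars stays untouched
mtcars_lab <- mtcars

# Give the two groups readable labels
mtcars_lab$transmission <- factor(mtcars_lab$am,
                                  levels = c(0, 1),
                                  labels = c("automatic", "manual"))

# Same Welch two-sample t-test as above, stored in an object
tt <- t.test(mpg ~ transmission, data = mtcars_lab)

tt$estimate   # group means
tt$statistic  # t-statistic
tt$p.value    # p-value
tt$conf.int   # 95 percent confidence interval for the difference in means
```

Labelling the groups changes only how the output is named; the test itself is the same as t.test(mpg ~ am, data = mtcars).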