Dr. Ali Sajid Imami
March 20, 2018
Summary measures are essential in basic statistical work, and graphical representation is just as useful for exploring data. A common mistake is to rely on one and ignore the other. This can be inconvenient at best and downright misleading at worst.
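We will explore the interplay of graphics and summary statistics using Anscombe’s Quartet. Its four datasets have the same, or nearly the same, summary measures, including the mean, standard deviation, and variance of x and y, and the correlation between x and y.
However, the datasets themselves are very different.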
set | Mean x | SD x | Var x | Mean y | SD y | Var y | Correlation |
---|---|---|---|---|---|---|---|
A | 9 | 3.32 | 11 | 7.5 | 2.03 | 4.13 | 0.816 |
B | 9 | 3.32 | 11 | 7.5 | 2.03 | 4.13 | 0.816 |
C | 9 | 3.32 | 11 | 7.5 | 2.03 | 4.12 | 0.816 |
D | 9 | 3.32 | 11 | 7.5 | 2.03 | 4.12 | 0.817 |
As we can see, the summary measures are virtually identical across all four sets. Any differences appear only in the second or third decimal place.
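For readers who want to reproduce these figures, here is a minimal sketch in Python. It is not the processing script mentioned at the end of this post. The data are the standard published values of Anscombe's quartet, typed in by hand, and the names `QUARTET`, `pearson`, and `summarize` are illustrative choices of mine.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Standard published values of Anscombe's quartet (assumed here, not taken
# from the author's script). Sets A, B and C share the same x values.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "A": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summarize(x, y):
    """Sample mean, standard deviation and variance of x and y, plus correlation."""
    return {
        "mean_x": mean(x), "sd_x": stdev(x), "var_x": variance(x),
        "mean_y": mean(y), "sd_y": stdev(y), "var_y": variance(y),
        "corr": pearson(x, y),
    }


summaries = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

# Print one line of summary measures per set.
for name, s in summaries.items():
    print(name, " ".join(f"{key}={value:.3f}" for key, value in s.items()))
```

The sketch uses sample statistics (dividing by n - 1), which is what `statistics.stdev` and `statistics.variance` compute.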
Here we will calculate the regression line for each set.
term | A | B | C | D |
---|---|---|---|---|
(Intercept) | 3.0000909 | 3.000909 | 3.0024545 | 3.0017273 |
x | 0.5000909 | 0.500000 | 0.4997273 | 0.4999091 |
As you can see, the slopes and intercepts are almost identical across all four sets.
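The coefficients in the table can be checked with a short least-squares sketch in Python. This is an illustration under my own assumptions rather than the original processing script. The data are the standard published values of the quartet, and `fit_line` is a hypothetical helper that uses the textbook formulas slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x).

```python
from statistics import mean

# Standard published values of Anscombe's quartet (assumed here, not taken
# from the author's script). Sets A, B and C share the same x values.
X_ABC = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "A": (X_ABC, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (X_ABC, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (X_ABC, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


fits = {name: fit_line(x, y) for name, (x, y) in QUARTET.items()}

# Print the fitted coefficients for each set.
for name, (intercept, slope) in fits.items():
    print(f"{name}: intercept={intercept:.7f} slope={slope:.7f}")
```

Every set gives a line of roughly y = 3 + 0.5x, which is why the fitted lines alone cannot tell the four datasets apart.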
It is evident from the graphs that despite having the same summary measures and regression coefficients and lines, the data look very different.
This becomes even clearer when you look at them on the same graph:
You can download my processing script from this link.