Misleading Summaries

Exploring Data Through Numbers and Graphics

Dr. Ali Sajid Imami

March 20, 2018

Summary Measures

Summary measures are central to basic statistical work, and graphical representations are just as useful for exploring data. A common mistake is to rely on one and ignore the other, which can be inconvenient at best and downright misleading at worst.

Anscombe’s Quartet

We will explore the interplay of graphics and data using Anscombe’s Quartet, a collection of four datasets whose summary measures are the same, or very nearly the same, including:

Mean
Variance
Correlation
Regression Lines

However, the four datasets are in fact very different.

Anscombe’s Quartet Summary Measures

Set	Mean x	SD x	Var x	Mean y	SD y	Var y	Correlation
A	9	3.32	11	7.5	2.03	4.13	0.816
B	9	3.32	11	7.5	2.03	4.13	0.816
C	9	3.32	11	7.5	2.03	4.12	0.816
D	9	3.32	11	7.5	2.03	4.12	0.817

As we can see, the summary measures are nearly identical across the four sets, differing by at most about 0.01.
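The summary measures can be recomputed with a short script. The following is a minimal sketch in Python using only the standard library, not the processing script mentioned in the Appendix. The data are the standard published values of Anscombe’s Quartet, hardcoded here; the names `quartet`, `pearson`, and `summarize` are illustrative choices.

```python
from statistics import mean, stdev, variance

# Standard published values of Anscombe's Quartet (hardcoded for this sketch).
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets A, B and C share x
quartet = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarize(xs, ys):
    """Sample mean, standard deviation and variance of x and y, plus correlation."""
    return {
        "mean_x": float(mean(xs)), "sd_x": float(stdev(xs)), "var_x": float(variance(xs)),
        "mean_y": float(mean(ys)), "sd_y": float(stdev(ys)), "var_y": float(variance(ys)),
        "corr": pearson(xs, ys),
    }


summary = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}

for name, s in summary.items():
    print(name, " ".join(f"{key}={value:.3f}" for key, value in s.items()))
```

Printing three decimals rather than two makes the small differences between the sets visible.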

Regression Lines

Here we calculate the regression line for each set.

term	A	B	C	D
(Intercept)	3.0000909	3.000909	3.0024545	3.0017273
x	0.5000909	0.500000	0.4997273	0.4999091

As you can see, the slopes and intercepts are so close to one another as to be almost identical.
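The coefficients above can be checked with the closed-form formulas for simple linear regression: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept is the mean of y minus the slope times the mean of x. The sketch below assumes an ordinary least-squares fit of y on x, which is consistent with the coefficients shown; the data are the standard published quartet values, and `fit_line` is a hypothetical helper name.

```python
from statistics import mean

# Standard published values of Anscombe's Quartet (hardcoded for this sketch).
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets A, B and C share x
quartet = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return {"intercept": intercept, "slope": slope}


fits = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}

for name, fit in fits.items():
    print(f"{name}: intercept={fit['intercept']:.7f} slope={fit['slope']:.7f}")
```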

Set A Graphical Representation

Set B Graphical Representation

Set C Graphical Representation

Set D Graphical Representation

Discrepancy in the Graphs

It is evident from the graphs that, despite having the same summary measures and regression lines, the four datasets look very different.

This becomes even clearer when you look at them on the same graph:
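A rough way to reproduce this comparison without a plotting library is a text scatter plot. The sketch below is an illustration only, not the method used to produce the original figures. It draws each set on identical axes so the shapes can be compared directly. The axis ranges, the grid size, and the name `text_scatter` are arbitrary choices, and the data are the standard published quartet values.

```python
# Standard published values of Anscombe's Quartet (hardcoded for this sketch).
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # sets A, B and C share x
quartet = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def text_scatter(xs, ys, width=40, height=12, x_range=(3, 20), y_range=(2, 14)):
    """Render points as '*' on a character grid; the axis ranges are fixed
    so that every set is drawn at the same scale."""
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_range[0]) / (x_range[1] - x_range[0]) * (width - 1))
        row = round((y - y_range[0]) / (y_range[1] - y_range[0]) * (height - 1))
        grid[height - 1 - row][col] = "*"  # row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)


plots = {name: text_scatter(xs, ys) for name, (xs, ys) in quartet.items()}

for name, plot in plots.items():
    print(f"Set {name}")
    print(plot)
    print("-" * 40)
```

Even at this coarse resolution, set D stands out: all but one of its points sit in a single column.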

Appendix

You can download my processing script from this link.