Visualizing and presenting data
What are data?
We will be mostly concerned here with data consisting of a number, or numbers, that describe some feature(s) we are measuring. The most typical situation is when we have a set of subjects \( 1, 2, 3, \dots, n \) and we measure a quantity \(x\) for each subject. We will call these numbers \( x_1, x_2, \dots, x_n \). In this case we speak of a data set \( \{x_1, x_2, \dots, x_n\} \) describing the group. For example, we might have a group of people, and we measure a systolic blood pressure (sbp) reading for each of them. Blood pressures are generally measured in millimeters of mercury (mmHg), a standard unit of pressure.
In this case the data set might look like:
Subject No. | Name | Blood Pressure (mmHg) |
---|---|---|
1 | Sally E | \(x_1 = 108\) |
2 | Bob J | \(x_2 = 126\) |
⋮ | ⋮ | ⋮ |
n | Steve W | \(x_n = 118\) |
The first thing to do with any data set is: look at it! This may seem obvious, but it is surprising how often this is not done, with serious consequences.
How to look at data
Dot Plot In the case where the data set is small to middling, say \( n < 100 \), the most important visualization is the dot plot: we simply plot the data values as points on a single axis (horizontal or vertical).
Horizontal/Vertical dot plot
In this chapter, we will use the horizontal presentation, but in later chapters, when we are comparing several groups, we will use the vertical presentation.
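As a rough illustration of the horizontal dot plot, here is a minimal text-only sketch in Python. The function name `text_dot_plot`, its parameters, and the blood pressure readings are all hypothetical, made up for this example. A real plot would normally be drawn with a graphics library.

```python
def text_dot_plot(values, lo, hi, width=60):
    """Render values as 'o' marks on a single horizontal text axis.

    Assumption: every value lies in the interval [lo, hi].
    """
    row = [" "] * (width + 1)
    for x in values:
        if not lo <= x <= hi:
            raise ValueError(f"value {x} is outside the axis range")
        # Map x from [lo, hi] to a column index in [0, width].
        col = round((x - lo) / (hi - lo) * width)
        row[col] = "o"
    axis = "-" * (width + 1)
    return "".join(row) + "\n" + axis


# Hypothetical systolic blood pressure readings in mmHg (not real data).
sbp = [108, 126, 118, 131, 112, 121, 104, 135, 116, 124]
print(text_dot_plot(sbp, lo=100, hi=140))
```

Two values that map to the same column produce a single mark, which is the overlap problem that affects larger data sets.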
When the data set gets larger than 20 or so, there is a real chance that data points might overlap, giving us a misleading picture of the data. Consider the four data sets shown below. If the data set is small, \(n = 10\) (first row), there is no problem with the dot plot, and it shows the data nicely. For \(n = 20\) (second row) there is already some overlapping of data points: only 17 dots can be seen; the other 3 are hidden behind them. By the time \(n = 50\) (third row) the degree of overlap is substantial, so the plot misrepresents the data, and when \(n = 100\) (bottom row) the misrepresentation is much worse.
For these larger data sets, we need a method for preventing overlap of the data points.
Dot plots for 4 different data sets of increasing size. Note the overlap of data points in the larger sets
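The amount of overlap can be estimated before plotting. The sketch below assumes that two dots merge visually when their centers are closer than one dot width. The function `visible_dots` and the simulated readings are hypothetical, and the counts depend on the random sample, so they will not reproduce the figure above.

```python
import random


def visible_dots(values, dot_width):
    """Count the dots that stay distinguishable on a single axis.

    Assumption: a dot is hidden when its center lies closer than
    dot_width to the most recently counted visible dot.
    """
    visible = 0
    last_drawn = None
    for x in sorted(values):
        if last_drawn is None or x - last_drawn >= dot_width:
            visible += 1
            last_drawn = x
    return visible


# Simulated readings rounded to whole mmHg (made-up data, for illustration only).
rng = random.Random(0)
for n in (10, 20, 50, 100):
    sample = [round(rng.gauss(120, 10)) for _ in range(n)]
    seen = visible_dots(sample, dot_width=1)
    print(f"n = {n:3d}: {seen} visible, {n - seen} hidden")
```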
Jitter One such method is the use of “jitter”. Since the data is one-dimensional, we can use the second dimension of the page to spread the data out. We create a fictitious second axis, and then add a small random quantity to each data point along this axis. This separates equal or close data points so they don’t overlap. The fictitious axis carries no meaning of its own; only the position along the data axis matters.
Dot plots with jitter for 4 different data sets of increasing size. Note how the jitter reduces the overlap of data points in the larger sets
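A minimal sketch of jitter in Python follows. It draws each offset uniformly from a small interval, which is one common choice and an assumption here. The name `jitter_points` and the parameters `spread` and `seed` are hypothetical.

```python
import random


def jitter_points(values, spread=0.2, seed=None):
    """Pair each value with a small random offset on a fictitious second axis.

    The data value x is left unchanged; only the meaningless
    y coordinate is random, drawn uniformly from [-spread, spread].
    """
    rng = random.Random(seed)
    return [(x, rng.uniform(-spread, spread)) for x in values]


# Three equal readings no longer sit on top of each other.
for x, y in jitter_points([118, 118, 118, 126], seed=1):
    print(f"x = {x}, y = {y:+.3f}")
```

Because the offsets are random, two points can still land close together by chance, and the picture changes from run to run unless the seed is fixed.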
Beeswarm A variant of the jitter plot is the “beeswarm” plot. Here, when data points would otherwise overlap, we use a fixed spacing in the fictitious axis to offset each data point.
Dotplot ⇝ Beeswarm Plot
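The beeswarm idea can be sketched as follows. This is a simplified placement rule written for illustration, not the algorithm of any particular plotting package: each point takes the first free level among 0, +1, −1, +2, −2, …, where a level counts as occupied if an already placed point on it lies within one dot width along the data axis. The names `beeswarm_points`, `dot_width`, and `spacing` are hypothetical.

```python
def beeswarm_points(values, dot_width=1.0, spacing=1.0):
    """Offset overlapping points by fixed steps on a fictitious second axis."""
    placed = []  # (x, level) pairs already positioned
    for x in sorted(values):
        k = 0
        while True:
            # Candidate levels in the order 0, +1, -1, +2, -2, ...
            level = (k + 1) // 2 if k % 2 else -(k // 2)
            collides = any(
                abs(x - px) < dot_width and plevel == level
                for px, plevel in placed
            )
            if not collides:
                placed.append((x, level))
                break
            k += 1
    # Convert integer levels to offsets with a fixed spacing.
    return [(x, level * spacing) for x, level in placed]


print(beeswarm_points([118, 118, 118, 126]))
```

Unlike jitter, the offsets are deterministic, so the same data always produce the same picture.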
Looking at the three forms of presentation, it is clear that the simple dot plot is fine for small data sets, but that for larger ones, say \(n > 20\), the beeswarm plot gives the best picture of the data.
What to look for in a data plot
The first and most important step in the analysis of the data is to look at the visual presentation. The true picture of the data set is its distribution, and we must always begin by looking at it to see what it is trying to tell us. Some important features to look for are:
Bunching up of data (symmetry vs. skew) Are the data fairly uniformly distributed over their range, or do they bunch up? If they do bunch up, are the preferred values in the middle, making the distribution fairly symmetric? Or do they bunch up on the left, which is called skewed to the right, or bunch up on the right, which is called skewed to the left? (Think of the term skew as meaning sticking out asymmetrically, so if the data are bunched to the left, the distribution is sticking out asymmetrically to the right.)
Skewed Left/Right
Outliers Are all the data in the same general range, or are there some points that are “way off” from the others? These points are often called “outliers”.
Outliers
Subgroups Are there noticeable gaps between bunches of data? In other words, does the data set seem to group into distinct subgroups, like a low group and a high group? Or does it seem to be continuously varying, with no noticeable gaps?
Subgroups
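As a small aid to the eye, the sketch below finds the widest gap between neighboring values in a data set. Whether that gap is “noticeable” remains a visual judgment; the code only points to where to look. The function `largest_gap` and the sample readings are hypothetical.

```python
def largest_gap(values):
    """Return (gap, lower, upper) for the widest gap between neighboring values."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values to have a gap")
    # Differences between consecutive values in sorted order.
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    return max(gaps)


# Made-up readings with a low group and a high group.
gap, lower, upper = largest_gap([104, 108, 110, 131, 135])
print(f"widest gap: {gap} (between {lower} and {upper})")
```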
Later on, we will learn methods for turning all these visual impressions into precise mathematical concepts. But it will always be true that the visual impression is the first step.