Looking at the distribution: histograms and kernel density plots
Experimenting with different bin widths
As the data sets begin to get larger, say, \(n > 20\), another form of data visualization comes into play: the histogram. In a histogram, we no longer show the individual data points. Rather, we divide the data axis into evenly spaced intervals, called bins, and sort the data values into the bins. Then we show the number of data points in each bin as a vertical bar erected over the interval that defines that bin.
In order to make a histogram, we have to make two choices:
- a choice of cutoff points
- a choice of bin width
Cutoff Points The cutoff points are the left and right extremes of the histogram. The most common rule is to use the minimum and maximum values of the data as the left and right cutoff points, respectively.
An exception to this rule would be when we want to compare two histograms, which is frequently done. In that case, we might want to insist that the two histograms have the same cutoff points, so the differences between the two distributions are obvious. Here, for example, we are using constant cutoff points of 40 mmHg and 260 mmHg to compare all the data sets.
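To make this concrete, here is a minimal sketch in Python, assuming the readings live in a NumPy array named `data`. Both the array name and the randomly generated stand-in values are hypothetical, not the actual 555 measurements; the cutoff points of 40 mmHg and 260 mmHg are the ones used in the text.

```python
import numpy as np

# Hypothetical stand-in for the 555 systolic blood-pressure readings (mmHg).
rng = np.random.default_rng(0)
data = rng.normal(loc=135, scale=15, size=555)

# Fixed cutoff points from the text.
left, right = 40, 260
bin_width = 10

# Evenly spaced bin edges spanning the cutoff points.
edges = np.arange(left, right + bin_width, bin_width)

# Sort the data values into the bins and count them; each count is the
# height of one vertical bar in the histogram.
counts, edges = np.histogram(data, bins=edges)
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:3.0f}, {hi:3.0f}): {c}")
```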
Bin Width = 1 Notice that there is a lot of fine structure in the histogram. There are peaks at 130 mmHg and 140 mmHg, separated by low-occupancy bins. But do these mean anything? Or are they just accidents, due to the fact that any sample of 555 data points will have some random peaks and gaps? For this reason, it is not a good idea to have too many bins: it emphasizes fine structure that might be completely due to chance.
When we double the bin width to 2, there are still “too many” bins, which produces a number of random gaps and peaks.
Histogram (bin width = 1)
Bin Width = 10 It is only when we get to a bin size = 10 that these random fluctuations are suppressed, and the resulting bins are large enough to give us a decent picture of the data. We see that the distribution is unimodal (single-humped) and symmetric. We also see that the left and right tails are tapered and thin.
Histogram (bin width = 10)
Bin Width = 20 A bin size = 20 confirms these observations. But at this bin size we are also beginning to lose key features of the distribution. The gentle tapering from the center to the tails is now lost, and virtually all data fall in the two middle bins. We have lost information about the distribution.
Histogram (bin width = 20)
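The bin-width experiment itself is easy to script. Here is a sketch under the same assumptions as before (stand-in data, cutoff points 40 and 260 mmHg), drawing the histogram at each of the four bin widths discussed above so the trade-off is visible side by side:

```python
import numpy as np
import matplotlib.pyplot as plt

# Same hypothetical stand-in data and cutoff points as in the earlier sketch.
rng = np.random.default_rng(0)
data = rng.normal(loc=135, scale=15, size=555)
left, right = 40, 260

fig, axes = plt.subplots(2, 2, figsize=(10, 6), sharex=True)
for ax, bin_width in zip(axes.flat, (1, 2, 10, 20)):
    edges = np.arange(left, right + bin_width, bin_width)
    ax.hist(data, bins=edges)
    ax.set_title(f"bin width = {bin_width}")
plt.tight_layout()
plt.show()
```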
Cumulative histograms
Since the histogram can be very sensitive to the choice of bin width, a more robust presentation is sometimes used: the cumulative histogram, in which the height of each bin is not the number of data points in that bin, but rather the total number of data points in that bin and in all bins to its left. In other words, the cumulative histogram keeps a running sum of the counts.
One advantage of the cumulative histogram is that we can easily read off the median value of the data (the halfway point; see the next section). We just find the halfway point on the vertical axis (in this case 50, since the axis is in percent) and then look to see what data value it corresponds to.
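In code, the cumulative histogram is a running sum over the ordinary histogram counts. A sketch, again with stand-in data, expressing the heights in percent so the median can be read off at the 50% mark:

```python
import numpy as np

# Hypothetical stand-in data, as in the earlier sketches.
rng = np.random.default_rng(0)
data = rng.normal(loc=135, scale=15, size=555)

edges = np.arange(40, 270, 10)            # bin width = 10, cutoffs 40-260
counts, _ = np.histogram(data, bins=edges)

# Cumulative histogram: each bin's height is the running total of the
# counts, expressed here as a percentage of all data points.
cumulative = 100 * np.cumsum(counts) / counts.sum()

# The median lies in the first bin whose cumulative height reaches 50%.
median_bin = np.searchsorted(cumulative, 50)
print(f"median is in [{edges[median_bin]}, {edges[median_bin + 1]}) mmHg")
```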
Cumulative Histogram
Kernel density estimates
Kernel density plot Another way to look at the distribution of the data is to make what is called a kernel density plot. We want to estimate the shape of the distribution, and a good way to do that is to use a set of elementary functions to represent each data point, and then let the elementary functions melt together or blend to make a picture of a continuous distribution that represents the dataset.
The elementary function we use can vary: it could be little triangles, little rectangles, or little smooth curves like the bell curve \( Y = e^{-\frac{X^2}{\sigma}} \). The key is that each of these functions has a parameter that controls its width. In the case of little rectangles or triangles, the parameter is the width of the base, and in the case of the bell curves, the parameter is \( \sigma \): the bigger the \( \sigma \), the wider the curve.
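As an illustration (not from the text), here are unit-height versions of these three elementary functions, each with the width parameter just described. A statistical kernel density estimate would normally scale each kernel to enclose unit area, but for picturing the shapes, unit height is enough.

```python
import numpy as np

def rectangle(x, width):
    """Little rectangle: height 1 over a base of length `width`."""
    return np.where(np.abs(x) <= width / 2, 1.0, 0.0)

def triangle(x, width):
    """Little triangle: peak 1 at x = 0, falling to 0 at the ends of the base."""
    return np.maximum(1 - np.abs(x) / (width / 2), 0.0)

def bell(x, sigma):
    """The bell curve Y = exp(-X^2 / sigma): the bigger sigma, the wider the curve."""
    return np.exp(-x**2 / sigma)
```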
Bell curve \( Y = e^{-\frac{X^2}{\sigma}}\)
Kernel density estimate To form a kernel density estimate, we start with narrow kernel functions, so narrow that each kernel surrounds one data value. Then we let the width get slowly bigger and bigger until the many little curves have merged into a single smooth curve. As \( \sigma \) gets bigger, the narrow kernels “melt” into a smoother function. This is called a kernel density estimate of the distribution.
Kernel Density Estimate Illustration
Here we will use bell curves as our kernel functions; this is the most common choice. The parameter that controls the width is \( \sigma \): the bigger \( \sigma \) is, the wider the function.
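Putting the pieces together, here is a minimal sketch of the melting process: one bell curve \( Y = e^{-\frac{X^2}{\sigma}} \) is centered on each data value, the curves are summed, and the sum is plotted for several widths. The data are the same hypothetical stand-in values as before, and the particular \( \sigma \) values are chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-in data, as in the earlier sketches.
rng = np.random.default_rng(0)
data = rng.normal(loc=135, scale=15, size=555)

x = np.linspace(40, 260, 1000)

# Center one bell curve on each data value and sum them; as sigma grows,
# the individual kernels melt into a single smooth curve.
for sigma in (1, 10, 100):
    density = np.exp(-(x[:, None] - data[None, :]) ** 2 / sigma).sum(axis=1)
    plt.plot(x, density, label=f"sigma = {sigma}")
plt.legend()
plt.xlabel("systolic blood pressure (mmHg)")
plt.ylabel("sum of kernels")
plt.show()
```

With small \( \sigma \) the sum still shows fine structure, with bumps following individual clusters of data values; by the largest \( \sigma \) the bumps have merged into a single smooth curve.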