Looking at the distribution: histograms and kernel density plots

Experimenting with different bin widths

As the data sets begin to get larger, say, \(n > 20\), another form of data visualization comes into play: the histogram. In a histogram, we no longer show the individual data points. Rather, we divide the data axis into evenly spaced intervals, called bins, and sort the data values into the bins. Then we show the number of data points in each bin as a vertical bar erected over the interval that defines that bin.

In order to make a histogram, we have to make two choices:

  • a choice of cutoff points
  • a choice of bin width

Cutoff Points The cutoff points are the left and right extremes of the histogram. The most common rule is to use the minimum and maximum values of the data as the left and right cutoff points, respectively.

An exception to this rule would be when we want to compare two histograms, which is frequently done. In that case, we might want to insist that the two histograms have the same cutoff points, so the differences between the two distributions are obvious. Here, for example, we are using constant cutoff points of 40 mmHg and 260 mmHg to compare all the data sets.

The bin width is the other critical choice. It has a radical effect on the shape of the histogram. It is useful to experiment with a variety of bin widths; each choice of a bin width brings out some features of the data and suppresses others. For example, let’s begin with a histogram with bin width = 1.
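This experimentation is easy to sketch in code. Here is a minimal Python sketch; the 555 readings themselves aren't reproduced here, so a simulated stand-in array is used, and the fixed 40–260 mmHg cutoffs match the comparison convention above:

```python
import numpy as np

# Simulated stand-in for the 555 blood-pressure readings (mmHg);
# in practice this would be the actual data set.
rng = np.random.default_rng(0)
bp = rng.normal(loc=140, scale=20, size=555)

# Fixed cutoff points (40 and 260 mmHg) keep histograms with
# different bin widths directly comparable.
lo, hi = 40, 260
for width in (1, 2, 10, 20):
    edges = np.arange(lo, hi + width, width)  # evenly spaced bin edges
    counts, _ = np.histogram(bp, bins=edges)
    print(f"bin width {width:2d}: {len(counts):3d} bins, "
          f"tallest bar = {counts.max()} points")
```

Running this shows the trade-off numerically: at width 1 the tallest bar holds only a handful of points (lots of random fine structure), while at width 20 nearly all the data pile into a few bars.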

Bin Width = 1 Notice that there is a lot of fine structure in the histogram. There are peaks at 130 mmHg and 140 mmHg, separated by low-occupancy bins. But do these mean anything? Or are they just accidents, due to the fact that any sample of 555 data points will have some random peaks and gaps? For this reason, it is not a good idea to have too many bins: it emphasizes fine structure that might be completely due to chance.

When we double the bin width to 2, there are still “too many” bins, which produces a number of random gaps and peaks.

Histogram (bin width = 1)

Bin Width = 10 It is only when we get to a bin size = 10 that these random fluctuations are suppressed, and the resulting bins are large enough to give us a decent picture of the data. We see that the distribution is unimodal (single-humped) and symmetric. We also see that the left and right tails are tapered and thin.

Histogram (bin width = 10)

Bin Width = 20 A bin size = 20 confirms these observations. But when we take our bin size = 20, we are beginning to lose key features of the distribution. The gentle tapering from the center to the tails is now lost, and virtually all data are in the two middle bins. We have lost information about the distribution.

Histogram (bin width = 20)

Clearly, the choice of bin number depends on the number of data points. There are some simple rules of thumb for choosing bin number. The two most popular are the square root rule: choose the smallest whole number bigger than or equal to \(\sqrt{n}\), where \(n\) is the number of data points, and the Rice Rule: choose the smallest whole number bigger than or equal to \(2 \times \left( n^{1/3} \right) \). In this case, where \( n = 555 \), the square root rule gives \( \sqrt{555} \approx 23.6 \), requiring 24 bins, while the Rice Rule gives \( 2 \times \left( 555 ^ {1/3}\right) \approx 16.4 \), requiring 17 bins. There are other, more sophisticated rules that take into account the variability of the data.
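Both rules of thumb can be checked in a few lines (standard library only):

```python
import math

n = 555  # number of data points

# Square root rule: smallest whole number >= sqrt(n)
sqrt_rule = math.ceil(math.sqrt(n))

# Rice Rule: smallest whole number >= 2 * n^(1/3)
rice_rule = math.ceil(2 * n ** (1 / 3))

print(sqrt_rule, rice_rule)  # -> 24 17
```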

Histogram: experiment with different bin widths


Cumulative histograms

Since the histogram can be very sensitive to the choice of bin width, a more robust presentation is sometimes used: the cumulative histogram, in which each bin is given a height which is not the number of data points in that bin, but rather the number of data points in that bin and all bins to its left. In effect, it keeps a running cumulative sum.

One advantage of the cumulative histogram is that we can easily read off the median value of the data (the halfway point; see next section). We just find the halfway point on the vertical axis, in this case 50, and then look to see what data value it corresponds to.
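The read-off can be sketched directly (again with simulated stand-in data; expressing the running total as a percent makes the 50% halfway point explicit):

```python
import numpy as np

rng = np.random.default_rng(0)
bp = rng.normal(loc=140, scale=20, size=555)  # stand-in data (mmHg)

edges = np.arange(40, 270, 10)                # bin width = 10
counts, _ = np.histogram(bp, bins=edges)
cumulative = np.cumsum(counts)                # running total per bin

# Express the running total as a percent of all data points
percent = 100 * cumulative / cumulative[-1]

# Median read-off: the first bin whose cumulative percent reaches 50%
median_bin = np.searchsorted(percent, 50)
print(f"the median lies in the bin ending at {edges[median_bin + 1]} mmHg")
```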

Cumulative Histogram


Kernel density estimates

Kernel density plot Another way to look at the distribution of the data is to make what is called a kernel density plot. We want to estimate the shape of the distribution, and a good way to do that is to use a set of elementary functions to represent each data point, and then let the elementary functions melt together or blend to make a picture of a continuous distribution that represents the dataset.

The elementary function we use can vary: it could be little triangles, little rectangles, or little smooth curves like the bell curve \( Y = e^{-\frac{X^2}{\sigma}} \). The key is that each of these functions has a parameter that controls its width. In the case of little rectangles or triangles, the parameter is the width of the base, and in the case of the bell curves, the parameter is \( \sigma \): the bigger the \( \sigma \), the wider the curve.

Bell curve \( Y = e^{-\frac{X^2}{\sigma}}\)

Kernel density estimate To form a kernel density estimate, we start with narrow kernel functions, so narrow that each kernel surrounds one data value. Then we let the width get slowly bigger and bigger until the many little curves have merged into a single smooth curve. As \( \sigma \) gets bigger, the narrower kernels “melt” into a smoother function. The result is called a kernel density estimate of the distribution.
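The melting process can be sketched by hand, without a statistics library: each data point \( d \) contributes one bell curve \( e^{-(x-d)^2/\sigma} \), and the curves are summed. The data array is a simulated stand-in, and the final rescaling step (so the curve has unit area, like a probability density) is an addition here, not part of the text's recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
bp = rng.normal(loc=140, scale=20, size=555)  # stand-in data (mmHg)

def kernel_density(data, grid, sigma):
    """Sum one bell curve exp(-(x - d)^2 / sigma) per data point d,
    then rescale so the curve has area 1 (a probability density)."""
    # grid[:, None] - data broadcasts to a (grid point, data point) array
    bumps = np.exp(-(grid[:, None] - data) ** 2 / sigma)
    density = bumps.sum(axis=1)
    step = grid[1] - grid[0]
    return density / (density.sum() * step)  # Riemann-sum normalization

grid = np.linspace(40, 260, 500)
for sigma in (1, 10, 100):                    # narrow -> wide kernels
    d = kernel_density(bp, grid, sigma)
    print(f"sigma = {sigma:3d}: peak density {d.max():.4f}")
```

With small \( \sigma \) the curve is spiky (each kernel hugs its own data point); as \( \sigma \) grows, the peaks merge into a single smooth hump.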

Kernel Density Estimate Illustration

Here we will use bell curves as our kernel functions; this is the most common choice. The parameter that controls the width is \( \sigma \): the bigger \( \sigma \) is, the wider the function.

Kernel Density Estimate
