Sunday, March 20, 2005

Creating histograms using the SQL function WIDTH_BUCKET

Histograms are one of the mostly commonly used statistical tools for studying data distributions. Statistician use histograms to visualize the shape of the data set and determine if there are gaps or clusters in the data set.

Histogram are actually just specialized bar charts. To construct a histogram, divide the data range into a number of equal width buckets and then count the number of data points in each bucket. The real art of creating a histogram is choosing the number of buckets. More on that later.

Below is a histogram that shows the distribution of employee monthly salaries of the employees in the HR table.



This histogram has 11 data range buckets. The histogram tells us that most of our employees have monthly salaries between $2,100 and $4,290. We have no employees with monthly salaries between $17,430 and $24,000. We have one outlier employee (probably the CEO) who has a monthly salary greater than or equal to $24,000.

So, let's assume we wanted to compute this histogram using Oracle SQL. Well, that's easy. Oracle provides a SQL function called WIDTH_BUCKET that tells us which bucket a data point belongs to.

For example, the following snippet of SQL divides the HR schema employee monthly salary data range from $2,100 to $24,000 into 10 equal width buckets and assigns each data point to one of those buckets.

select employee_id, salary, WIDTH_BUCKET(salary, 2100, 24000, 10) from employees

Thus, the width of each bucket will be (24,000 - 2,100)/10 = 2,190. The first bucket will range from $2,100 to $4,290. There are 46 employees that fall into this bucket. The data range for the buckets is inclusive at the low end and exclusive at the high end.

select count(1) from employees where salary >= 2100 and < 4590

Now, comes the fun part. Let's use WIDTH_BUCKET to compute the distribution of employee salaries.


select bucket, 2100+(bucket-1)*2190 as low, 2100+bucket*2190 as high , count(1) as cnt
from
(
select width_bucket(salary, 2100, 24000, 10) as bucket from employees
)
group by bucket order by bucket


BUCKET LOW HIGH CNT
---------- ---------- ---------- ----------
1 2100 4290 46
2 4290 6480 10
3 6480 8670 23
4 8670 10860 15
5 10860 13050 8
6 13050 15240 2
7 15240 17430 2
11 24000 26190 1


Wait a minute, there are 11 buckets in the results but we only asked for 10 buckets. What happened? Remember that the lower bucket bound is inclusive and the upper bucket bound is exclusive. We have an employee that make $24,000 per month. He can't be counted in the $21,810 to $24,000 bucket. We have to add $24,000-$26,190 bucket to accommodate him. Oracle handles this automatically and creates a 11th bucket even though we requested 10 buckets.

I mentioned earlier that the choice of the number buckets influences the effectiveness of the histogram. There is no single best way for choosing the number of buckets. I recommend the following two heuristics.


Number of buckets = 1 + 3.3 x log(N) for data sets which N >= 15


or


Number of buckets = (max value - min value) / (2 x IQR x n1/3)


N is the number of data points in the data set.
IQR is the interquartile range. i.e. (75th quartile value - 25th quartile value). The 25th and 75th quartile values can be compute using DBMS_STATS_FUNC.SUMMARY.

You can read "How Wide is Your Bin" for a good summary of different techniques for choosing the number of buckets. B.T.W., histogram buckets are also commonly referred to as bins by some people.

This website has a Java applet that let's you vary the number of histogram buckets to see the effects on an histogram. It's definitely worth exploring.

0 Comments:

Post a Comment

<< Home