Tuesday, March 22, 2005

Computing Interquartile Range

There are two ways of measuring the spread of a data distribution. Standard Deviation measures the spread of a data distribution from the mean and is thus influenced by outliers. Interquartile Range (IQR) measures the spread of a data distribution from the median and is thus not influenced by outliers.

When you look at boxplots, the box represents the IQR.

The formula for IQR is simple:

IQR = Q3 - Q1

where Q3 is the 75th Quantile and Q1 is the 25th Quantile. You can used the SUMMARY function provided by the PL/SQL package DBMS_STAT_FUNCS to compute Q3 and Q1. Unfortunately, the DBMS_STAT_FUNCS package does not provide a way to subset the data.

So, today we will use SQL to compute Q3 and Q1. First, divide the data set into a high and low group split at the median. Q3 is the median of the high group and Q1 is the median of the low group.

The only catch is that if the data set has an odd number of values, we have to include the median of the entire data set in the high and low groups.

I am going to use the Oracle sample HR data set to illustrate IQR computation.

Step 1: Compute the median of the data set and determine whether the number of elements in the data set is even or odd

select median(salary), mod(count(1), 2) as isOdd from employees

The median salary is $6,200 and isOdd is 1 which means that we have an odd number values in our data set. There are 107 employees in this data set.

Step 2: Compute Q1

select median(salary) as Q1 from employees where salary <= 6200

If we had an even number of data values, we would do the following:

select median(salary) as Q1 from employees where salary <>

Step 3: Compute Q3

select median(salary) as Q3 from employees where salary >= 6200

If we had an even number of data values, we would do the following:

select median(salary) as Q3 from employees where salary > 6200

Step 4: Compute IQR

Step 2 and Step 3 result in a Q1 of $3,100 and a Q3 of $8,900. Thus,

IQR = Q3 - Q1 = 8900 - 3100 = 5800

Sunday, March 20, 2005

Creating histograms using the SQL function WIDTH_BUCKET

Histograms are one of the mostly commonly used statistical tools for studying data distributions. Statistician use histograms to visualize the shape of the data set and determine if there are gaps or clusters in the data set.

Histogram are actually just specialized bar charts. To construct a histogram, divide the data range into a number of equal width buckets and then count the number of data points in each bucket. The real art of creating a histogram is choosing the number of buckets. More on that later.

Below is a histogram that shows the distribution of employee monthly salaries of the employees in the HR table.



This histogram has 11 data range buckets. The histogram tells us that most of our employees have monthly salaries between $2,100 and $4,290. We have no employees with monthly salaries between $17,430 and $24,000. We have one outlier employee (probably the CEO) who has a monthly salary greater than or equal to $24,000.

So, let's assume we wanted to compute this histogram using Oracle SQL. Well, that's easy. Oracle provides a SQL function called WIDTH_BUCKET that tells us which bucket a data point belongs to.

For example, the following snippet of SQL divides the HR schema employee monthly salary data range from $2,100 to $24,000 into 10 equal width buckets and assigns each data point to one of those buckets.

select employee_id, salary, WIDTH_BUCKET(salary, 2100, 24000, 10) from employees

Thus, the width of each bucket will be (24,000 - 2,100)/10 = 2,190. The first bucket will range from $2,100 to $4,290. There are 46 employees that fall into this bucket. The data range for the buckets is inclusive at the low end and exclusive at the high end.

select count(1) from employees where salary >= 2100 and < 4590

Now, comes the fun part. Let's use WIDTH_BUCKET to compute the distribution of employee salaries.


select bucket, 2100+(bucket-1)*2190 as low, 2100+bucket*2190 as high , count(1) as cnt
from
(
select width_bucket(salary, 2100, 24000, 10) as bucket from employees
)
group by bucket order by bucket


BUCKET LOW HIGH CNT
---------- ---------- ---------- ----------
1 2100 4290 46
2 4290 6480 10
3 6480 8670 23
4 8670 10860 15
5 10860 13050 8
6 13050 15240 2
7 15240 17430 2
11 24000 26190 1


Wait a minute, there are 11 buckets in the results but we only asked for 10 buckets. What happened? Remember that the lower bucket bound is inclusive and the upper bucket bound is exclusive. We have an employee that make $24,000 per month. He can't be counted in the $21,810 to $24,000 bucket. We have to add $24,000-$26,190 bucket to accommodate him. Oracle handles this automatically and creates a 11th bucket even though we requested 10 buckets.

I mentioned earlier that the choice of the number buckets influences the effectiveness of the histogram. There is no single best way for choosing the number of buckets. I recommend the following two heuristics.


Number of buckets = 1 + 3.3 x log(N) for data sets which N >= 15


or


Number of buckets = (max value - min value) / (2 x IQR x n1/3)


N is the number of data points in the data set.
IQR is the interquartile range. i.e. (75th quartile value - 25th quartile value). The 25th and 75th quartile values can be compute using DBMS_STATS_FUNC.SUMMARY.

You can read "How Wide is Your Bin" for a good summary of different techniques for choosing the number of buckets. B.T.W., histogram buckets are also commonly referred to as bins by some people.

This website has a Java applet that let's you vary the number of histogram buckets to see the effects on an histogram. It's definitely worth exploring.

Saturday, March 19, 2005

On a different note: The Universe Is A Strange Place

Worth checking out, Frank Wilczeck's talk on strong force.

Monday, March 14, 2005

Computing Statistical Summaries using PL/SQL DBMS_STAT_FUNCS package

The PL/SQL DBMS_STAT_FUNCS package can be used to compute statistical summary of numerical column of a table.

The following PL/SQL code fragemnt summarizes employee salaries in the Oracle 10g HR schema.

DECLARE
sigma number := 3;
S DBMS_STAT_FUNCS.summaryType;
item number;
cnt number;
BEGIN
DBMS_STAT_FUNCS.SUMMARY ('hr', 'employees', 'salary', sigma, S);
DBMS_OUTPUT.PUT_LINE('COUNT = ' || S.count);
DBMS_OUTPUT.PUT_LINE('MIN = ' || S.min);
DBMS_OUTPUT.PUT_LINE('MAX = ' || S.max);
DBMS_OUTPUT.PUT_LINE('RANGE = ' || S.range);
DBMS_OUTPUT.PUT_LINE('VARIANCE = ' || S.variance);
DBMS_OUTPUT.PUT_LINE('STDDEV = ' || S.stddev);
DBMS_OUTPUT.PUT_LINE('5th QUANTILE = ' || S.quantile_5);
DBMS_OUTPUT.PUT_LINE('25th QUANTILE = ' || S.quantile_25);
DBMS_OUTPUT.PUT_LINE('MEDIAN = ' || S.median);
DBMS_OUTPUT.PUT_LINE('75th QUANTILE = ' || S.quantile_75);
DBMS_OUTPUT.PUT_LINE('95th QUANTILE = ' || S.quantile_95);
DBMS_OUTPUT.PUT_LINE('PLUS X SIGMA = ' || S.plus_x_sigma);
DBMS_OUTPUT.PUT_LINE('MINUS X SIGMA = ' || S.minus_x_sigma);

FOR item IN S.cmode.FIRST..S.cmode.LAST
LOOP
DBMS_OUTPUT.PUT_LINE('MODE [' || item || '] = ' || S.cmode(item));
END LOOP;

FOR item IN S.top_5_values.FIRST..S.top_5_values.LAST
LOOP
DBMS_OUTPUT.PUT_LINE('TOP ' || item || ' VALUE = ' || S.top_5_values(item));
END LOOP;

cnt := S.bottom_5_values.LAST;
FOR item IN S.bottom_5_values.FIRST..S.bottom_5_values.LAST
LOOP
DBMS_OUTPUT.PUT_LINE('BOTTOM ' || cnt || ' VALUE = ' || S.bottom_5_values(item));
cnt := cnt - 1;
END LOOP;

END;
/


The DBMS_STAT_FUNCS.SUMMARY() function's argument signature is:


DBMS_STAT_FUNCS.SUMMARY (
ownername IN VARCHAR2,
tablename IN VARCHAR2,
columnname IN VARCHAR2,
sigma_value IN NUMBER DEFAULT 3,
s OUT SummaryType);


SummaryType is an object defined by the DBMS_STAT_FUNCS package. It's structure is:

TYPE n_arr IS VARRAY(5) of NUMBER;
TYPE num_table IS TABLE of NUMBER;
TYPE summaryType IS RECORD (
count NUMBER,
min NUMBER,
max NUMBER,
range NUMBER,
mean NUMBER,
cmode num_table,
variance NUMBER,
stddev NUMBER,
quantile_5 NUMBER,
quantile_25 NUMBER,
median NUMBER,
quantile_75 NUMBER,
quantile_95 NUMBER,
plus_x_sigma NUMBER,
minus_x_sigma NUMBER,
extreme_values num_table,
top_5_values n_arr,
bottom_5_values n_arr);


The limitation of the DBMS_STAT_FUNCS.SUMMARY() function is that there is no way to subset the data. The SUMMARY() function computes the summary across all rows in the table for the specified column.

The DBMS_STAT_FUNCS package also provides exponentional, normal, poisson, uniform and weibull distribution fitting functions. We will explore those in the future.

Sunday, March 13, 2005

Computing Statistical Summaries using SQL Functions

We will be using the sample data sets distributed with Oracle 10g for today's exercise. In particular we will be using the HR schema which is a very simple human resources data set that tracks information such as employee names, salaries, hire dates, managers and departments.

Our task is simple. We are going to analyze the salaries of employees in the HR schema. Here is the structure of the employee table:

create table (

employee_id number(6) not null,
first_name varchar2(20),
last_name varchar2(25) not null,
email varchar2(25) not null,
phone_number varchar2(20),
hire_date date not null,
job_id varchar2(10) not null,
salary number(8,2),
commission_pct number(2,2),
manager_id number(6),
department_id number(4)
);

We can compute the minimum, maximum, count, median, mean, standard deviation and variance of salary across all employees simply by using Oracle SQL functions as show in the following SQL statement

select min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees

which results in the following summary:

MIN(SALARY) MAX(SALARY) COUNT(SALARY) MEDIAN(SALARY) STDDEV(SALARY) VARIANCE(SALARY)
2100 24000 107 6200 3909.36575
15283140.5

Now let's do something more interesting. Let's summarize by department id.

select department_id, min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees group by department_id


Of course, department ids are not very meaningful to us. We would rather have the data summarized by department name. We can create a nested query against thedepartments and employees tables to do this.

select department_name, min(salary), max(salary),
count(salary), median(salary), stddev(salary), variance(salary)
from
(select department_name, salary from employees, departments
where
departments.department_id= employees.department_id)
group by department_name
order by department_name


There is actually a much simpler way to compute the same summary using Oracle's analytical SQL extensions.

select distinct department_name,
min(salary) over (partition by employees.department_id) as min,
max(salary) over (partition by employees.department_id) as max,
count(salary) over (partition by employees.department_id) as count,
median(salary) over (partition by employees.department_id) as median,
stddev(salary) over (partition by employees.department_id) as stddev,
variance(salary) over (partition by employees.department_id) as variance
from
employees, departments
where
departments.department_id= employees.department_id
order by department_name


Let's take a closer look at this query. The PARTITION BY clause breaks the entire data set into a set of groups such that each group has the same department id. Effectively, we have grouped the data by department id.

The data set operated on by the PARTITION BY clause is defined by the WHERE clause of the query. The WHERE clause of the query will return all employees that have been assigned a department. (Note: Your HR data set may contain some entries for employees that do not belong to any department. These employees might be the executives).

The query summarizes salary across each of the groups using the PARTITION BY clause. Thus, the query summarizes salary by department id for each employee.


Some readers may be wondering why we used the DISTINCT clause. Without the distinct clause, the query would have returned a row for each employee with the data summarized by department id. The DISTINCT clause eliminates the duplicates and thus only returns the summaries for each department.

DEPARTMENT_NAME MIN MAX COUNT MEDIAN STDDEV VARIANCE
Accounting 8300 12000 2 10150 2616.29509 6845000
Administration 4400 4400 1 4400 0 0
Executive 17000 24000 3 17000 4041.45188 16333333.3
Finance 6900 12000 6 8000 1801.11077 3244000
Human Resources 6500 6500 1 6500 0 0
IT 4200 9000 5 4800 1925.61678 3708000
Marketing 6000 13000 2 9500 4949.74747 24500000
Public Relations 10000 10000 1 10000 0 0
Purchasing 2500 11000 6 2850 3362.58829 11307000
Sales 6100 14000 34 8900 2033.6847 4135873.44
Shipping 2100 8200 45 3100 1488.00592 2214161.62

Monday, March 07, 2005

What's to come and the HR data set

Over the next few months, I am going to explore the embedded statistical analysis capabilities of the Oracle database.

Each week, I will explore a new Oracle statistical function using the HR schema that is part Oracle database sample schemas.

From time to time, I may use my own dataset in which case, I will make that dataset downloadable.

Have fun and feel free to write to me with your comments.