Quant Meditations: Computing Statistical Summaries using SQL Functions

We will be using the sample data sets distributed with Oracle 10g for today's exercise. In particular we will be using the HR schema which is a very simple human resources data set that tracks information such as employee names, salaries, hire dates, managers and departments.

Our task is simple. We are going to analyze the salaries of employees in the HR schema. Here is the structure of the employee table:

create table (
employee_id number(6) not null,
first_name varchar2(20),
last_name varchar2(25) not null,
email varchar2(25) not null,
phone_number varchar2(20),
hire_date date not null,
job_id varchar2(10) not null,
salary number(8,2),
commission_pct number(2,2),
manager_id number(6),
department_id number(4)
);

We can compute the minimum, maximum, count, median, mean, standard deviation and variance of salary across all employees simply by using Oracle SQL functions as show in the following SQL statement

select min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees

which results in the following summary:

MIN(SALARY)	MAX(SALARY)	COUNT(SALARY)	MEDIAN(SALARY)	STDDEV(SALARY)	VARIANCE(SALARY)
2100	24000	107	6200	3909.36575	15283140.5

Now let's do something more interesting. Let's summarize by department id.

select department_id, min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees group by department_id

Of course, department ids are not very meaningful to us. We would rather have the data summarized by department name. We can create a nested query against thedepartments and employees tables to do this.

select department_name, min(salary), max(salary),
count(salary), median(salary), stddev(salary), variance(salary)
from
(select department_name, salary from employees, departments
where
departments.department_id= employees.department_id)
group by department_name
order by department_name

There is actually a much simpler way to compute the same summary using Oracle's analytical SQL extensions.

select distinct department_name,
min(salary) over (partition by employees.department_id) as min,
max(salary) over (partition by employees.department_id) as max,
count(salary) over (partition by employees.department_id) as count,
median(salary) over (partition by employees.department_id) as median,
stddev(salary) over (partition by employees.department_id) as stddev,
variance(salary) over (partition by employees.department_id) as variance
from
employees, departments
where
departments.department_id= employees.department_id
order by department_name

Let's take a closer look at this query. The PARTITION BY clause breaks the entire data set into a set of groups such that each group has the same department id. Effectively, we have grouped the data by department id.

The data set operated on by the PARTITION BY clause is defined by the WHERE clause of the query. The WHERE clause of the query will return all employees that have been assigned a department. (Note: Your HR data set may contain some entries for employees that do not belong to any department. These employees might be the executives).

The query summarizes salary across each of the groups using the PARTITION BY clause. Thus, the query summarizes salary by department id for each employee.

Some readers may be wondering why we used the DISTINCT clause. Without the distinct clause, the query would have returned a row for each employee with the data summarized by department id. The DISTINCT clause eliminates the duplicates and thus only returns the summaries for each department.

DEPARTMENT_NAME	MIN	MAX	COUNT	MEDIAN	STDDEV	VARIANCE
Accounting	8300	12000	2	10150	2616.29509	6845000
Administration	4400	4400	1	4400	0	0
Executive	17000	24000	3	17000	4041.45188	16333333.3
Finance	6900	12000	6	8000	1801.11077	3244000
Human Resources	6500	6500	1	6500	0	0
IT	4200	9000	5	4800	1925.61678	3708000
Marketing	6000	13000	2	9500	4949.74747	24500000
Public Relations	10000	10000	1	10000	0	0
Purchasing	2500	11000	6	2850	3362.58829	11307000
Sales	6100	14000	34	8900	2033.6847	4135873.44
Shipping	2100	8200	45	3100	1488.00592	2214161.62

Quant Meditations

Sunday, March 13, 2005

Computing Statistical Summaries using SQL Functions

0 Comments:

About Me

Previous Posts