Sunday, March 13, 2005

Computing Statistical Summaries using SQL Functions

We will be using the sample data sets distributed with Oracle 10g for today's exercise. In particular we will be using the HR schema which is a very simple human resources data set that tracks information such as employee names, salaries, hire dates, managers and departments.

Our task is simple. We are going to analyze the salaries of employees in the HR schema. Here is the structure of the employee table:

create table (

employee_id number(6) not null,
first_name varchar2(20),
last_name varchar2(25) not null,
email varchar2(25) not null,
phone_number varchar2(20),
hire_date date not null,
job_id varchar2(10) not null,
salary number(8,2),
commission_pct number(2,2),
manager_id number(6),
department_id number(4)
);

We can compute the minimum, maximum, count, median, mean, standard deviation and variance of salary across all employees simply by using Oracle SQL functions as show in the following SQL statement

select min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees

which results in the following summary:

MIN(SALARY) MAX(SALARY) COUNT(SALARY) MEDIAN(SALARY) STDDEV(SALARY) VARIANCE(SALARY)
2100 24000 107 6200 3909.36575
15283140.5

Now let's do something more interesting. Let's summarize by department id.

select department_id, min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees group by department_id


Of course, department ids are not very meaningful to us. We would rather have the data summarized by department name. We can create a nested query against thedepartments and employees tables to do this.

select department_name, min(salary), max(salary),
count(salary), median(salary), stddev(salary), variance(salary)
from
(select department_name, salary from employees, departments
where
departments.department_id= employees.department_id)
group by department_name
order by department_name


There is actually a much simpler way to compute the same summary using Oracle's analytical SQL extensions.

select distinct department_name,
min(salary) over (partition by employees.department_id) as min,
max(salary) over (partition by employees.department_id) as max,
count(salary) over (partition by employees.department_id) as count,
median(salary) over (partition by employees.department_id) as median,
stddev(salary) over (partition by employees.department_id) as stddev,
variance(salary) over (partition by employees.department_id) as variance
from
employees, departments
where
departments.department_id= employees.department_id
order by department_name


Let's take a closer look at this query. The PARTITION BY clause breaks the entire data set into a set of groups such that each group has the same department id. Effectively, we have grouped the data by department id.

The data set operated on by the PARTITION BY clause is defined by the WHERE clause of the query. The WHERE clause of the query will return all employees that have been assigned a department. (Note: Your HR data set may contain some entries for employees that do not belong to any department. These employees might be the executives).

The query summarizes salary across each of the groups using the PARTITION BY clause. Thus, the query summarizes salary by department id for each employee.


Some readers may be wondering why we used the DISTINCT clause. Without the distinct clause, the query would have returned a row for each employee with the data summarized by department id. The DISTINCT clause eliminates the duplicates and thus only returns the summaries for each department.

DEPARTMENT_NAME MIN MAX COUNT MEDIAN STDDEV VARIANCE
Accounting 8300 12000 2 10150 2616.29509 6845000
Administration 4400 4400 1 4400 0 0
Executive 17000 24000 3 17000 4041.45188 16333333.3
Finance 6900 12000 6 8000 1801.11077 3244000
Human Resources 6500 6500 1 6500 0 0
IT 4200 9000 5 4800 1925.61678 3708000
Marketing 6000 13000 2 9500 4949.74747 24500000
Public Relations 10000 10000 1 10000 0 0
Purchasing 2500 11000 6 2850 3362.58829 11307000
Sales 6100 14000 34 8900 2033.6847 4135873.44
Shipping 2100 8200 45 3100 1488.00592 2214161.62

0 Comments:

Post a Comment

<< Home