Computing Statistical Summaries using SQL Functions
We will be using the sample data sets distributed with Oracle 10g for today's exercise. In particular we will be using the HR schema which is a very simple human resources data set that tracks information such as employee names, salaries, hire dates, managers and departments.
Our task is simple. We are going to analyze the salaries of employees in the HR schema. Here is the structure of the employee table:
create table (
employee_id number(6) not null,
first_name varchar2(20),
last_name varchar2(25) not null,
email varchar2(25) not null,
phone_number varchar2(20),
hire_date date not null,
job_id varchar2(10) not null,
salary number(8,2),
commission_pct number(2,2),
manager_id number(6),
department_id number(4)
);
We can compute the minimum, maximum, count, median, mean, standard deviation and variance of salary across all employees simply by using Oracle SQL functions as show in the following SQL statement
select min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees
which results in the following summary:
Now let's do something more interesting. Let's summarize by department id.
select department_id, min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees group by department_id
Of course, department ids are not very meaningful to us. We would rather have the data summarized by department name. We can create a nested query against thedepartments and employees tables to do this.
select department_name, min(salary), max(salary),
count(salary), median(salary), stddev(salary), variance(salary)
from
(select department_name, salary from employees, departments
where
departments.department_id= employees.department_id)
group by department_name
order by department_name
There is actually a much simpler way to compute the same summary using Oracle's analytical SQL extensions.
select distinct department_name,
min(salary) over (partition by employees.department_id) as min,
max(salary) over (partition by employees.department_id) as max,
count(salary) over (partition by employees.department_id) as count,
median(salary) over (partition by employees.department_id) as median,
stddev(salary) over (partition by employees.department_id) as stddev,
variance(salary) over (partition by employees.department_id) as variance
from
employees, departments
where
departments.department_id= employees.department_id
order by department_name
Let's take a closer look at this query. The PARTITION BY clause breaks the entire data set into a set of groups such that each group has the same department id. Effectively, we have grouped the data by department id.
The data set operated on by the PARTITION BY clause is defined by the WHERE clause of the query. The WHERE clause of the query will return all employees that have been assigned a department. (Note: Your HR data set may contain some entries for employees that do not belong to any department. These employees might be the executives).
The query summarizes salary across each of the groups using the PARTITION BY clause. Thus, the query summarizes salary by department id for each employee.
Some readers may be wondering why we used the DISTINCT clause. Without the distinct clause, the query would have returned a row for each employee with the data summarized by department id. The DISTINCT clause eliminates the duplicates and thus only returns the summaries for each department.
Our task is simple. We are going to analyze the salaries of employees in the HR schema. Here is the structure of the employee table:
create table (
employee_id number(6) not null,
first_name varchar2(20),
last_name varchar2(25) not null,
email varchar2(25) not null,
phone_number varchar2(20),
hire_date date not null,
job_id varchar2(10) not null,
salary number(8,2),
commission_pct number(2,2),
manager_id number(6),
department_id number(4)
);
We can compute the minimum, maximum, count, median, mean, standard deviation and variance of salary across all employees simply by using Oracle SQL functions as show in the following SQL statement
select min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees
which results in the following summary:
MIN(SALARY) | MAX(SALARY) | COUNT(SALARY) | MEDIAN(SALARY) | STDDEV(SALARY) | VARIANCE(SALARY) |
---|---|---|---|---|---|
2100 | 24000 | 107 | 6200 | 3909.36575 | 15283140.5 |
Now let's do something more interesting. Let's summarize by department id.
select department_id, min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees group by department_id
Of course, department ids are not very meaningful to us. We would rather have the data summarized by department name. We can create a nested query against thedepartments and employees tables to do this.
select department_name, min(salary), max(salary),
count(salary), median(salary), stddev(salary), variance(salary)
from
(select department_name, salary from employees, departments
where
departments.department_id= employees.department_id)
group by department_name
order by department_name
There is actually a much simpler way to compute the same summary using Oracle's analytical SQL extensions.
select distinct department_name,
min(salary) over (partition by employees.department_id) as min,
max(salary) over (partition by employees.department_id) as max,
count(salary) over (partition by employees.department_id) as count,
median(salary) over (partition by employees.department_id) as median,
stddev(salary) over (partition by employees.department_id) as stddev,
variance(salary) over (partition by employees.department_id) as variance
from
employees, departments
where
departments.department_id= employees.department_id
order by department_name
Let's take a closer look at this query. The PARTITION BY clause breaks the entire data set into a set of groups such that each group has the same department id. Effectively, we have grouped the data by department id.
The data set operated on by the PARTITION BY clause is defined by the WHERE clause of the query. The WHERE clause of the query will return all employees that have been assigned a department. (Note: Your HR data set may contain some entries for employees that do not belong to any department. These employees might be the executives).
The query summarizes salary across each of the groups using the PARTITION BY clause. Thus, the query summarizes salary by department id for each employee.
Some readers may be wondering why we used the DISTINCT clause. Without the distinct clause, the query would have returned a row for each employee with the data summarized by department id. The DISTINCT clause eliminates the duplicates and thus only returns the summaries for each department.
DEPARTMENT_NAME | MIN | MAX | COUNT | MEDIAN | STDDEV | VARIANCE |
---|---|---|---|---|---|---|
Accounting | 8300 | 12000 | 2 | 10150 | 2616.29509 | 6845000 |
Administration | 4400 | 4400 | 1 | 4400 | 0 | 0 |
Executive | 17000 | 24000 | 3 | 17000 | 4041.45188 | 16333333.3 |
Finance | 6900 | 12000 | 6 | 8000 | 1801.11077 | 3244000 |
Human Resources | 6500 | 6500 | 1 | 6500 | 0 | 0 |
IT | 4200 | 9000 | 5 | 4800 | 1925.61678 | 3708000 |
Marketing | 6000 | 13000 | 2 | 9500 | 4949.74747 | 24500000 |
Public Relations | 10000 | 10000 | 1 | 10000 | 0 | 0 |
Purchasing | 2500 | 11000 | 6 | 2850 | 3362.58829 | 11307000 |
Sales | 6100 | 14000 | 34 | 8900 | 2033.6847 | 4135873.44 |
Shipping | 2100 | 8200 | 45 | 3100 | 1488.00592 | 2214161.62 |
0 Comments:
Post a Comment
<< Home