Quant Meditations

What's next in easy-to-use BI? Biggle

Interesting idea from Gartner:

What's next in easy-to-use business intelligence? Gartner has a concept it calls "Biggle" -- the intersection of BI and Google. The idea is that the data warehousing software will be so sophisticated that it understands when different people use different words to describe the same concepts or products. It creates an index of related information -- á là Google -- and dishes relevant results out in response to queries.

In computer science, they refer to this capability as non-obvious relationship awareness. "Nobody's doing this yet," says Gartner's Beyer. Judging from the speed of recent advances in business intelligence, though, it may not be long before companies add the term "Biggling" to their tech lexicon.

Web Databases & Applications

During the dot-com boom, I used an online database creation and publishing service called Bitlocker to create a bug tracker for my start-up. At the time, we had 20 people using our Bitlocker based bug tracker spread over the US and India.

Bitlocker was an unique product at the time. It allowed you to create a collaborative web-based application without a lot of programming. They provided templates for applications like CRM, bug tracking, recipe management, employee review tracking, etc. It was sort of like FileMaker Pro for the web for free!

While Bitlocker was a WYSIWYG application, it exposed the underlying schema to the end-user and so you did need to know a little bit about databases to program with it. Also, modifying the structure or the UI was not simple and led to all sorts of data inconsistency issues.

The service eventually failed for the same reason many other dot-com companies died. It wasn't profitable.

Today, I think there are two interesting alternatives to Bitlocker.

First, there is Oracle's HTML DB. HTML DB was originally Oracle's answer to Microsoft Access. But in reality, it is so much more. You can use it to build complex and scalable applications very quickly. It does require a fair amount programming ability (SQL & PL/SQL) to do anything complex. BTW., it is free if you own an Oracle database.

Next there is JotSpot. JotSpot is very cool and simple. It is a web-based custom Wiki builder. You don't have to really think about data when you start building a JotSpot based application. It is ideal for collaborative tracking applications that involve unstructured data. BTW, it requires no programming skills.

The Long Tail - Overcoming the 80/20 Rule

Venture Blog has a very interesting note on the 80/20 rule based on an article from Wired.

The basic premise is that the 80/20 rule is dictated by physical constraints. Those constraints are removed when we move to alternate models enabled by the web. For example, Barnes & Nobles carries around 130,000 books based on what they believe will be the top selling books. On the other hand, 50% of Amazon's revenue comes books ranked lower than the top 130,000 books. Amazon does not have the same inventory constraints that Barnes & Nobles has and thus can carry many more books. Most of these don't sell many copies but each small sale adds up. This is the long tail.

The same can apply to business intelligence. The trend today is to provide users with pre-built reports. There are only so many pre-built reports that can be provided economically by an IT organization. On the other hand, BI tools such as Oracle Discoverer have evolved so that they are simple to learn and use while at the same time can be delivered over the web and managed centrally. I believe that we are at a point, where IT organizations should really consider empowering their end-users with more capabilities and thus serving the long tail.

PL/SQL Market Basket Analysis

DBMS_FREQUENT_ITEMSET is the the PL/SQL package that powers Oracle Data Mining's Association Rules engine which is used for market basket analysis.

This article by Lucas Jellema nicely outlines how to use this function.

SQL Regular Expressions

If you are a Perl, Python or Java developer, you already know the power of regular expressions. The good news is that Oracle 10g supports regular expressions. The syntax is slightly different from what you may have encountered before but it is very easy to adapt if you are already familiar with regular expression syntax.

Here is a very simple example of extracting the domain name from an e-mail address.

select regexp_substr('sshah@whizdog.com', '@([[:alnum:]]+[_|-|.]*[[:alnum:]]*)') from dual

which results '@whizdog.com' being returned.

For more info, please read this excellent article.

Computing Interquartile Range

There are two ways of measuring the spread of a data distribution. Standard Deviation measures the spread of a data distribution from the mean and is thus influenced by outliers. Interquartile Range (IQR) measures the spread of a data distribution from the median and is thus not influenced by outliers.

When you look at boxplots, the box represents the IQR.

The formula for IQR is simple:

IQR = Q3 - Q1

where Q3 is the 75th Quantile and Q1 is the 25th Quantile. You can used the SUMMARY function provided by the PL/SQL package DBMS_STAT_FUNCS to compute Q3 and Q1. Unfortunately, the DBMS_STAT_FUNCS package does not provide a way to subset the data.

So, today we will use SQL to compute Q3 and Q1. First, divide the data set into a high and low group split at the median. Q3 is the median of the high group and Q1 is the median of the low group.

The only catch is that if the data set has an odd number of values, we have to include the median of the entire data set in the high and low groups.

I am going to use the Oracle sample HR data set to illustrate IQR computation.

Step 1: Compute the median of the data set and determine whether the number of elements in the data set is even or odd

select median(salary), mod(count(1), 2) as isOdd from employees

The median salary is $6,200 and isOdd is 1 which means that we have an odd number values in our data set. There are 107 employees in this data set.

Step 2: Compute Q1

select median(salary) as Q1 from employees where salary <= 6200

If we had an even number of data values, we would do the following:

select median(salary) as Q1 from employees where salary <>

Step 3: Compute Q3

select median(salary) as Q3 from employees where salary >= 6200

If we had an even number of data values, we would do the following:

select median(salary) as Q3 from employees where salary > 6200

Step 4: Compute IQR

Step 2 and Step 3 result in a Q1 of $3,100 and a Q3 of $8,900. Thus,

IQR = Q3 - Q1 = 8900 - 3100 = 5800

Creating histograms using the SQL function WIDTH_BUCKET

Histograms are one of the mostly commonly used statistical tools for studying data distributions. Statistician use histograms to visualize the shape of the data set and determine if there are gaps or clusters in the data set.

Histogram are actually just specialized bar charts. To construct a histogram, divide the data range into a number of equal width buckets and then count the number of data points in each bucket. The real art of creating a histogram is choosing the number of buckets. More on that later.

Below is a histogram that shows the distribution of employee monthly salaries of the employees in the HR table.

This histogram has 11 data range buckets. The histogram tells us that most of our employees have monthly salaries between $2,100 and $4,290. We have no employees with monthly salaries between $17,430 and $24,000. We have one outlier employee (probably the CEO) who has a monthly salary greater than or equal to $24,000.

So, let's assume we wanted to compute this histogram using Oracle SQL. Well, that's easy. Oracle provides a SQL function called WIDTH_BUCKET that tells us which bucket a data point belongs to.

For example, the following snippet of SQL divides the HR schema employee monthly salary data range from $2,100 to $24,000 into 10 equal width buckets and assigns each data point to one of those buckets.

select employee_id, salary, WIDTH_BUCKET(salary, 2100, 24000, 10) from employees

Thus, the width of each bucket will be (24,000 - 2,100)/10 = 2,190. The first bucket will range from $2,100 to $4,290. There are 46 employees that fall into this bucket. The data range for the buckets is inclusive at the low end and exclusive at the high end.

select count(1) from employees where salary >= 2100 and < 4590

Now, comes the fun part. Let's use WIDTH_BUCKET to compute the distribution of employee salaries.

select bucket, 2100+(bucket-1)*2190 as low, 2100+bucket*2190 as high , count(1) as cnt
from
(
select width_bucket(salary, 2100, 24000, 10) as bucket from employees
)
group by bucket order by bucket


BUCKET        LOW       HIGH        CNT
---------- ---------- ---------- ----------
1            2100       4290         46
2            4290       6480         10
3            6480       8670         23
4            8670      10860         15
5           10860      13050          8
6           13050      15240          2
7           15240      17430          2
11           24000      26190          1

Wait a minute, there are 11 buckets in the results but we only asked for 10 buckets. What happened? Remember that the lower bucket bound is inclusive and the upper bucket bound is exclusive. We have an employee that make $24,000 per month. He can't be counted in the $21,810 to $24,000 bucket. We have to add $24,000-$26,190 bucket to accommodate him. Oracle handles this automatically and creates a 11^th bucket even though we requested 10 buckets.

I mentioned earlier that the choice of the number buckets influences the effectiveness of the histogram. There is no single best way for choosing the number of buckets. I recommend the following two heuristics.

Number of buckets = 1 + 3.3 x log(N) for data sets which N >= 15

or

Number of buckets = (max value - min value) / (2 x IQR x n^1/3)

N is the number of data points in the data set.
IQR is the interquartile range. i.e. (75th quartile value - 25th quartile value). The 25th and 75th quartile values can be compute using DBMS_STATS_FUNC.SUMMARY.

You can read "How Wide is Your Bin" for a good summary of different techniques for choosing the number of buckets. B.T.W., histogram buckets are also commonly referred to as bins by some people.

This website has a Java applet that let's you vary the number of histogram buckets to see the effects on an histogram. It's definitely worth exploring.

On a different note: The Universe Is A Strange Place

Worth checking out, Frank Wilczeck's talk on strong force.

Computing Statistical Summaries using PL/SQL DBMS_STAT_FUNCS package

The PL/SQL DBMS_STAT_FUNCS package can be used to compute statistical summary of numerical column of a table.

The following PL/SQL code fragemnt summarizes employee salaries in the Oracle 10g HR schema.

DECLARE
sigma number := 3;
S DBMS_STAT_FUNCS.summaryType;
item number;
cnt number;
BEGIN
DBMS_STAT_FUNCS.SUMMARY ('hr', 'employees', 'salary', sigma, S);
DBMS_OUTPUT.PUT_LINE('COUNT = ' || S.count);
DBMS_OUTPUT.PUT_LINE('MIN = ' || S.min);
DBMS_OUTPUT.PUT_LINE('MAX = ' || S.max);
DBMS_OUTPUT.PUT_LINE('RANGE = ' || S.range);
DBMS_OUTPUT.PUT_LINE('VARIANCE = ' || S.variance);
DBMS_OUTPUT.PUT_LINE('STDDEV = ' || S.stddev);
DBMS_OUTPUT.PUT_LINE('5th QUANTILE = ' || S.quantile_5);
DBMS_OUTPUT.PUT_LINE('25th QUANTILE = ' || S.quantile_25);
DBMS_OUTPUT.PUT_LINE('MEDIAN = ' || S.median);
DBMS_OUTPUT.PUT_LINE('75th QUANTILE = ' || S.quantile_75);
DBMS_OUTPUT.PUT_LINE('95th QUANTILE = ' || S.quantile_95);
DBMS_OUTPUT.PUT_LINE('PLUS X SIGMA = ' || S.plus_x_sigma);
DBMS_OUTPUT.PUT_LINE('MINUS X SIGMA = ' || S.minus_x_sigma);

FOR item IN S.cmode.FIRST..S.cmode.LAST
LOOP
DBMS_OUTPUT.PUT_LINE('MODE [' || item || '] = ' || S.cmode(item));
END LOOP;

FOR item IN S.top_5_values.FIRST..S.top_5_values.LAST
LOOP
DBMS_OUTPUT.PUT_LINE('TOP ' || item || ' VALUE = ' || S.top_5_values(item));
END LOOP;

cnt := S.bottom_5_values.LAST;
FOR item IN S.bottom_5_values.FIRST..S.bottom_5_values.LAST
LOOP
DBMS_OUTPUT.PUT_LINE('BOTTOM ' || cnt || ' VALUE = ' || S.bottom_5_values(item));
cnt := cnt - 1;
END LOOP;

END;
/

The DBMS_STAT_FUNCS.SUMMARY() function's argument signature is:

DBMS_STAT_FUNCS.SUMMARY (
ownername IN VARCHAR2,
tablename IN VARCHAR2,
columnname IN VARCHAR2,
sigma_value IN NUMBER DEFAULT 3,
s OUT SummaryType);

SummaryType is an object defined by the DBMS_STAT_FUNCS package. It's structure is:

TYPE n_arr IS VARRAY(5) of NUMBER;
TYPE num_table IS TABLE of NUMBER;
TYPE summaryType IS RECORD (
count NUMBER,
min NUMBER,
max NUMBER,
range NUMBER,
mean NUMBER,
cmode num_table,
variance NUMBER,
stddev NUMBER,
quantile_5 NUMBER,
quantile_25 NUMBER,
median NUMBER,
quantile_75 NUMBER,
quantile_95 NUMBER,
plus_x_sigma NUMBER,
minus_x_sigma NUMBER,
extreme_values num_table,
top_5_values n_arr,
bottom_5_values n_arr);

The limitation of the DBMS_STAT_FUNCS.SUMMARY() function is that there is no way to subset the data. The SUMMARY() function computes the summary across all rows in the table for the specified column.

The DBMS_STAT_FUNCS package also provides exponentional, normal, poisson, uniform and weibull distribution fitting functions. We will explore those in the future.

Computing Statistical Summaries using SQL Functions

We will be using the sample data sets distributed with Oracle 10g for today's exercise. In particular we will be using the HR schema which is a very simple human resources data set that tracks information such as employee names, salaries, hire dates, managers and departments.

Our task is simple. We are going to analyze the salaries of employees in the HR schema. Here is the structure of the employee table:

create table (
employee_id number(6) not null,
first_name varchar2(20),
last_name varchar2(25) not null,
email varchar2(25) not null,
phone_number varchar2(20),
hire_date date not null,
job_id varchar2(10) not null,
salary number(8,2),
commission_pct number(2,2),
manager_id number(6),
department_id number(4)
);

We can compute the minimum, maximum, count, median, mean, standard deviation and variance of salary across all employees simply by using Oracle SQL functions as show in the following SQL statement

select min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees

which results in the following summary:

MIN(SALARY)	MAX(SALARY)	COUNT(SALARY)	MEDIAN(SALARY)	STDDEV(SALARY)	VARIANCE(SALARY)
2100	24000	107	6200	3909.36575	15283140.5

Now let's do something more interesting. Let's summarize by department id.

select department_id, min(salary), max(salary), count(salary),
median(salary), stddev(salary), variance(salary) from employees group by department_id

Of course, department ids are not very meaningful to us. We would rather have the data summarized by department name. We can create a nested query against thedepartments and employees tables to do this.

select department_name, min(salary), max(salary),
count(salary), median(salary), stddev(salary), variance(salary)
from
(select department_name, salary from employees, departments
where
departments.department_id= employees.department_id)
group by department_name
order by department_name

There is actually a much simpler way to compute the same summary using Oracle's analytical SQL extensions.

select distinct department_name,
min(salary) over (partition by employees.department_id) as min,
max(salary) over (partition by employees.department_id) as max,
count(salary) over (partition by employees.department_id) as count,
median(salary) over (partition by employees.department_id) as median,
stddev(salary) over (partition by employees.department_id) as stddev,
variance(salary) over (partition by employees.department_id) as variance
from
employees, departments
where
departments.department_id= employees.department_id
order by department_name

Let's take a closer look at this query. The PARTITION BY clause breaks the entire data set into a set of groups such that each group has the same department id. Effectively, we have grouped the data by department id.

The data set operated on by the PARTITION BY clause is defined by the WHERE clause of the query. The WHERE clause of the query will return all employees that have been assigned a department. (Note: Your HR data set may contain some entries for employees that do not belong to any department. These employees might be the executives).

The query summarizes salary across each of the groups using the PARTITION BY clause. Thus, the query summarizes salary by department id for each employee.

Some readers may be wondering why we used the DISTINCT clause. Without the distinct clause, the query would have returned a row for each employee with the data summarized by department id. The DISTINCT clause eliminates the duplicates and thus only returns the summaries for each department.

DEPARTMENT_NAME	MIN	MAX	COUNT	MEDIAN	STDDEV	VARIANCE
Accounting	8300	12000	2	10150	2616.29509	6845000
Administration	4400	4400	1	4400	0	0
Executive	17000	24000	3	17000	4041.45188	16333333.3
Finance	6900	12000	6	8000	1801.11077	3244000
Human Resources	6500	6500	1	6500	0	0
IT	4200	9000	5	4800	1925.61678	3708000
Marketing	6000	13000	2	9500	4949.74747	24500000
Public Relations	10000	10000	1	10000	0	0
Purchasing	2500	11000	6	2850	3362.58829	11307000
Sales	6100	14000	34	8900	2033.6847	4135873.44
Shipping	2100	8200	45	3100	1488.00592	2214161.62

What's to come and the HR data set

Over the next few months, I am going to explore the embedded statistical analysis capabilities of the Oracle database.

Each week, I will explore a new Oracle statistical function using the HR schema that is part Oracle database sample schemas.

From time to time, I may use my own dataset in which case, I will make that dataset downloadable.

Have fun and feel free to write to me with your comments.

Quant Meditations

Sunday, June 04, 2006

What's next in easy-to-use BI? Biggle

Wednesday, August 03, 2005

Web Databases & Applications

The Long Tail - Overcoming the 80/20 Rule

Friday, July 29, 2005

PL/SQL Market Basket Analysis

Sunday, April 17, 2005

SQL Regular Expressions

Tuesday, March 22, 2005

Computing Interquartile Range

Sunday, March 20, 2005

Creating histograms using the SQL function WIDTH_BUCKET

Saturday, March 19, 2005

On a different note: The Universe Is A Strange Place

Monday, March 14, 2005

Computing Statistical Summaries using PL/SQL DBMS_STAT_FUNCS package

Sunday, March 13, 2005

Computing Statistical Summaries using SQL Functions

Monday, March 07, 2005

What's to come and the HR data set

About Me

Previous Posts

Archives

ideas

links

reading

recommendations