Wednesday, September 17, 2014

Ignoring outliers in aggregate function

This is another aide-memoire for myself really.  I want to calculate the average load times per page for an application from timings stored in the database, and see which pages need attention. However, the stats can be skewed by the odd exceptional load that takes much longer than a typical load for reasons that are probably irrelevant to me.

Here is a fictitious example:

create table timings (id int, timing number);

insert into timings
select rownum, case when rownum=50 then 1000 else 1 end
from dual
connect by rownum <= 100;

This example has 99 timings of 1 second plus an oddity of 1000 seconds.

A simple average gives a skewed picture:

SQL> select avg(timing) from timings;

AVG(TIMING)
-----------
      10.99

It suggests that users are waiting 11 seconds on average for a page to load, when in fact it is usually 1 second.

The analytic function NTILE(n) can solve this.  This divides the set of results into n "buckets" and then tells us which bucket a particular value falls into.  If we do that with a suitable number of buckets, say 10, we will be able to exlude the highest 10% and lowest 10% of the values:

SQL> select avg(timing) from 
  2  (select timing, ntile(10) over(order by timing) bucket
  3   from timings)
  4  where bucket between 2 and 9;

AVG(TIMING)
-----------
          1


No comments: