HomeOracle Page 5 - Large Database Features In Oracle, conclusion
Analysis Functions - Oracle
When you are working with large databases with Oracle Database 10g, there are certain topics with which you need to be familiar. These topics are covered in this article, which is the second of two parts. It is excerpted from chapter nine of the book Oracle Database 10g A Beginner's Guide, written by Ian Abramson et al (McGraw-Hill/Osborne, 2004; ISBN: 0072230789).
Oracle Database 10g provides a number of ranking and statistical functions that previously would have required some pretty heavy SQL to perform, or an extract to a third-party application. In this section, we will look at the analysis function available and provide examples of their use where appropriate.
NOTE
Some of the functions in this section are based on complex statistical calculations. Don’t be worried if you are unfamiliar with these concepts. It is more important for you to know that these functions exist than it is to understand the theory behind them.
Ranking Functions
Ranking functions provide the ability to rank a row of a query result relative to the other rows in the result set. Common examples of uses for these functions include identifying the top ten selling products for a period, or classifying or grouping a salesperson’s commissions into one of four buckets. The ranking functions included in Oracle Database 10g are
rank
dense_rank
cume_dist
percent_rank
ntile
row_number
rank and dense_rank Functions The simplest ranking functions are rank and dense_rank. These functions are very similar and determine the ordinal position of each row within the query result set. The difference between these two functions is that rank will leave a gap in the sequence when there is a tie for position, whereas dense_rank does not leave a gap. The results of the following listing illustrate the difference between the two.
As you can see, the ordinal position 6 does not exist as a value for rank, but it does for dense_rank. If from this result set you wanted to see the top ten listings for prod_id, you would use the original query as a subquery, as in the following listing:
select * from (select prod_id, sum(quantity_sold), rank () over (order by sum(quantity_sold) desc) as rank, dense_rank () over (order by sum(quantity_sold) desc) as dense_rank from sales where to_char(time_id, 'yyyy-mm') = '2001-06' group by prod_id) where rank < 11;
To see the bottom ten prod_ids, use the same query, but change the order by option from descending (desc) to ascending (asc). The next two functions, cume_dist and percent_rank, are statistical in nature but still part of the family of ranking functions.
cume_dist and percent_rank Functions The cume_dist function calculates the cumulative distribution of a value in a group of values. Since this is not a statistics beginner’s guide, we will not attempt to provide the theoretical background on cumulative distribution. However, we can offer these points:
The range of values returned by cume_dist is always between 0 and 1.
The value returned by cume_dist is always the same in the case of tie values in the group.
The formula for cumulative distribution is
# of rows with values = value of row being evaluated # of rows being evaluated
Looking at the query we used when discussing rank, the following listing calculates the cumulative distribution for quantity_sold. The results immediately follow the listing.
The percent_rank function is similar to the cume_dist function, but calculates a percentage ranking of a value relative to its group. Again, without getting into the theory, we can make some points about percent_rank:
The range of values returned by the function is always between 0 and 1.
The row with a rank of 1 will have a percent rank of 0.
The formula for calculating percent rank is
rank of row within its group – 1 # of rows in the group – 1
The next listing and its results demonstrate the percent_rank function using the base query we have been using in this section.
select prod_id, sum(quantity_sold), rank () over (order by sum(quantity_sold) desc) as rank, percent_rank () over (order by sum(quantity_sold) asc) as percent_rank from sales where to_char(time_id, 'yyyy-mm') = '2001-06' group by prod_id order by sum(quantity_sold) desc;
PROD_ID
SUM(QUANTITY_SOLD)
RANK
PERCENT_RANK
--------------
----------------------
------
-------------------
24
762
1
1
30
627
2
.985714286
147
578
3
.971428571
33
552
4
.957142857
40
550
5
.928571429
133
550
5
.928571429
48
541
7
.914285714
120
538
8
.9
23
535
9
.885714286
119
512
10
.871428571
124
503
11
.857142857
140
484
12
.842857143
148
472
13
.828571429
139
464
14
.814285714
123
459
15
.8
131
447
16
.785714286
25
420
17
.771428571
135
415
18
.757142857
137
407
19
.742857143
146
401
20
.728571429
The ntile Function The ntile function divides a result set into a number of buckets specified at query time by the user, and then assigns each row in the result set a bucket number. The most common number of buckets used are 3 (tertiles), 4 (quartiles), and 10 (deciles). Each bucket will have the same number of rows, except in the case when the number of rows does not divide evenly by the number of buckets. In this case, each of the leftover rows will be assigned to buckets with the lowest bucket numbers until all leftover rows are assigned. For example, if four buckets were specified and the number of rows in the result set was 98, buckets 1 and 2 would have 25 rows each and buckets 3 and 4 would have 24 rows each.
Let’s look at an example. Using our base query of amount_sold in the SALES table, we want to look at amount_sold by product subcategory and rank the amounts into four buckets. Here’s the SQL:
select b.prod_subcategory, sum(a.quantity_sold), ntile(4) over (ORDER BY SUM(a.quantity_sold) desc) as quartile from sales a, products b where a.prod_id = b.prod_id and to_char(a.time_id, 'yyyy-mm') = '2001-06' group by b.prod_subcategory;
As you can see in the following results, the number of product subcategories was not evenly divisible by the number of buckets specified (in this case, 4). Therefore, six subcategories were assigned to the first quartile (bucket 1) and five subcategories were assigned to the second, third, and fourth quartiles.