Large Database Features In Oracle, conclusion - Analysis Functions (Page 5 of 7 )
Oracle Database 10g provides a number of ranking and statistical functions that previously would have required some pretty heavy SQL to perform, or an extract to a third-party application. In this section, we will look at the analysis function available and provide examples of their use where appropriate.
NOTE Some of the functions in this section are based on complex statistical calculations. Don’t be worried if you are unfamiliar with these concepts. It is more important for you to know that these functions exist than it is to understand the theory behind them.
Ranking Functions
Ranking functions provide the ability to rank a row of a query result relative to the other rows in the result set. Common examples of uses for these functions include identifying the top ten selling products for a period, or classifying or grouping a salesperson’s commissions into one of four buckets. The ranking functions included in Oracle Database 10g are
- rank
- dense_rank
- cume_dist
- percent_rank
- ntile
- row_number
rank and dense_rank Functions The simplest ranking functions are rank and dense_rank. These functions are very similar and determine the ordinal position of each row within the query result set. The difference between these two functions is that rank will leave a gap in the sequence when there is a tie for position, whereas dense_rank does not leave a gap. The results of the following listing illustrate the difference between the two.
select prod_id,
sum(quantity_sold),
rank () over (order by sum(quantity_sold) desc) as rank,
dense_rank () over (order by sum(quantity_sold) desc) as
dense_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id;
PROD_ID SUM(QUANTITY_SOLD) RANK DENSE_RANK
------ ---------------- ---- ---------
24 762 1 1
30 627 2 2
147 578 3 3
33 552 4 4
40 550 5 5
133 550 5 5
48 541 7 6
120 538 8 7
23 535 9 8
119 512 10 9
124 503 11 10
140 484 12 11
148 472 13 12
139 464 14 13
123 459 15 14
131 447 16 15
25 420 17 16
135 415 18 17
137 407 19 18
146 401 20 19
As you can see, the ordinal position 6 does not exist as a value for rank, but it does for dense_rank. If from this result set you wanted to see the top ten listings for prod_id, you would use the original query as a subquery, as in the following listing:
select * from
(select prod_id,
sum(quantity_sold),
rank () over (order by sum(quantity_sold) desc) as rank,
dense_rank () over (order by sum(quantity_sold) desc) as
dense_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id)
where rank < 11;
To see the bottom ten prod_ids, use the same query, but change the order by option from descending (desc) to ascending (asc). The next two functions, cume_dist and percent_rank, are statistical in nature but still part of the family of ranking functions.
cume_dist and percent_rank Functions The cume_dist function calculates the cumulative distribution of a value in a group of values. Since this is not a statistics beginner’s guide, we will not attempt to provide the theoretical background on cumulative distribution. However, we can offer these points:
- The range of values returned by cume_dist is always between 0 and 1.
- The value returned by cume_dist is always the same in the case of tie values in the group.
- The formula for cumulative distribution is
# of rows with values = value of row being evaluated
# of rows being evaluated
Looking at the query we used when discussing rank, the following listing calculates the cumulative distribution for quantity_sold. The results immediately follow the listing.
select prod_id,
sum(quantity_sold),
cume_dist () over (order by sum(quantity_sold) asc) as
cume_dist
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id
order by sum(quantity_sold) desc;
PROD_ID SUM(QUANTITY_SOLD) CUME_DIST
------- ------------------ -----------
24 762 1
30 627 .985915493
147 578 .971830986
33 552 .957746479
40 550 .943661972
133 550 .943661972
48 541 .915492958
120 538 .901408451
23 535 .887323944
119 512 .873239437
124 503 .85915493
140 484 .845070423
148 472 .830985915
139 464 .816901408
123 459 .802816901
131 447 .788732394
25 420 .774647887
135 415 .76056338
137 407 .746478873
146 401 .732394366
The percent_rank function is similar to the cume_dist function, but calculates a percentage ranking of a value relative to its group. Again, without getting into the theory, we can make some points about percent_rank:
- The range of values returned by the function is always between 0 and 1.
- The row with a rank of 1 will have a percent rank of 0.
- The formula for calculating percent rank is
rank of row within its group – 1
# of rows in the group – 1
The next listing and its results demonstrate the percent_rank function using the base query we have been using in this section.
select prod_id,
sum(quantity_sold),
rank () over (order by sum(quantity_sold) desc) as rank,
percent_rank ()
over (order by sum(quantity_sold) asc) as
percent_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id
order by sum(quantity_sold) desc;
PROD_ID | SUM(QUANTITY_SOLD) | RANK | PERCENT_RANK |
-------------- | ---------------------- | ------ | ------------------- |
24 | 762 | 1 | 1 |
30 | 627 | 2 | .985714286 |
147 | 578 | 3 | .971428571 |
33 | 552 | 4 | .957142857 |
40 | 550 | 5 | .928571429 |
133 | 550 | 5 | .928571429 |
48 | 541 | 7 | .914285714 |
120 | 538 | 8 | .9 |
23 | 535 | 9 | .885714286 |
119 | 512 | 10 | .871428571 |
124 | 503 | 11 | .857142857 |
140 | 484 | 12 | .842857143 |
148 | 472 | 13 | .828571429 |
139 | 464 | 14 | .814285714 |
123 | 459 | 15 | .8 |
131 | 447 | 16 | .785714286 |
25 | 420 | 17 | .771428571 |
135 | 415 | 18 | .757142857 |
137 | 407 | 19 | .742857143 |
146 | 401 | 20 | .728571429 |
The ntile Function The ntile function divides a result set into a number of buckets specified at query time by the user, and then assigns each row in the result set a bucket number. The most common number of buckets used are 3 (tertiles), 4 (quartiles), and 10 (deciles). Each bucket will have the same number of rows, except in the case when the number of rows does not divide evenly by the number of buckets. In this case, each of the leftover rows will be assigned to buckets with the lowest bucket numbers until all leftover rows are assigned. For example, if four buckets were specified and the number of rows in the result set was 98, buckets 1 and 2 would have 25 rows each and buckets 3 and 4 would have 24 rows each.
Let’s look at an example. Using our base query of amount_sold in the SALES table, we want to look at amount_sold by product subcategory and rank the amounts into four buckets. Here’s the SQL:
select b.prod_subcategory,
sum(a.quantity_sold),
ntile(4) over (ORDER BY SUM(a.quantity_sold) desc) as
quartile
from sales a, products b
where a.prod_id = b.prod_id
and to_char(a.time_id, 'yyyy-mm') = '2001-06'
group by b.prod_subcategory;
As you can see in the following results, the number of product subcategories was not evenly divisible by the number of buckets specified (in this case, 4). Therefore, six subcategories were assigned to the first quartile (bucket 1) and five subcategories were assigned to the second, third, and fourth quartiles.
| PROD_ | SUM | | |
| SUBCATEGORY | (A.QUANTITY_SOLD) | QUARTILE | ROWNUMBER | |
| -------------------- | --------- | ----- | ----- |
| Accessories | 3230 | 1 | 1 | |
| Y Box Games | 2572 | 1 | 2 | |
| Recordable CDs | 2278 | 1 | 3 | |
| Camera Batteries | 2192 | 1 | 4 | |
| Recordable DVD Discs | 2115 | 1 | 5 |
| Documentation | 1931 | 1 | 6 |
| Modems/Fax | 1314 | 2 | 7 |
| CD-ROM | 1076 | 2 | 8 |
| Y Box Accessories | 1050 | 2 | 9 | |
| Printer Supplies | 956 | 2 | 10 | |
| Memory | 748 | 2 | 11 |
| Camera Media | 664 | 3 | 12 |
| Home Audio | 370 | 3 | 13 |
| Game Consoles | 352 | 3 | 14 |
| Operating Systems | 343 | 3 | 15 |
| Bulk Pack Diskettes | 270 | 3 | 16 |
| Portable PCs | 215 | 4 | 17 |
| Desktop PCs | 214 | 4 | 18 |
| Camcorders | 196 | 4 | 19 |
| Monitors | 178 | 4 | 20 |
| Cameras | 173 | 4 | 21 |
Next: Windowing Functions >>
More Oracle Articles
More By McGraw-Hill/Osborne
|
This article is excerpted from chapter nine of the book Oracle Database 10g A Beginner's Guide, written by Ian Abramson et al (Mcgraw-Hill/Osborne, 2004; ISBN: 0072230789). Check it out at your favorite bookstore. Buy this book now.
|
|