Home arrow Oracle arrow Page 5 - Large Database Features In Oracle, conclusion

Analysis Functions - Oracle

When you are working with large databases with Oracle Database 10g, there are certain topics with which you need to be familiar. These topics are covered in this article, which is the second of two parts. It is excerpted from chapter nine of the book Oracle Database 10g A Beginner's Guide, written by Ian Abramson et al (McGraw-Hill/Osborne, 2004; ISBN: 0072230789).

TABLE OF CONTENTS:
  1. Large Database Features In Oracle, conclusion
  2. RAC Architecture
  3. Self-Managing Databases
  4. cube
  5. Analysis Functions
  6. Windowing Functions
  7. Project 9-2 Using Analytic SQL Functions and Models
By: McGraw-Hill/Osborne
Rating: starstarstarstarstar / 6
December 15, 2005

print this article
SEARCH DEV SHED

TOOLS YOU CAN USE

advertisement

Oracle Database 10g provides a number of ranking and statistical functions that previously would have required some pretty heavy SQL to perform, or an extract to a third-party application. In this section, we will look at the analysis function available and provide examples of their use where appropriate.


NOTE

Some of the functions in this section are based on complex statistical calculations. Donít be worried if you are unfamiliar with these concepts. It is more important for you to know that these functions exist than it is to understand the theory behind them.


Ranking Functions

Ranking functions provide the ability to rank a row of a query result relative to the other rows in the result set. Common examples of uses for these functions include identifying the top ten selling products for a period, or classifying or grouping a salespersonís commissions into one of four buckets. The ranking functions included in Oracle Database 10g are

  1. rank
  2. dense_rank
  3. cume_dist
  4. percent_rank
  5. ntile
  6. row_number

rank and dense_rank Functions  The simplest ranking functions are rank and dense_rank. These functions are very similar and determine the ordinal position of each row within the query result set. The difference between these two functions is that rank will leave a gap in the sequence when there is a tie for position, whereas dense_rank does not leave a gap. The results of the following listing illustrate the difference between the two.

select prod_id,
       sum(quantity_sold),
       rank () over (order by sum(quantity_sold) desc) as rank,
       dense_rank () over (order by sum(quantity_sold) desc) as
dense_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id;
PROD_ID SUM(QUANTITY_SOLD) RANK  DENSE_RANK
------ ---------------- ---- ---------
     24                 762   1          1 
     30                 627   2          2
    147                 578   3          3
    
33                 552   4          4
    
40                 550   5          5
   
133                 550   5          5
    
48                 541   7          6
   
120                 538   8          7
    
23                 535   9          8
   
119                 512  10          9
   
124                 503  11         10
   
140                 484  12         11
   
148                 472  13         12
   
139                 464  14         13
   
123                 459  15         14
   
131                 447  16         15
    
25                 420  17         16
   
135                 415  18         17
   
137                 407  19         18
   
146                 401  20         19

As you can see, the ordinal position 6 does not exist as a value for rank, but it does for dense_rank. If from this result set you wanted to see the top ten listings for prod_id, you would use the original query as a subquery, as in the following listing:

select * from
 
(select prod_id,
        
sum(quantity_sold),
        
rank () over (order by sum(quantity_sold) desc) as rank,
        
dense_rank () over (order by sum(quantity_sold) desc) as
dense_rank
  from sales
  where to_char(time_id, 'yyyy-mm') = '2001-06'
  group by prod_id)
where rank < 11;

To see the bottom ten prod_ids, use the same query, but change the order by option from descending (desc) to ascending (asc). The next two functions, cume_dist and percent_rank, are statistical in nature but still part of the family of ranking functions.

cume_dist and percent_rank Functions  The cume_dist function calculates the cumulative distribution of a value in a group of values. Since this is not a statistics beginnerís guide, we will not attempt to provide the theoretical background on cumulative distribution. However, we can offer these points:

  • The range of values returned by cume_dist is always between 0 and 1.
  • The value returned by cume_dist is always the same in the case of tie values in the group.
  • The formula for cumulative distribution is

# of rows with values = value of row being evaluated 
                     # of rows being evaluated

Looking at the query we used when discussing rank, the following listing calculates the cumulative distribution for quantity_sold. The results immediately follow the listing.

select prod_id,
   sum(quantity_sold),
   cume_dist () over (order by sum(quantity_sold) asc) as
cume_dist
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id
order by sum(quantity_sold) desc;
PROD_ID SUM(QUANTITY_SOLD)  CUME_DIST
------- ------------------ -----------
     24                762          1
     30                627 .985915493
    147                578 .971830986
     33                552 .957746479
     40                550 .943661972
    133                550 .943661972
     48                541 .915492958
    120                538 .901408451
    
23                535 .887323944
    119                512 .873239437
    124                503 .85915493
    140                484 .845070423
    148                472 .830985915
    139                464 .816901408
    123                459 .802816901
    131                447 .788732394
    
25                420 .774647887
    135                415 .76056338
    137                407 .746478873
    146                401 .732394366

The percent_rank function is similar to the cume_dist function, but calculates a percentage ranking of a value relative to its group. Again, without getting into the theory, we can make some points about percent_rank:

  • The range of values returned by the function is always between 0 and 1.
  • The row with a rank of 1 will have a percent rank of 0.
  • The formula for calculating percent rank is

rank of row within its group Ė 1
   # of rows in the group Ė 1

The next listing and its results demonstrate the percent_rank function using the base query we have been using in this section.

select prod_id,
       sum(quantity_sold),
       rank () over (order by sum(quantity_sold) desc) as rank,
       percent_rank ()
              
over (order by sum(quantity_sold) asc) as
percent_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id
order by sum(quantity_sold) desc;

PROD_ID

SUM(QUANTITY_SOLD)

RANK

PERCENT_RANK

--------------

----------------------

------

-------------------

24

762

1

1

30

627

2

.985714286

147

578

3

.971428571

33

552

4

.957142857

40

550

5

.928571429

133

550

5

.928571429

48

541

7

.914285714

120

538

8

.9

23

535

9

.885714286

119

512

10

.871428571

124

503

11

.857142857

140

484

12

.842857143

148

472

13

.828571429

139

464

14

.814285714

123

459

15

.8

131

447

16

.785714286

25

420

17

.771428571

135

415

18

.757142857

137

407

19

.742857143

146

401

20

.728571429

 

The ntile Function  The ntile function divides a result set into a number of buckets specified at query time by the user, and then assigns each row in the result set a bucket number. The most common number of buckets used are 3 (tertiles), 4 (quartiles), and 10 (deciles). Each bucket will have the same number of rows, except in the case when the number of rows does not divide evenly by the number of buckets. In this case, each of the leftover rows will be assigned to buckets with the lowest bucket numbers until all leftover rows are assigned. For example, if four buckets were specified and the number of rows in the result set was 98, buckets 1 and 2 would have 25 rows each and buckets 3 and 4 would have 24 rows each.

 

Letís look at an example. Using our base query of amount_sold in the SALES table, we want to look at amount_sold by product subcategory and rank the amounts into four buckets. Hereís the SQL:

select b.prod_subcategory,
      
sum(a.quantity_sold),
      
ntile(4) over (ORDER BY SUM(a.quantity_sold) desc) as
quartile
from sales a, products b
where a.prod_id = b.prod_id
and to_char(a.time_id, 'yyyy-mm') = '2001-06'
group by b.prod_subcategory;

As you can see in the following results, the number of product subcategories was not evenly divisible by the number of buckets specified (in this case, 4). Therefore, six subcategories were assigned to the first quartile (bucket 1) and five subcategories were assigned to the second, third, and fourth quartiles.

PROD_SUM  
SUBCATEGORY   (A.QUANTITY_SOLD)   QUARTILE   ROWNUMBER   
---------------------------------------
Accessories  323011 
Y Box Games  257212 
Recordable CDs  227813 
Camera Batteries  219214 
Recordable DVD Discs211515
Documentation193116
Modems/Fax131427
CD-ROM107628
Y Box Accessories105029 
Printer Supplies    956210 
Memory748211
Camera Media664312
Home Audio370313
Game Consoles  352314
Operating Systems343315
Bulk Pack Diskettes270316
Portable PCs215417
Desktop PCs214418
Camcorders196419
Monitors178420
Cameras173421



 
 
>>> More Oracle Articles          >>> More By McGraw-Hill/Osborne
 

blog comments powered by Disqus
escort Bursa Bursa escort Antalya eskort
   

ORACLE ARTICLES

- Oracle Java Security Woes Continue
- Oracle's New IaaS Cloud Option: There's a Ca...
- Oracle Acquires Eloqua to Boost Cloud Presen...
- Choosing Innovation: Oracle Survey Insights
- Oracle Fixes Privilege Escalation Bug
- Oracle`s Communications Service Availability...
- Oracle Releases Exalytics, Taleo Plans
- Oracle Releases Communications Network Integ...
- Oracle Releases Communications Data Model 11...
- Oracle Releases PeopleSoft PeopleTools 8.52
- Oracle Integrates Cloudera Apache Distro, My...
- Oracle Releases MySQL 5.5.18
- Oracle Announces NoSQL Database Availability
- Sorting Database Columns With the SELECT Sta...
- Retrieving Table Data with the LIKE Operator

Developer Shed Affiliates

 


Dev Shed Tutorial Topics: