Large Database Features In Oracle, conclusion

When you are working with large databases with Oracle Database 10g, there are certain topics with which you need to be familiar. These topics are covered in this article, which is the second of two parts. It is excerpted from chapter nine of the book Oracle Database 10g A Beginner’s Guide, written by Ian Abramson et al (McGraw-Hill/Osborne, 2004; ISBN: 0072230789).

Query Rewrite

Earlier, we stated that one of the benefits of using materialized views was that they are transparent to the users. But what exactly does that mean, and how can they be used if the users can’t see them? In fact, because materialized views are so much like tables, you can give the users access to materialized views, but generally this is not done.

FIGURE 9-8.  Summarization using materialized views

Instead, as indicated in Figure 9-8, the users always query the tables with the detail data—they don’t usually query the materialized views directly. This is because the query optimizer in Oracle Database 10g knows about the materialized views and their relationships to the detail tables and can rewrite the query on the fly to access the materialized views instead. This results in huge performance gains without the user having to do anything special—just query the detail data. There is a maintenance benefit of this feature for the user as well: the queries do not have to change to point to different summary tables, as is the case with the more traditional summarization approach.

In order for the query to be rewritten, the structure of the materialized view must satisfy the criteria of the query. The following two listings demonstrate the query rewrite process. Let’s assume we need to summarize the COMMISSION table we saw in the data compression section using the following query:

select prod_id, to_char(comm_date, 'YYYY-MM'), count(*), sum(comm_amt)
from commission
group by prod_id, to_char(comm_date, 'YYYY-MM');

Assume further that a materialized view (called comm_prod_mv) exists that contains summarized commission data by sales_rep_id, prod_id, and comm_date (full date). In this case, the query would be automatically rewritten as follows:

select prod_id, to_char(comm_date, 'YYYY-MM'), count(*), sum(comm_amt)
from comm_prod_mv
group by prod_id, to_char(comm_date, 'YYYY-MM');

By rewriting the query to use the materialized view instead, a large amount of data-crunching has been saved and the results will return much more quickly. Let’s now turn our attention to determining what materialized views should be created.
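You can verify that a rewrite has actually taken place by examining the execution plan. The following is a sketch using the example table and materialized view above; the exact plan operation names may vary by release:

```sql
-- Generate the execution plan for the summary query
explain plan for
select prod_id, to_char(comm_date, 'YYYY-MM'), count(*), sum(comm_amt)
from commission
group by prod_id, to_char(comm_date, 'YYYY-MM');

-- If query rewrite took place, COMM_PROD_MV should appear in the plan
-- (typically as a MAT_VIEW REWRITE ACCESS operation)
select operation, options, object_name
from plan_table;
```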

When to Create Materialized Views

At this point, you may be asking yourself: “How do I determine what materialized views to create and at what level of summarization?” Oracle Database 10g has some utilities to help. Collectively called the SQLAccess Advisor, these utilities recommend materialized views based on historical queries or on theoretical scenarios. They can be run from the Oracle Enterprise Manager (OEM) Grid Control or by calling the dbms_advisor package.
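As a sketch of the programmatic route, the dbms_advisor package’s quick_tune procedure can generate recommendations for a single statement. The task name here is our own invention:

```sql
-- Ask the SQLAccess Advisor for recommendations on one statement
begin
  dbms_advisor.quick_tune(
    dbms_advisor.sqlaccess_advisor,   -- which advisor to invoke
    'comm_task',                      -- task name (our choice)
    'select prod_id, sum(comm_amt) from commission group by prod_id');
end;
/

-- Review the resulting recommendations
select rec_id, rank, benefit
from user_advisor_recommendations
where task_name = 'comm_task';
```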

Create Materialized Views

Materialized views are created using a create materialized view statement, which is similar to a create table statement. This can be performed using SQL*Plus or OEM. The following listing shows a simple example of how to create the comm_prod_mv materialized view mentioned earlier, and Table 9-16 provides an explanation of the syntax.

1 create materialized view comm_prod_mv
2   tablespace comm_prod_mv_ts
3   storage (initial 50k next 50k)
4   refresh complete next sysdate + 7
5   enable query rewrite
6 as select sales_rep_id, prod_id, comm_date, count(*), sum(comm_amt)
7   from commission
8   group by sales_rep_id, prod_id, comm_date;

In the next three sections, we will be discussing some higher-level concepts: Real Application Clusters, Automatic Storage Management, and Grid Computing. But first, a progress check.

 

Lines   Important Points
2–3     Specify the tablespace and storage parameters.
4       Specifies how and when to refresh the data. In this case, the materialized view will be populated immediately and be completely refreshed every seven days thereafter.
5       Specifies that query rewrite is to be enabled.
6–8     Specify the query that will act as the source of the data.

TABLE 9-16.  Explanation of Materialized View Creation Syntax
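Although line 4 of the listing schedules an automatic complete refresh every seven days, a materialized view can also be refreshed on demand. A minimal sketch using the standard dbms_mview package:

```sql
-- Manually perform a complete refresh of the materialized view
begin
  dbms_mview.refresh('COMM_PROD_MV', method => 'C');  -- 'C' = complete refresh
end;
/
```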

———————————————————————-

Progress Check

  1. True or False: Tables with many foreign keys are good candidates for compression.
  2. Name the two processing components involved in Oracle Database 10g’s parallel processing.
  3. What is the function of the SQLAccess Advisor?
  4. True or False: In order to access the data in a materialized view, a user or application must query the materialized view directly?
  5. List the ways in which parallel processing can be invoked.
  6. In what situation can index key compression not be used on a unique index?

———————————————————-

CRITICAL SKILL 9.6

Real Application Clusters: A Primer

When working with large databases, issues such as database availability, performance and scalability are very important. In today’s 24/7 environments, it is not usually acceptable for a database to be unavailable for any length of time—even for planned maintenance or for coping with unexpected failures. Here’s where Oracle Database 10g’s Real Application Clusters (RAC) comes in.

Originally introduced in Oracle9i and only available with the Enterprise Edition, Real Application Clusters is a feature that allows database hardware and instances to be grouped together to act as one database using a shared-disk architecture. Following is a high-level discussion on RAC’s architecture.


Progress Check Answers

  1. True.
  2. The Parallel Execution Coordinator and the Parallel Execution Servers.
  3. The SQLAccess Advisor recommends potential materialized views based on historical or theoretical scenarios.
  4. False. While the end user or application can query the materialized view directly, usually the target of a query is the detail data and Oracle’s query rewrite capabilities will automatically return the results from the materialized view instead of the detail table (assuming the materialized view meets the query criteria).
  5. Parallel processing can be invoked based on the parallelism specified for a table at the time of its creation, or by providing the parallel hint in a select query.
  6. If the unique index has only one attribute, key compression cannot be used.

 


RAC Architecture

A typical RAC implementation groups multiple Oracle database instances running on multiple machines (or nodes). These nodes communicate with each other and share a common pool of disks. The disks house all of the data files that comprise the database. This architecture is illustrated in Figure 9-9.

Let’s look at a few of the key components in Figure 9-9:

  • The nodes are connected to each other using high-bandwidth communications.
  • Each node has its own set of redo log files.
  • The Cluster Manager monitors the status of each database instance in the cluster and enables communication between the instances. This manager is known as the Cluster Synchronization Services (CSS). In previous versions of Oracle, it was just referred to as the “Cluster Manager.” When RAC was first introduced, cluster management was performed by third-party software, which is still supported by Oracle Database 10g.
  • The Global Cache Service controls data exchange between nodes using the Cache Fusion technology. Cache Fusion synchronizes the memory cache in each node using high-speed communications. This allows any node to access any data in the database.
  • Shared storage consists of data and index files, as well as control files.

FIGURE 9-9.  Real Application Clusters conceptual architecture

This architecture makes RAC systems highly available. For example, if Node 2 in Figure 9-9 fails or requires maintenance, the remaining nodes will keep the database available.

This activity is transparent to the user or application and as long as at least one node is active, all data is available. RAC architecture also allows near-linear scalability and offers increased performance benefits. New nodes can easily be added to the cluster when needed to boost performance.

Administering and maintaining data files on both RAC and single-node systems have always required a good deal of effort, especially when data partitioning is involved. Oracle’s solution to reducing this effort is Automatic Storage Management, which we will discuss next.

CRITICAL SKILL 9.7

Automatic Storage Management: Another Primer

In previous versions of Oracle and with most other databases, management of data files for large databases consumes a good portion of the DBA’s time and effort. The number of data files in large databases can easily be in the hundreds or even thousands. The DBA must coordinate and provide names for these files and then optimize the storage location of files on the disks. The new Automatic Storage Management (ASM) feature in Oracle Database 10g Enterprise Edition addresses these issues.

ASM simplifies the management of disks and data files by creating logical groupings of disks into disk groups. The DBA need only refer to the groups, not the underlying data files. Data files are automatically named and distributed evenly (striped) throughout the disks in the group for optimal throughput. As disks are added or removed from the disk group, ASM redistributes the files among the available disks, automatically, while the database is still running. ASM can also mirror data for redundancy.
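As a sketch of what defining a disk group looks like, the following creates a two-way mirrored (normal redundancy) group; the disk paths are illustrative and platform-dependent:

```sql
-- Group two raw disks into one ASM disk group with mirroring;
-- ASM will stripe data files evenly across the member disks
create diskgroup dgroup1 normal redundancy
  disk '/dev/raw/raw1',
       '/dev/raw/raw2';
```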

ASM Architecture

When ASM is implemented, each node in the database (clustered or nonclustered) has an ASM instance and a database instance, with a communication link between them. The ASM instance manages and works with the disk groups, and the database instance works with the data files. Figure 9-10 illustrates this architecture.

The following points address the key components in Figure 9-10:

  • The ASM instance manages the disk groups and ASM files. The names of the ASM files in each disk group are system generated. The ASM instance maintains a map of data files and disks.
  • The database instance accesses the ASM files directly, not the disk groups. In order to do this, the database instance must communicate with the ASM instance to request the names and locations of ASM data files required.
  • A database can contain both ASM and non-ASM files and the database instance can access either type.
FIGURE 9-10.  Automatic Storage Management

Ask the Expert

Q: After ASM disk groups are defined, how are they associated with a table?

A: ASM disk groups are referred to during tablespace creation, as in the following example:

1 create tablespace ts1
2   datafile '+diskgrp1/alias1';

This listing creates tablespace ts1 in disk group diskgrp1. Note that this assumes that both diskgrp1 and alias1 have previously been defined.

A table can now be created in tablespace ts1 and it will use ASM data files. While ASM can be implemented in a single-node environment, its real power and benefits are realized when used in RAC environments. This powerful
combination is the heart of Oracle Database 10g’s grid computing database architecture.

CRITICAL SKILL 9.8

Grid Computing: The “g” in Oracle Database 10g

In this chapter, we have discussed many issues and demands surrounding large databases—performance, maintenance efforts, and so on. We have also discussed the solutions offered by Oracle Database 10g. Now we will have a high-level look at Oracle Database 10g’s grid-computing capabilities.


NOTE

Oracle Database 10g’s grid computing applies to both database and application layers. We will just be scratching the surface of grid computing and focusing on the database components.


The theory behind grid computing is that all parts of the grid (databases, applications, servers, disks, and so forth) work together in a highly integrated fashion, with each component being able to react appropriately to changes in other components in the grid. This results in efficient use of resources, faster response times, high availability, and so on.

The database components of Oracle Database 10g’s grid computing provide an infrastructure that brings together multiple servers (RAC) with shared storage (ASM), all tied together using the Oracle Enterprise Manager (OEM) Grid Control. Table 9-17 shows how grid computing addresses some of the issues facing large databases.

In order for grid computing to work, a software infrastructure layer needs to be in place to control the hardware and processes, as well as provide some automated maintenance capabilities. This is mainly the function of the OEM and is a core component of the grid. In addition, the Oracle Database 10g has the capability to perform some self-management.

Self-Managing Databases

One of the new features of Oracle Database 10g is its self-managing capabilities. The components of this feature are as follows:

  • Automatic Workload Repository   Automatically collects workload and performance statistics to be used for all self-management activities.

 

Large Database Issue               Grid Components   Benefits
Data volumes                       ASM               Reduced maintenance efforts; improved performance through data striping
Number of users and applications   RAC               Multiple nodes available to service requests; improved performance (workload balancing)
Complexity of requests             RAC               Improved performance (workload balancing)
                                   ASM               Improved performance (data striping)
Database availability              RAC               Multiple nodes available to service requests
Hardware/software costs            RAC and ASM       Use of lower-cost servers and generic disks
Resource management                OEM               Reduced maintenance efforts (aware of all grid components)

TABLE 9-17.  Large Database Issues Addressed by Grid Computing

  • Unified Scheduler  Used to schedule routine administration tasks, as specified by the administrator or based on the information in the Automatic Workload Repository.
  • Space Manageability  Includes automatic undo management, Oracle-managed files, free space management, and multiple block sizes.
  • Backup and Recovery Manageability  Used to create reusable backup configurations, automate management of backups and logs, and ensure restart within a specified time after system failure. Includes Recovery Manager and Mean Time to Recovery.
  • Automatic Database Diagnostic Monitor (ADDM)  Used to diagnose performance bottlenecks.
  • Server-Generated Alerts  Alerts are sent when the database encounters or anticipates a problem.
  • Database Resource Manager  Used to prioritize work within the database, ensuring that high-priority users or jobs get the resources they need, when they need them.
  • Advisors  Used to analyze objects and recommend actions. Advisors include:
      • Space Advisors  Segment Advisor and Undo Advisor
      • Backup and Recovery Advisors  Logfile Size Advisor and MTTR Advisor
      • SQLAccess Advisor
      • SQL Tuning Advisor
      • Memory Advisors  Shared Pool Advisor, Buffer Cache Advisor, and PGA Advisor

Grid computing has the potential to offer tremendous benefits in the areas of availability, performance and maintenance, all while reducing costs. Grid computing will be the last in our discussions on design and architectural issues. We’ll take a progress check and then explore some of the analytical aspects offered by Oracle Database 10g when working with large databases.

———————————————————————-

Progress Check

  1. In a RAC system, what component connects the nodes to the shared storage?
  2. When a disk is added to an ASM disk group, what happens to the existing data in the group?
  3. True or False: A database instance in an ASM system accesses the disk groups?
  4. What component of a self-managing database contains workload and performance statistics used for self-management activities?
  5. What are the database-related components that are part of grid computing?
  6. What is the function of the Cluster Manager in RAC systems?

     

———————————————————-

CRITICAL SKILL 9.9

Use SQL Aggregate and Analysis Functions

Once your database has been loaded with data, your users or applications will, of course, want to use that data to run queries, perform analysis, produce reports, extract data, and so forth. Oracle Database 10g provides many sophisticated aggregation and analysis functions that can help ease the pain sometimes associated with analyzing data in large databases.


Progress Check Answers

  1. The Global Cache Service (or Cache Fusion) connects the nodes to the shared storage.
  2. The existing data is automatically redistributed among all disks in the disk group.
  3. False. The database instance communicates with the ASM instance to determine which ASM files to access directly. Only the ASM instance works with the disk groups.
  4. The Automatic Workload Repository contains workload and performance statistics used for self-management activities.
  5. RAC, ASM, and OEM are the database components that are part of grid computing.
  6. The Cluster Manager monitors the status of each database instance in the cluster and enables communication between the instances. 

 


Aggregation Functions

Oracle Database 10g provides extensions to the standard SQL group by clause of the select statement that generate additional totals as part of the result set. Producing these totals previously required multiple queries, nested subqueries, or importing into spreadsheet-type applications. These extensions are rollup and cube.

rollup

The rollup extension generates subtotals for attributes specified in the group by clause, plus another row representing the grand total. The following is an example of the rollup extension, using the SALES table we have seen throughout this chapter:

select c.cust_gender gender,
       b.channel_class channel_class,
       to_char(a.time_id, 'yyyy-mm') month,
       count(*) unit_count,
       sum(a.amount_sold) amount_sold
from sales a, channels b, customers c
where a.channel_id = b.channel_id
and   a.cust_id = c.cust_id
and   to_char(a.time_id, 'yyyy-mm') between '2001-01' and '2001-02'
group by rollup(c.cust_gender,
                b.channel_class,
                to_char(a.time_id, 'yyyy-mm'));

GENDER CHANNEL_CLASS MONTH   UNIT_COUNT AMOUNT_SOLD
------ ------------- ------- ---------- -----------
F      Direct        2001-01       4001    387000.9
F      Direct        2001-02       3208   365860.13
F      Direct                      7209   752861.03
F      Others        2001-01       2486    242615.9
F      Others        2001-02       2056   229633.52
F      Others                      4542   472249.42
F      Indirect      2001-01       1053   138395.21
F      Indirect      2001-02       1470   189425.88
F      Indirect                    2523   327821.09
F                                 14274  1552931.54
M      Direct        2001-01       7038   719146.28
M      Direct        2001-02       6180   641192.61
M      Direct                     13218  1360338.89
M      Others        2001-01       4310   414603.03
M      Others        2001-02       3751   391792.61
M      Others                      8061   806395.64
M      Indirect      2001-01       1851   211947.81
M      Indirect      2001-02       2520   285219.79
M      Indirect                    4371    497167.6
M                                 25650  2663902.13
                                  39924  4216833.67

In the results, we can see that counts and sums of amount_sold are returned at the following levels:

  1. By GENDER, CHANNEL_CLASS, and MONTH
  2. Subtotals by CHANNEL_CLASS within GENDER
  3. Subtotals by GENDER
  4. Grand total
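If only some of these subtotal levels are of interest, the related grouping sets extension lets you request exactly the combinations you need instead of the full rollup. A sketch based on the same tables:

```sql
-- Request only: channel-within-gender subtotals, gender subtotals,
-- and the grand total (no monthly detail rows)
select c.cust_gender gender,
       b.channel_class channel_class,
       count(*) unit_count,
       sum(a.amount_sold) amount_sold
from sales a, channels b, customers c
where a.channel_id = b.channel_id
and   a.cust_id = c.cust_id
group by grouping sets ((c.cust_gender, b.channel_class),
                        (c.cust_gender),
                        ());
```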

cube

The cube extension takes rollup a step further by generating subtotals for each combination of the group by attributes, totals by attribute, and the grand total. The following is an example of the cube extension, using the same query we used for rollup:

select c.cust_gender gender,
       b.channel_class channel_class,
       to_char(a.time_id, 'yyyy-mm') month,
       count(*) unit_count,
       sum(a.amount_sold) amount_sold
from sales a, channels b, customers c
where a.channel_id = b.channel_id
and   a.cust_id = c.cust_id
and   to_char(a.time_id, 'yyyy-mm') between '2001-01' and '2001-02'
group by cube(c.cust_gender,
              b.channel_class,
              to_char(a.time_id, 'yyyy-mm'));

GENDER CHANNEL_CLASS MONTH   UNIT_COUNT AMOUNT_SOLD
------ ------------- ------- ---------- -----------
                                 39924  4216833.67
                     2001-01     20739  2113709.13
                     2001-02     19185  2103124.54
       Direct                    20427  2113199.92
       Direct        2001-01     11039  1106147.18
       Direct        2001-02      9388  1007052.74
       Others                    12603  1278645.06
       Others        2001-01      6796   657218.93
       Others        2001-02      5807   621426.13
       Indirect                   6894   824988.69
       Indirect      2001-01      2904   350343.02
       Indirect      2001-02      3990   474645.67
F                                14274  1552931.54
F                    2001-01      7540   768012.01
F                    2001-02      6734   784919.53
F      Direct                     7209   752861.03
F      Direct        2001-01      4001    387000.9
F      Direct        2001-02      3208   365860.13
F      Others                     4542   472249.42
F      Others        2001-01      2486    242615.9
F      Others        2001-02      2056   229633.52
F      Indirect                   2523   327821.09
F      Indirect      2001-01      1053   138395.21
F      Indirect      2001-02      1470   189425.88
M                                25650  2663902.13
M                    2001-01     13199  1345697.12
M                    2001-02     12451  1318205.01
M      Direct                    13218  1360338.89
M      Direct        2001-01      7038   719146.28
M      Direct        2001-02      6180   641192.61
M      Others                     8061   806395.64
M      Others        2001-01      4310   414603.03
M      Others        2001-02      3751   391792.61
M      Indirect                   4371    497167.6
M      Indirect      2001-01      1851   211947.81
M      Indirect      2001-02      2520   285219.79

In the results, we can see that counts and sums of amount_sold are returned at the following levels:

  • By GENDER, CHANNEL_CLASS, and MONTH
  • Subtotals by MONTH within CHANNEL_CLASS
  • Subtotals by MONTH within GENDER
  • Subtotals by CHANNEL_CLASS within GENDER
  • Subtotals by MONTH
  • Subtotals by CHANNEL_CLASS
  • Subtotals by GENDER
  • Grand total
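In both rollup and cube results, the generated subtotal rows contain NULLs in the rolled-up columns, which can be hard to distinguish from real NULLs in the data. The grouping function returns 1 for a column that was rolled up in a given row and 0 otherwise; a sketch:

```sql
-- Flag which rows are generated subtotals rather than detail rows
select c.cust_gender gender,
       b.channel_class channel_class,
       grouping(c.cust_gender) gender_rolled_up,      -- 1 on gender subtotals
       grouping(b.channel_class) channel_rolled_up,   -- 1 on channel subtotals
       sum(a.amount_sold) amount_sold
from sales a, channels b, customers c
where a.channel_id = b.channel_id
and   a.cust_id = c.cust_id
group by cube(c.cust_gender, b.channel_class);
```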

     

Analysis Functions

Oracle Database 10g provides a number of ranking and statistical functions that previously would have required some pretty heavy SQL to perform, or an extract to a third-party application. In this section, we will look at the analysis functions available and provide examples of their use where appropriate.


NOTE

Some of the functions in this section are based on complex statistical calculations. Don’t be worried if you are unfamiliar with these concepts. It is more important for you to know that these functions exist than it is to understand the theory behind them.


Ranking Functions

Ranking functions provide the ability to rank a row of a query result relative to the other rows in the result set. Common examples of uses for these functions include identifying the top ten selling products for a period, or classifying or grouping a salesperson’s commissions into one of four buckets. The ranking functions included in Oracle Database 10g are

  1. rank
  2. dense_rank
  3. cume_dist
  4. percent_rank
  5. ntile
  6. row_number

rank and dense_rank Functions  The simplest ranking functions are rank and dense_rank. These functions are very similar and determine the ordinal position of each row within the query result set. The difference between these two functions is that rank will leave a gap in the sequence when there is a tie for position, whereas dense_rank does not leave a gap. The results of the following listing illustrate the difference between the two.

select prod_id,
       sum(quantity_sold),
       rank () over (order by sum(quantity_sold) desc) as rank,
       dense_rank () over (order by sum(quantity_sold) desc) as dense_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id;

PROD_ID SUM(QUANTITY_SOLD) RANK DENSE_RANK
------- ------------------ ---- ----------
     24                762    1          1
     30                627    2          2
    147                578    3          3
     33                552    4          4
     40                550    5          5
    133                550    5          5
     48                541    7          6
    120                538    8          7
     23                535    9          8
    119                512   10          9
    124                503   11         10
    140                484   12         11
    148                472   13         12
    139                464   14         13
    123                459   15         14
    131                447   16         15
     25                420   17         16
    135                415   18         17
    137                407   19         18
    146                401   20         19

As you can see, the ordinal position 6 does not exist as a value for rank, but it does for dense_rank. If from this result set you wanted to see the top ten listings for prod_id, you would use the original query as a subquery, as in the following listing:

select * from
  (select prod_id,
          sum(quantity_sold),
          rank () over (order by sum(quantity_sold) desc) as rank,
          dense_rank () over (order by sum(quantity_sold) desc) as dense_rank
   from sales
   where to_char(time_id, 'yyyy-mm') = '2001-06'
   group by prod_id)
where rank < 11;

To see the bottom ten prod_ids, use the same query, but change the order by option from descending (desc) to ascending (asc). The next two functions, cume_dist and percent_rank, are statistical in nature but still part of the family of ranking functions.
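As a sketch, the bottom-ten version of the query mentioned above would look like this:

```sql
-- Same query, but ranking ascending picks out the lowest sellers
select * from
  (select prod_id,
          sum(quantity_sold),
          rank () over (order by sum(quantity_sold) asc) as rank
   from sales
   where to_char(time_id, 'yyyy-mm') = '2001-06'
   group by prod_id)
where rank < 11;
```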

cume_dist and percent_rank Functions  The cume_dist function calculates the cumulative distribution of a value in a group of values. Since this is not a statistics beginner’s guide, we will not attempt to provide the theoretical background on cumulative distribution. However, we can offer these points:

  • The range of values returned by cume_dist is always between 0 and 1.
  • The value returned by cume_dist is always the same in the case of tie values in the group.
  • The formula for cumulative distribution is

# of rows with values <= value of row being evaluated
            # of rows being evaluated

Looking at the query we used when discussing rank, the following listing calculates the cumulative distribution for quantity_sold. The results immediately follow the listing.

select prod_id,
       sum(quantity_sold),
       cume_dist () over (order by sum(quantity_sold) asc) as cume_dist
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id
order by sum(quantity_sold) desc;

PROD_ID SUM(QUANTITY_SOLD)  CUME_DIST
------- ------------------ ----------
     24                762          1
     30                627 .985915493
    147                578 .971830986
     33                552 .957746479
     40                550 .943661972
    133                550 .943661972
     48                541 .915492958
    120                538 .901408451
     23                535 .887323944
    119                512 .873239437
    124                503  .85915493
    140                484 .845070423
    148                472 .830985915
    139                464 .816901408
    123                459 .802816901
    131                447 .788732394
     25                420 .774647887
    135                415  .76056338
    137                407 .746478873
    146                401 .732394366

The percent_rank function is similar to the cume_dist function, but calculates a percentage ranking of a value relative to its group. Again, without getting into the theory, we can make some points about percent_rank:

  • The range of values returned by the function is always between 0 and 1.
  • The row with a rank of 1 will have a percent rank of 0.
  • The formula for calculating percent rank is

rank of row within its group – 1
   # of rows in the group – 1

The next listing and its results demonstrate the percent_rank function using the base query we have been using in this section.

select prod_id,
       sum(quantity_sold),
       rank () over (order by sum(quantity_sold) desc) as rank,
       percent_rank () over (order by sum(quantity_sold) asc) as percent_rank
from sales
where to_char(time_id, 'yyyy-mm') = '2001-06'
group by prod_id
order by sum(quantity_sold) desc;

PROD_ID SUM(QUANTITY_SOLD) RANK PERCENT_RANK
------- ------------------ ---- ------------
     24                762    1            1
     30                627    2   .985714286
    147                578    3   .971428571
     33                552    4   .957142857
     40                550    5   .928571429
    133                550    5   .928571429
     48                541    7   .914285714
    120                538    8           .9
     23                535    9   .885714286
    119                512   10   .871428571
    124                503   11   .857142857
    140                484   12   .842857143
    148                472   13   .828571429
    139                464   14   .814285714
    123                459   15           .8
    131                447   16   .785714286
     25                420   17   .771428571
    135                415   18   .757142857
    137                407   19   .742857143
    146                401   20   .728571429

 

The ntile Function  The ntile function divides a result set into a number of buckets specified at query time by the user, and then assigns each row in the result set a bucket number. The most common bucket counts are 3 (tertiles), 4 (quartiles), and 10 (deciles). Each bucket will have the same number of rows, except when the number of rows does not divide evenly by the number of buckets. In this case, each of the leftover rows is assigned to the buckets with the lowest bucket numbers until all leftover rows are assigned. For example, if four buckets were specified and the result set contained 98 rows, buckets 1 and 2 would have 25 rows each and buckets 3 and 4 would have 24 rows each.

 

Let’s look at an example. Using our base query of amount_sold in the SALES table, we want to look at amount_sold by product subcategory and rank the amounts into four buckets. Here’s the SQL:

select b.prod_subcategory,
       sum(a.quantity_sold),
       ntile(4) over (order by sum(a.quantity_sold) desc) as quartile
from sales a, products b
where a.prod_id = b.prod_id
and   to_char(a.time_id, 'yyyy-mm') = '2001-06'
group by b.prod_subcategory;

As you can see in the following results, the number of product subcategories was not evenly divisible by the number of buckets specified (in this case, 4). Therefore, six subcategories were assigned to the first quartile (bucket 1) and five subcategories were assigned to the second, third, and fourth quartiles.

PROD_SUBCATEGORY     SUM(A.QUANTITY_SOLD) QUARTILE ROWNUMBER
-------------------- -------------------- -------- ---------
Accessories                          3230        1         1
Y Box Games                          2572        1         2
Recordable CDs                       2278        1         3
Camera Batteries                     2192        1         4
Recordable DVD Discs                 2115        1         5
Documentation                        1931        1         6
Modems/Fax                           1314        2         7
CD-ROM                               1076        2         8
Y Box Accessories                    1050        2         9
Printer Supplies                      956        2        10
Memory                                748        2        11
Camera Media                          664        3        12
Home Audio                            370        3        13
Game Consoles                         352        3        14
Operating Systems                     343        3        15
Bulk Pack Diskettes                   270        3        16
Portable PCs                          215        4        17
Desktop PCs                           214        4        18
Camcorders                            196        4        19
Monitors                              178        4        20
Cameras                               173        4        21

Windowing Functions

Before we get into the details, we need to define a couple of terms: analytic partitioning and analytic window.

  1. Analytic partitioning is the division of the results of an analytic function into groups within which the analytic function operates. This is accomplished using the partition by clause of the analytic function. Do not confuse this partitioning with data partitioning discussed earlier in this chapter. Analytic partitioning can be used with any of the analytic functions we have discussed so far.
  2. An analytic window is a subset of an analytic partition in which the values of each row depend on the values of other rows in the window. There are two types of windows: physical and logical. A physical window is defined by a specified number of rows. A logical window is defined by the order by values.

Windowing functions can only be used in the select and order by clauses. They can be used to calculate the following:

  1. Moving sum
  2. Moving average
  3. Moving min/max
  4. Cumulative sum
  5. Statistical functions

Let’s look at an example of a moving sum function. The following shows the listing and results for calculating the moving sum from our SALES table by product category for a six-month period:

select b.prod_category,
       to_char(a.time_id, 'yyyy-mm'),
       sum(a.quantity_sold),
       sum(sum(a.quantity_sold)) over (partition by b.prod_category
                                       order by to_char(a.time_id, 'yyyy-mm')
                                       rows unbounded preceding) as cume_sum
from sales a, products b
where a.prod_id = b.prod_id
and b.prod_category_id between 202 and 204
and to_char(a.time_id, 'yyyy-mm') between '2001-01' and '2001-06'
group by b.prod_category, to_char(a.time_id, 'yyyy-mm')
order by b.prod_category, to_char(a.time_id, 'yyyy-mm');

 

PROD_CATEGORY                TO_CHAR  SUM(A.QUANTITY_SOLD)  CUME_SUM
---------------------------  -------  --------------------  --------
Hardware                     2001-01                   281       281
Hardware                     2001-02                   306       587
Hardware                     2001-03                   442      1029
Hardware                     2001-04                   439      1468
Hardware                     2001-05                   413      1881
Hardware                     2001-06                   429      2310
Peripherals and Accessories  2001-01                  5439      5439
Peripherals and Accessories  2001-02                  5984     11423
Peripherals and Accessories  2001-03                  5104     16527
Peripherals and Accessories  2001-04                  5619     22146
Peripherals and Accessories  2001-05                  4955     27101
Peripherals and Accessories  2001-06                  5486     32587
Photo                        2001-01                  2802      2802
Photo                        2001-02                  2220      5022
Photo                        2001-03                  2982      8004
Photo                        2001-04                  2824     10828
Photo                        2001-05                  2359     13187
Photo                        2001-06                  3225     16412

As you can see in the results, the moving sum is contained within each product category and resets when a new product category starts. Windowing analysis used to require third-party products such as spreadsheet applications; having the capability to perform these calculations right in the database can streamline analysis and report generation.
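The partition-and-reset behavior can be emulated procedurally. This Python sketch computes a cumulative sum that restarts per category, mirroring sum(...) over (partition by ... order by ... rows unbounded preceding) on an abbreviated subset of the rows above:

```python
from itertools import accumulate, groupby

# (category, month, quantity) rows, already in the order the query's
# ORDER BY would return them; data abbreviated from the results above.
rows = [("Hardware", "2001-01", 281), ("Hardware", "2001-02", 306),
        ("Hardware", "2001-03", 442), ("Photo", "2001-01", 2802),
        ("Photo", "2001-02", 2220)]

result = []
for category, group in groupby(rows, key=lambda r: r[0]):
    group = list(group)
    # accumulate() restarts for each category, mirroring PARTITION BY
    for row, cume in zip(group, accumulate(r[2] for r in group)):
        result.append((row[0], row[1], row[2], cume))

for r in result:
    print(r)
```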

Other Functions

There are many other functions included with Oracle Database 10g that can be used to analyze data in large databases. While we will not be going into any detail for these, they are listed here for completeness:

  1. Statistical functions, including:

       •  Linear regression functions
       •  Descriptive statistics functions
       •  Hypothetical testing and crosstab statistics functions (contained in a new PL/SQL package called dbms_statistics)

  2. first/last functions
  3. lag/lead functions
  4. Reporting aggregate functions
  5. Inverse percentile functions
  6. Hypothetical rank and distribution functions

As we stated earlier, we as database administrators do not have to know the theory behind the functions provided by Oracle Database 10g, or even how to use their results. However, we should be able to let our users know what capabilities are available. Knowing this, our users will be able to take advantage of these functions and construct efficient queries. In the next section, we will be discussing a new feature in Oracle Database 10g—SQL models.

CRITICAL SKILL 9.10

Create SQL Models

One of the more powerful data analysis features introduced in Oracle Database 10g is SQL models. SQL models allow a user to create multidimensional arrays from query results. Formulas, both simple and complex, can then be applied to the arrays to generate results in which the user is interested. SQL models allow inter-row calculations to be applied without doing expensive self-joins.

SQL models are similar to other multidimensional structures used in business intelligence applications. However, because they are part of the database, they can take advantage of Oracle Database 10g’s built-in features of scalability, manageability, security, and so on. In addition, using SQL models, there is no need to transfer large amounts of data to external business intelligence applications.

A SQL model is defined by the model extension of the select statement. Columns of a query result are classified into one of three groups:

  1. Partitioning  This is the same as the analytic partitioning we defined in the Windowing Functions section.
  2. Dimensions  These are the attributes used to describe or fully qualify a measure within a partition. Examples could include product, sales rep id, and phone call type.
  3. Measures  These are the numeric (usually) values to which calculations are applied. Examples could include quantity sold, commission amount, and call duration.

One of the main applications of SQL models is projecting or forecasting measures based on existing measures. Let’s look at an example of the model clause to illustrate. The listing and its results show an aggregate query using the SALES table:

select c.channel_desc, p.prod_category, t.calendar_year year,
       sum(s.quantity_sold) quantity_sold
from sales s, products p, channels c, times t
where s.prod_id = p.prod_id
and   s.channel_id = c.channel_id
and   s.time_id = t.time_id
and   c.channel_desc = 'Direct Sales'
group by c.channel_desc, p.prod_category, t.calendar_year
order by c.channel_desc, p.prod_category, t.calendar_year;

CHANNEL_DESC  PROD_CATEGORY                YEAR  QUANTITY_SOLD
------------  ---------------------------  ----  -------------
Direct Sales  Electronics                  1998           7758
Direct Sales  Electronics                  1999          15007
...
Direct Sales  Hardware                     2000           1970
Direct Sales  Hardware                     2001           2399
Direct Sales  Peripherals and Accessories  1998          44258
...
Direct Sales  Software/Other               2000          64483
Direct Sales  Software/Other               2001          49146

In the results, we see the historical aggregate quantity_sold for each year by product category for the ‘Direct Sales’ channel. We can use the model clause to project the quantity_sold. In the following listing, we will project values for 2002 for the product category Hardware in the ‘Direct Sales’ channel. The quantity_sold will be based on the previous year’s value (2001), plus 10 percent. Table 9-18 explains the syntax of the listing.

 1  select channel_desc, prod_category, year, quantity_sold
 2  from
 3  (select c.channel_desc, p.prod_category, t.calendar_year year,
 4          sum(s.quantity_sold) quantity_sold
 5  from sales s, products p, channels c, times t
 6  where s.prod_id = p.prod_id
 7  and   s.channel_id = c.channel_id
 8  and   s.time_id = t.time_id
 9  group by c.channel_desc, p.prod_category, t.calendar_year) sales
10  where channel_desc = 'Direct Sales'
11  model
12    partition by (channel_desc)
13    dimension by (prod_category, year)
14    measures (quantity_sold)
15    rules (quantity_sold['Hardware', 2002]
16                    = quantity_sold['Hardware', 2001] * 1.10)
17  order by channel_desc, prod_category, year;

 

Lines   Important Points
-----   ----------------
3-9     Define an in-line select that will be the source for the query. It is basically the same query we started with before the model clause.
10      Specifies that we are only going to look at 'Direct Sales' channels.
11      Specifies the model clause.
12      Specifies the partition by clause (in this case, channel_desc).
13      Specifies the dimension by clause (here, prod_category and year). These elements will fully qualify the measure within the channel_desc partition.
14      Specifies the measures clause (quantity_sold).
15-16   Specify the rules clause, that is, the calculations we want to perform on the measure. In this example, we are referring to a specific cell of quantity_sold, described by the dimensions prod_category (Hardware) and year (2002).

TABLE 9-18.  Explanation of model Clause Syntax

 

Following are the results of the previous query. Notice that a new row has been added for Hardware in 2002. Its quantity_sold is 2638.9, which is the previous year’s value (2399) plus 10 percent.

CHANNEL_DESC  PROD_CATEGORY                YEAR  QUANTITY_SOLD
------------  ---------------------------  ----  -------------
Direct Sales  Electronics                  1998           7758
Direct Sales  Electronics                  1999          15007
...
Direct Sales  Hardware                     2000           1970
Direct Sales  Hardware                     2001           2399
Direct Sales  Hardware                     2002         2638.9
Direct Sales  Peripherals and Accessories  1998          44258
...
Direct Sales  Software/Other               2000          64483
Direct Sales  Software/Other               2001          49146

The model clause has many variations and allows for very powerful calculations. Let’s point out some of the characteristics and/or features you should be aware of. Supported functionality includes the following:

  • Looping (for example, FOR loops)
  • Recursive calculations
  • Regression calculations
  • Nested cell references
  • Dimension wildcards and ranges

Note that the model clause does not update any base table, although in theory you could create a table or materialized view from the results of a query using the model clause.

Restrictions include the following:

  1. The rules clause cannot include any analytic SQL or windowing functions.
  2. A maximum of 20,000 rules may be specified. This may seem like plenty, but a FOR loop is expanded into many single-cell rules at execution time.

Project 9-2  Using Analytic SQL Functions and Models

Once all the database structures have been put in place and the data has been loaded, the users will want to analyze it. Knowing what functions are available is important, and so, at least to some extent, is knowing how to use them. So, in this project we will walk through a more complex analytical example that includes using the lag function and creating a SQL model.

Step by Step

  1. Create a view of the SALES table using the following listing. (The SALES table should have been created during your Oracle installation process.) This view will calculate the percentage change (called percent_chng) of quantity_sold from one year to the next using the lag function, summarized by prod_category, channel_desc, and calendar_year.

    create or replace view sales_trends
    as
    select p.prod_category, c.channel_desc, t.calendar_year year,
           sum(s.quantity_sold) quantity_sold,
    round((sum(s.quantity_sold) -
                 lag(sum(s.quantity_sold),1)
                     over (partition by p.prod_category, c.channel_desc
                      order by t.calendar_year)) /
                 lag(sum(s.quantity_sold),1)
                      over (partition by p.prod_category, c.channel_desc
                       order by t.calendar_year) *
                 100 ,2) as percent_chng 
    from sales s, products p, channels c, times t
    where s.prod_id = p.prod_id
    and   s.channel_id = c.channel_id
    and   s.time_id = t.time_id
    group by p.prod_category, c.channel_desc, t.calendar_year;

  2. Select from the sales_trends view using the following listing. Notice that quantity_sold and percent_chng reset after each channel_desc. This is a result of the lag function’s partition by clause in the view definition. 

    select prod_category, channel_desc, year, quantity_sold, percent_chng
    from sales_trends
    where prod_category = 'Electronics'
    order by prod_category, channel_desc, year;
  3. Select from the sales_trends view using the following listing that contains a model clause. In this query, we are projecting quantity_sold and percent_chng according to the following rules:

    a.  Filter the prod_category to only select ‘Electronics’.

    b.  Project for years 2002 to 2006 inclusive.

    c.  The projected quantity_sold is calculated as the previous year’s value plus the average percent_chng over the previous three years.

    d.  The projected percent_chng is the average percent_chng over the previous three years.

    select prod_category, channel_desc, year, quantity_sold, percent_chng
    from sales_trends
    where prod_category = 'Electronics'
    model
      partition by (prod_category, channel_desc)
      dimension by (year)
      measures (quantity_sold, percent_chng)
      rules (
        percent_chng[for year from 2002 to 2006 increment 1] =
          round(avg(percent_chng)[year between currentv()-3 and
            currentv()-1], 2),
        quantity_sold[for year from 2002 to 2006 increment 1] =
          round(quantity_sold[currentv()-1] *
            (1 + (round(avg(percent_chng)[year between currentv()-3 and
                   currentv()-1], 2) / 100))))
    order by prod_category, channel_desc, year;

  4. Notice the projected values for 2002 to 2006 for each channel_desc.
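The projection rules in step 3 can be emulated procedurally. The following Python sketch uses made-up seed values rather than the actual SALES data, simply to show how each projected year feeds into the next:

```python
def project(quantity, percent_chng, first, last):
    """Project quantity and percent change for years first..last.
    Each year's percent change is the (rounded) average over the three
    preceding years; quantity compounds the prior year by that rate."""
    for year in range(first, last + 1):
        avg_chng = round(sum(percent_chng[y]
                             for y in range(year - 3, year)) / 3, 2)
        percent_chng[year] = avg_chng
        quantity[year] = round(quantity[year - 1] * (1 + avg_chng / 100))
    return quantity, percent_chng

# Made-up seed history (not taken from the SALES table)
qty = {2001: 1000}
chng = {1999: 10.0, 2000: 20.0, 2001: 30.0}
project(qty, chng, 2002, 2006)
print(qty[2002], chng[2002])  # → 1200 20.0
```

Because each projected percent_chng becomes part of the three-year average for later years, early projections compound into all subsequent ones, just as the model's FOR-loop rules do.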

Project Summary

The steps in this project build on the discussions we’ve had on Oracle Database 10g’s analytic capabilities. We used the lag function to calculate percentage change and used the model clause of the select statement to project sales five years into the future based on past trends. By going to the next level of examples, we can start to appreciate the significance of these functions and how they can be used.

Oracle Database 10g’s analytic functions provide powerful and efficient analysis capabilities that would otherwise require complex SQL and/or third-party tools. All of these functions are part of the core database—ready and waiting to be exploited by your users.

So, now we have come to the end of our discussions on large database features. A great deal of material has been presented in this chapter and we have really only seen the tip of the iceberg! However, you can feel confident that with this background information, you are primed to tackle almost any large database environment out there.

Chapter 9 Mastery Check

  1. Which of the following is not a valid combination for composite partitioning?

    A.  Range partitioning with hash partitioning

    B.  List partitioning with hash partitioning

    C.  Range partitioning with list partitioning
  2. What data population methods can be used on a compressed table that result in the data being compressed?
  3. _____________ partitioned indexes are defined independently of the data partitions, and _____________ partitioned indexes have a one-to-one relationship with the data partitions.
  4. Explain the functions of the Parallel Execution Coordinator in parallel processing.
  5. For which of the following SQL commands is parallel processing not enabled by default?

    A.  select

    B.  insert

    C.  create

    D.  alter
  6. What is meant by “degree of parallelism”?
  7. What new analytic feature can be used to forecast values based on existing measures?
  8. What two methods can be used to run the SQLAccess Advisor utilities for materialized views?
  9. Which of the following are components of an RAC system? (Select all that apply.)

    A.  Parallel Execution Server

    B.  Cluster Manager

    C.  Shared Storage

    D.  ASM Instance

    E.  Global Cache Service
  10. In a grid computing environment, the _____________ instance of a node manages the disk groups and maintains a map of data files and disks.
  11. _____________ partitioning allows you to control the placement of records in specified partitions based on sets of partition key values.
  12. What analytic function would you use to assign each row from a query to one of five buckets?
  13. What are the database components involved in grid computing?
  14. What are the two types of windows that can be used in analytic functions?
  15. The _____________ automatically collects workload and performance statistics used for database self-management activities.
  16. When creating partitioned tables, what option should you use to ensure that rows are redistributed to their correct partition if their partition key is updated?

     
