HomeOracle Page 6 - Oracle and Availability: Illustrated Downtime Scenarios
Restarting Long-Running Transactions - Oracle
Ever have to recover your servers from a flooded basement? Move a database to an new platform and need the weekend to test it but users need the database today? We will illustrate various database problems and which piece of technology could be employed to prevent the outage or to recover from it quickly. (From the book Oracle Database 10g High Availability with RAC, Flashback & Data Guard by Matthew Hart and Scott Jesse, ISBN: 0072254289, McGraw-Hill/Osborne, 2004.)
In the event of some kind of instance outage, there will inevitably be transactions that will stop, and then have to be restarted manually. This can lead to significant shortcomings in time-sensitive reporting and processing.
At Horatio's Woodscrews, the developers that were sitting idle due to the Linux servers having no data decided to try and get a little work done by connecting to the production database to review some of the database structure. However, one of the developers had installed a beta version of the ODBC drivers. When this ODBC driver connected to the production database, it caused a heap corruption that resulted in an ORA-600. The ORA-600 caused the database to crash.
Just like that, all the long-running reports that had been generated by the Accounts Receivable teams began to receive the error ORA-3113 “End-of-File on Communication.” The DBA's pager started to beep, and beep, and beep. He was able to restart the database, but all of the long-running transactions had to be restarted.
Slow Crash Recovery
Even after the DBA restarted the database, it seemed to be hung at the startup command. There were no errors, but when he checked the alert log, he noted that the last line indicated the database was performing instance recovery (also known as crash recovery). This recovery must be done to recover any changes in the redo logs that have not been committed to the datafiles. Depending on the size of the redo logs, and the frequency of checkpoints, this can take a considerable amount of time. When there is a lot of pressure to make a database available as quickly as possible, waiting on a poorly tuned crash recovery session can be excruciating.
RAC Handles Instance Recovery from Surviving Nodes
With Real Application Clusters, should an instance that is part of the cluster crash, the instance recovery is handled immediately by one of the surviving instances in the cluster. We discuss this in Chapter 5. In addition, Oracle10g has, through Enterprise Manager, a new feature called the Redo Log Advisor, which will allow you to automatically tune redo logs so that checkpoints occur at a rate that allows for faster instance recovery in the event of a crash. This is discussed in Chapter 3.
Dealing with Block Corruption (ORA 1578)
At Horatio's Woodscrews, the system administrators finally installed the new network card. With the system down, they took the opportunity to install new drivers for their storage area network (SAN) device. Then they contacted the DBA to let him know he could start the database.
The DBA starts the database, and begins to run through the system checks to make sure all the applications will start up correctly. Then he initiates all the reports that failed because of the shutdown. Everything seems to be going smoothly, until he notices that one of the reports has an error:
As the DBA began to investigate the problem, the system administrators called him back: the new SAN driver is causing problems, and he should shut down the database immediately. But the damage has already been done—database corruption.
But there is only one corrupt block, in the WS_APP_DATA01.DBF file. Just one block out of thousands of perfectly good blocks. With datablock corruption, the solution to the problem is to restore the entire datafile from backup, and then apply archivelogs to recover the file. Because the datafile contains parts of the Woodscrew, woodscrew_orders, and woodscrew_inventory tables, these objects will be offline until the file can be restored and recovered. The brief outage for the hardware fix now has been extended to include the restore of a file from backup, and then the application of the archivelogs to that file. The tablespace will not be available until recovery completes.
Recovery Manager (RMAN) Provides Media Recovery of Data Blocks
When RMAN is utilized for backups, you can use those backups to restore a single block from the last good backup, and then perform media recovery on that block (apply archive log changes, if there are any). You can do it for a single block or for a list of corrupt blocks, and the tablespace stays online while you do recovery. In fact, the corrupt table stays online, too. We discuss this in Chapter 8.
This chapter is from Oracle Database 10g High Availability with RAC, Flashback & Data Guard, by Hart and Jesse. (McGraw-Hill/Osborne, 2004, ISBN: 0072254289). Check it out at your favorite bookstore today. Buy this book now.