Oracle and Availability: Illustrated Downtime Scenarios - Restarting Long-Running Transactions
(Page 6 of 9 )
In the event of some kind of instance outage, there will inevitably be transactions that will stop, and then have to be restarted manually. This can lead to significant shortcomings in time-sensitive reporting and processing.
At Horatio's Woodscrews, the developers that were sitting idle due to the Linux servers having no data decided to try and get a little work done by connecting to the production database to review some of the database structure. However, one of the developers had installed a beta version of the ODBC drivers. When this ODBC driver connected to the production database, it caused a heap corruption that resulted in an ORA-600. The ORA-600 caused the database to crash.
Just like that, all the long-running reports that had been generated by the Accounts Receivable teams began to receive the error ORA-3113 “End-of-File on Communication.” The DBA's pager started to beep, and beep, and beep. He was able to restart the database, but all of the long-running transactions had to be restarted.
Slow Crash Recovery
Even after the DBA restarted the database, it seemed to be hung at the startup command. There were no errors, but when he checked the alert log, he noted that the last line indicated the database was performing instance recovery (also known as crash recovery). This recovery must be done to recover any changes in the redo logs that have not been committed to the datafiles. Depending on the size of the redo logs, and the frequency of checkpoints, this can take a considerable amount of time. When there is a lot of pressure to make a database available as quickly as possible, waiting on a poorly tuned crash recovery session can be excruciating.
RAC Handles Instance Recovery from Surviving Nodes With Real Application Clusters, should an instance that is part of the cluster crash, the instance recovery is handled immediately by one of the surviving instances in the cluster. We discuss this in Chapter 5. In addition, Oracle10g has, through Enterprise Manager, a new feature called the Redo Log Advisor, which will allow you to automatically tune redo logs so that checkpoints occur at a rate that allows for faster instance recovery in the event of a crash. This is discussed in Chapter 3. |
Dealing with Block Corruption (ORA 1578)
At Horatio's Woodscrews, the system administrators finally installed the new network card. With the system down, they took the opportunity to install new drivers for their storage area network (SAN) device. Then they contacted the DBA to let him know he could start the database.
The DBA starts the database, and begins to run through the system checks to make sure all the applications will start up correctly. Then he initiates all the reports that failed because of the shutdown. Everything seems to be going smoothly, until he notices that one of the reports has an error:
ORA-1578: ORACLE data block corrupted (file # 121, block # 68)
As the DBA began to investigate the problem, the system administrators called him back: the new SAN driver is causing problems, and he should shut down the database immediately. But the damage has already been done—database corruption.
But there is only one corrupt block, in the WS_APP_DATA01.DBF file. Just one block out of thousands of perfectly good blocks. With datablock corruption, the solution to the problem is to restore the entire datafile from backup, and then apply archivelogs to recover the file. Because the datafile contains parts of the Woodscrew, woodscrew_orders, and woodscrew_inventory tables, these objects will be offline until the file can be restored and recovered. The brief outage for the hardware fix now has been extended to include the restore of a file from backup, and then the application of the archivelogs to that file. The tablespace will not be available until recovery completes.
Recovery Manager (RMAN) Provides Media Recovery of Data Blocks When RMAN is utilized for backups, you can use those backups to restore a single block from the last good backup, and then perform media recovery on that block (apply archive log changes, if there are any). You can do it for a single block or for a list of corrupt blocks, and the tablespace stays online while you do recovery. In fact, the corrupt table stays online, too. We discuss this in Chapter 8. |
This chapter is from Oracle Database 10g High Availability with RAC, Flashback & Data Guard, by Hart and Jesse. (McGraw-Hill/Osborne, 2004, ISBN: 0072254289). Check it out at your favorite bookstore today. Buy this book now.
|
Next: Waiting for the File to Restore from Tape >>
More Oracle Articles
More By McGraw-Hill/Osborne