HomeOracle Page 7 - Oracle and Availability: Illustrated Downtime Scenarios
Waiting for the File to Restore from Tape - Oracle
Ever have to recover your servers from a flooded basement? Move a database to an new platform and need the weekend to test it but users need the database today? We will illustrate various database problems and which piece of technology could be employed to prevent the outage or to recover from it quickly. (From the book Oracle Database 10g High Availability with RAC, Flashback & Data Guard by Matthew Hart and Scott Jesse, ISBN: 0072254289, McGraw-Hill/Osborne, 2004.)
For our poor DBA at Horatio's Woodscrews, it only gets worse. He needs to restore the file WS_APP_DATA01.DBF from backup and then perform media recovery using archivelogs. He contacts the backup administrator and asks that the file be restored from tape.
After some shuffling around, the backup administrator finds the last backup and begins the tape restore. Because of the size of the file, this will take a few hours to get the file queued over the network from the tape jukebox to the database server. The DBA asks if there are any backups of the file on disk, knowing that there is plenty of unused space on the new SAN that could be used for disk backup of important files. The backup administrator just runs the tape servers, though, and knows nothing about the SAN. That's a different department.
Avoid Long Tape Restores with RMAN Flashback Recovery Area
Using new RMAN functionality, you can create backup jobs that always store at least the last full copy of the database on a disk location called the flashback recovery area. Then, when you take the next backup, RMAN will automatically age out the last backup, or you can set it up to move the old backups to tape from the disk area. When you go to initiate the restore from RMAN, it knows where the last best copy is. The restore can be instantaneous, as we can switch out the bad file with the new file and begin applying archivelogs immediately. We discuss this in Chapter 8.
Protect Against Node Failure with a Robust Archive Strategy
While RAC does provide a higher degree of protection against outages, you have to be careful how you configure the database so that you don't invent single points of failure. In this case, the archivelogs from Node2 are required for recovery, but Node2 is off the network temporarily. For help with archive strategies with RAC, see Chapter 5, in the section “Redo Logs and Media Recovery.” Also see Chapter 8, particularly Figure 8-8.
RAC and the Single Point of Failure
The DBA for Horatio's Woodscrews is no fool. He knows that he needs to transition the data to a RAC cluster. As part of the preparation, he has configured the Sales Department's smaller database on a RAC cluster with two nodes. He set it up quickly, with minimal configuration, and it's been running smoothly for a few weeks.
However, the network configuration issues that affected the production database had also affected the Sales system. One of the nodes in the cluster was not available on the network. The DBA wasn't worried, however, because the other node was still available.
But one of the files was asking for media recovery, so the DBA issued the recovery command and it asked for an archivelog that was required for recovery. The DBA pressed ENTER to accept the default archivelog. But the media recovery session failed. He looked a little closer, and noted that the requested archivelog was on the unavailable node. Suddenly, his RAC cluster was not operational as he waited for the other node to become available so he could perform recovery.
Rewinding the Database
Perhaps the most difficult outage situations are those tricky logical errors introduced by the users themselves—when a user updates the wrong table, or updates the wrong values. These types of errors are tough to overcome because they are not perceived by the database as errors, but just another transaction. Typically, user errors do not occur in a vacuum; an erroneous update can occur alongside hundreds of correct updates. Pretty soon, the bad data is buried by thousands of additional updates. How can you fish just one of those transactions out? Can you “rewind” the database back to a previous point in time?
Use Flashback Table to Restore a Table to a Previous State
In Oracle Database 10g, Oracle introduced the ability to rewind a table to a previous state without performing point-in-time recovery. This is called Flashback Table, and it's part of the Flashback Technologies discussed in Chapter 9.
You can also use LogMiner to review transactions in the archived redo logs in order to determine where exactly the bad data was entered, as well as to retrieve the good transactions entered after the bad transaction. LogMiner is discussed in Chapter 2.
At Horatio's Woodscrews, the problem was reported by the Accounting team. As they went through their month-end processing, they began to realize that the data seemed incorrect. All the roll-up values seemed to be impossibly high, even during a good month. They could not figure out what had happened, but the data had been incorrectly entered at some point in the morning. Now, in the afternoon, they came to the DBA and asked that he “start over” all of their tables as they had looked in the morning.
The options for restoring just a few tables are limited, without rolling the entire database back to a previous point-in-time. Oracle has provided for a tablespace point-in-time recovery (TSPITR), where just a single tablespace is restored to a previous point. But that is labor-intensive, and the smallest unit that can be restored is the entire tablespace. The Accounting group does not want to redo all their work for the day, just the work in a few tables.
This chapter is from Oracle Database 10g High Availability with RAC, Flashback & Data Guard, by Hart and Jesse. (McGraw-Hill/Osborne, 2004, ISBN: 0072254289). Check it out at your favorite bookstore today. Buy this book now.