For our poor DBA at Horatio's Woodscrews, it only gets worse. He needs to restore the file WS_APP_DATA01.DBF from backup and then perform media recovery using archivelogs. He contacts the backup administrator and asks that the file be restored from tape. After some shuffling around, the backup administrator finds the last backup and begins the tape restore. Because of the size of the file, this will take a few hours to get the file queued over the network from the tape jukebox to the database server. The DBA asks if there are any backups of the file on disk, knowing that there is plenty of unused space on the new SAN that could be used for disk backup of important files. The backup administrator just runs the tape servers, though, and knows nothing about the SAN. That's a different department. RAC and the Single Point of Failure The DBA for Horatio's Woodscrews is no fool. He knows that he needs to transition the data to a RAC cluster. As part of the preparation, he has configured the Sales Department's smaller database on a RAC cluster with two nodes. He set it up quickly, with minimal configuration, and it's been running smoothly for a few weeks. However, the network configuration issues that affected the production database had also affected the Sales system. One of the nodes in the cluster was not available on the network. The DBA wasn't worried, however, because the other node was still available. But one of the files was asking for media recovery, so the DBA issued the recovery command and it asked for an archivelog that was required for recovery. The DBA pressed ENTER to accept the default archivelog. But the media recovery session failed. He looked a little closer, and noted that the requested archivelog was on the unavailable node. Suddenly, his RAC cluster was not operational as he waited for the other node to become available so he could perform recovery. Rewinding the Database Perhaps the most difficult outage situations are those tricky logical errors introduced by the users themselves—when a user updates the wrong table, or updates the wrong values. These types of errors are tough to overcome because they are not perceived by the database as errors, but just another transaction. Typically, user errors do not occur in a vacuum; an erroneous update can occur alongside hundreds of correct updates. Pretty soon, the bad data is buried by thousands of additional updates. How can you fish just one of those transactions out? Can you “rewind” the database back to a previous point in time? At Horatio's Woodscrews, the problem was reported by the Accounting team. As they went through their month-end processing, they began to realize that the data seemed incorrect. All the roll-up values seemed to be impossibly high, even during a good month. They could not figure out what had happened, but the data had been incorrectly entered at some point in the morning. Now, in the afternoon, they came to the DBA and asked that he “start over” all of their tables as they had looked in the morning. The options for restoring just a few tables are limited, without rolling the entire database back to a previous point-in-time. Oracle has provided for a tablespace point-in-time recovery (TSPITR), where just a single tablespace is restored to a previous point. But that is labor-intensive, and the smallest unit that can be restored is the entire tablespace. The Accounting group does not want to redo all their work for the day, just the work in a few tables.
blog comments powered by Disqus |