2011-05-14

One Roll Forward, Two Rolls Back

Failure

Sometimes there are days when nothing seems to go right. In data processing, those days seem to come on harder and heavier than elsewhere. A good backup strategy is one that can weather days when Murphy gangs up on you and still give you back your data.

In 1989 I was on a team administering a five-gigabyte accounting database with about five hundred users, and we had a "routine" database software upgrade to do. I had already upgraded the database software on the much smaller development database, and it came off without a hitch. Thursday night was to be the night of the upgrade of the production database, and then began a comedy of errors that was to last until after midnight on Friday.

We were already off to a bad start because we were reacting to pressure from management to get the upgrade done quickly. We did not have time to plan and coordinate properly, and began thinking "This is so straightforward -- what could go wrong?" Things can always go wrong. One must keep in mind, not the likelihood of something going wrong, but the time and pain involved in recovering from a worst-case failure.

In the spirit of cross-training, I was not doing the production upgrade. I called in at about eleven, and was asked to come in. The upgrade had gone awry, the production database had halted and would not restart, and support had recommended a full recovery from backup, using the old software.

I felt sick. Such a recovery would require loading some 36 backup tapes, which alone would take hours, and then applying dozens of archived log files. If any of the salient tapes or files were missing, that was it, no database.

Recovery

We called the data center to pull the tapes from the last backup. When they arrived we found that the last tape was missing. The errant tape could not be found, so we had to go back to the preceding set of tapes, and then roll forward twice as far. This was not good, because it doubled not only the already substantial time to roll forward, but also the chance that one of the log tapes would be missing or unreadable, in which case that would be all she wrote.

Tuesday night's tapes were all there, so we set about applying them to the database. It probably would have been a good idea to use the Tuesday tapes just for the tablespace that was incomplete on the Wednesday backup, and use Wednesday for the rest. That might have saved us some roll-forward time, but it was four in the morning at this point, and we were all lucky to be lucid, let alone incisive. Somehow, database disasters seem to peak between one and five in the morning, when the human mind is at its foggiest. Those are the conditions under which recovery procedures are put to the acid test.

It was about noon by the time we had applied all of the backup tapes to the database and were ready to begin the roll-forward. We had all been awake for over 24 hours by then. Luckily, the recovery was straightforward from this point onward -- just read the log files from the tapes and type their names into the database as it requested them -- but we called support just to make sure that everything was kosher. We knew how many files had to be applied, but each was taking anywhere from ten minutes to an hour and ten minutes to apply, so estimating the overall time to completion was impossible. The hard thing about this last part was waiting and watching the screen for endless minutes until the database requested its next log file.

In retrospect, we might have saved a little time and a lot of tension if we had just fed a file containing all of the log file names directly into the monitor program. But here again, we were all so ragged that it was hard to think straight. Furthermore, at the time we were exploring new territory with a big database, and rocking the boat any more than necessary seemed like a bad idea.
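For illustration, here is a rough sketch of that idea in modern terms -- the monitor program name ("dbmonitor"), its arguments, and the archive path are placeholders, not the actual tooling of the day:

    #!/usr/bin/env python3
    # Hypothetical sketch: answer the recovery monitor's "next log file?"
    # prompts from a prepared list instead of typing each name by hand.
    # "dbmonitor" and the archive path are placeholders.
    import glob
    import subprocess

    # Collect the archived log file names in the order they must be applied.
    log_files = sorted(glob.glob("/archive/logs/arch_*.log"))

    # One response line per prompt.
    responses = "\n".join(log_files) + "\n"

    # Feed the whole list to the monitor in one shot, so nobody has to
    # sit at the screen waiting for the next prompt.
    subprocess.run(["dbmonitor", "recover"], input=responses, text=True, check=True)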

It was not until half past midnight on Friday that I watched over my colleague's shoulder as he typed in the name of the last log file. We got back the messages, "log applied" and "recovery complete". We started up the application daemons, and verified that they were working. It was time to go home.

A few days later we performed a clean shutdown on the production database, started up the new software, and upgraded without a hitch.

Diagnosis

Why did the software upgrade succeed on the development database and fail on the production database? When doing the development upgrade, I had cleanly shut down the database as a matter of good procedure without even looking at the written instructions. Unfortunately, what seemed like cautious good sense to me was not so obvious to others with less experience. They had shut down the production database with a fast "shutdown abort" -- do not clean up, do not wait for users to log out. The fast shutdown was probably done out of impatience with the longer time to do a normal shutdown, and a lack of understanding of the resultant state of the database. In fact, the database was designed to restart and recover smoothly from a shutdown abort.

The software upgrade was unusual in that although the new version had a different file format, it was able to start an old-style database by reformatting each existing file the first time it was accessed. It was tantamount to crossing the Rubicon, however, since there was no way to change the files back to the old format.

Unfortunately, the software was capable of recovering from a fast shutdown and of translating the file formats on the fly, but not of doing both at once. The reformatting routines could not properly convert abnormally shut down database files that still needed instance recovery (transaction rollbacks, etc.), hence our predicament. In the attempt to start up with the new version, the database files all got converted to the new format and corrupted (since they had not been shut down cleanly), so neither the old nor the new version could start the database.

Lessons

I learned firsthand that in a recovery situation, more than one thing can go wrong at once, and things can go wrong with the backup itself. Robust and easy-to-administer software blinded us to the critical nature of this upgrade, and we got sloppy with the shutdown, imagining that the database would take it in stride just like everything else. In the wake of our experience, technical support and development decided not to handle a file format conversion this way again.

Still, I think that the main lesson I got from this exercise was in the importance of the human side. We started out on the wrong foot, and we fell down. We all ran ourselves out to the ragged edge of exhaustion and managed to pull the database back this time, but what about next time?
