In 1989 I was on a team administering a five-gigabyte accounting database with about five hundred users, and we had a "routine" database software upgrade to do. I had already upgraded the database software on the much smaller development database, and it came off without a hitch. Thursday night was to be the night of the upgrade of the production database, and then began a comedy of errors that was to last until after midnight on Friday.
We were already off to a bad start because we were reacting to pressure from management to get the upgrade done quickly. We did not have time to plan and coordinate properly, and began thinking "This is so straightforward -- what could go wrong?" Things can always go wrong. One must keep in mind, not the likelihood of something going wrong, but the time and pain involved in recovering from a worst-case failure.
In the spirit of cross-training, I was not doing the production upgrade. I called in at about eleven, and was asked to come in. The upgrade had gone awry, the production database had halted and would not restart, and support had recommended a full recovery from backup, using the old software.
I felt sick. Such a recovery would require loading some 36 backup tapes, which alone would take hours, and then applying dozens of archived log files. If any of the salient tapes or files were missing, that was it, no database.
Tuesday night's tapes were all there, so we set about applying them to the database. It probably would have been a good idea to use the Tuesday tapes just for the tablespace that was incomplete on the Wednesday backup, and use Wednesday for the rest. That might have saved us some roll-forward time, but it was four in the morning at this point, and we were all lucky to be lucid, let alone incisive. Somehow, database disasters seem to peak between one and five in the morning, when the human mind is at its foggiest. Those are the conditions under which recovery procedures are put to the acid test.
It was about noon by the time we had applied all of the backup tapes to the database and were ready to begin the roll-forward. We had all been awake for over 24 hours by now. Luckily, the recovery was straightforward from this point onward -- just read the log files from the tapes and type their names into the database as it requested them -- but we called support just to make sure that everything was kosher.
We knew how many files had to be applied, but each was taking anywhere from ten minutes to an hour and ten minutes to apply, so estimating the overall time to completion was impossible. The hard part of this last stretch was waiting and watching the screen for endless minutes until the database requested its next log file.
In retrospect, we might have saved a little time and a lot of tension if we had just fed a file containing all of the log file names directly into the monitor program. But here again, we were all so ragged that it was hard to think straight. Furthermore, at the time we were exploring new territory with a big database, and rocking the boat any more than necessary seemed like a bad idea.
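For what it is worth, the idea of preparing that file ahead of time can be sketched in a few lines of Python. This is only an illustration under assumptions, not what we had on hand at the time: the `arch_<number>.log` naming pattern, the function names, and the output file are all hypothetical, and how such a file is actually fed to a recovery program depends entirely on the database software. The sketch sorts the log names by sequence number, flags any gaps in the sequence, and writes one name per line.

```python
import re
from pathlib import Path

# Hypothetical naming convention: arch_<sequence number>.log
LOG_PATTERN = re.compile(r"arch_(\d+)\.log")


def ordered_log_names(log_dir):
    """Return matching log file names sorted by numeric sequence number."""
    found = []
    for path in Path(log_dir).iterdir():
        match = LOG_PATTERN.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path.name))
    found.sort()  # numeric sort, so arch_10 comes after arch_2
    return [name for _, name in found]


def missing_sequences(names):
    """Return sequence numbers absent between the lowest and highest log."""
    seqs = [int(LOG_PATTERN.fullmatch(name).group(1)) for name in names]
    if not seqs:
        return []
    present = set(seqs)
    return [s for s in range(min(seqs), max(seqs) + 1) if s not in present]


def write_response_file(names, out_path):
    """Write one log file name per line, in the order they should be applied."""
    Path(out_path).write_text("".join(name + "\n" for name in names))
```

Checking for gaps before starting matters more than the typing it saves, since a single missing log file stops the roll-forward at that point.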
It was not until half past midnight on Friday that I watched over my colleague's shoulder as he typed in the name of the last log file. We got back the messages, "log applied" and "recovery complete". We started up the application daemons, and verified that they were working. It was time to go home.
A few days later we performed a clean shutdown on the production database, started up the new software, and upgraded without a hitch.
The software upgrade was unusual in that although the new version had a different file format, the new version was able to start an old-style database by reformatting the existing files the first time each was accessed. It was tantamount to crossing the Rubicon, however, since there was no way to change the files back to the old format.
Unfortunately, the software was capable of recovering from a fast shutdown and of translating the file formats on the fly, but not of doing both simultaneously. The reformatting routines could not properly convert database files that had been shut down abnormally and still needed instance recovery (transaction rollbacks, etc.), hence our predicament. In the attempt to start up with the new version, the database files all got converted to the new format and corrupted (since they had not been shut down cleanly), so now neither the old nor the new version could start the database.
Still, I think that the main lesson I got from this exercise was in the importance of the human side. We started out on the wrong foot, and we fell down. We all ran ourselves out to the ragged edge of exhaustion and managed to pull the database back this time, but what about next time?