October 13, 2006

OMG! I lost my RPM Database!

Yeah, so what do you do then? All the sources say...pray. Or reinstall. Or take some convoluted set of steps that are akin to dancing thrice round a cattail thrush on the third full moon of the year while wearing the ooze of a great-grandfather snail and chanting the nine billion names of God.

Well, here's what I did.

First, the situation: I have a Red Hat Enterprise Linux 3 machine. For reasons we won't go into, the RPM database was entirely erased. Not just corrupted; it was hurt so badly that a new, blank one was put in its place, and rpm happily claimed the machine had no packages on it. For other reasons we won't go into, reinstalling on this machine was possible but not desirable; the machine is remote (in a colo) and is running a role that would have required the building of a substitute box, switchover, travel, reinstall, etc.

No, it's not a critical role.

Anyhow, the machine in question hadn't been updated for several months - it was this overdue update that had caused the freakout. I was using the Ximian/Novell rug/rcd product to update it - this system front-ends RPM, though. It had worked out what needed to be done, resolved dependencies, downloaded the requisite rpm files for the update into a cache dir, and called RPM to remove the conflicting (old) packages when Something Went Wrong.

After thrashing around trying to rebuild the db using the tools (failed), trying to create a new db by figuring out what files I had on the system (bwahahahah), and looking at the originally-installed rpm db state file in /usr/lib/rpmdb (as opposed to the /var/lib/rpm/rhel-3as-i386/redhat location of the running database), I came within a hair of giving up.

Then I found out that rpm sticks a cron job into RHEL installations (and FC installations?) that simply dumps the output of rpm -qa to a logfile (/var/log/rpmpkgs) once a night. Thanks to logrotate, that logfile is rotated once a week. It had been several days since the incident, so the main file had already been overwritten with the empty dump from the blank database, but the rotated copy, /var/log/rpmpkgs.1, still held the full list, so I grabbed that.

Inside that file was a one-per-line list of everything that had been installed on my system, each entry written as a package file name.
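Before copying the list anywhere, it's worth making sure you have the right one. Here's a minimal sketch, assuming the rotated copies are uncompressed and named rpmpkgs, rpmpkgs.1, rpmpkgs.2 and so on; pick_pkglist is a hypothetical helper name, not something that ships with RHEL. It prints the newest list that isn't empty, which skips a current file that was overwritten by a blank database.

```shell
#!/bin/sh
# Hypothetical helper: print the newest non-empty rpmpkgs list.
# Assumes uncompressed rotations named rpmpkgs, rpmpkgs.1, rpmpkgs.2, ...
# The log directory defaults to /var/log but can be passed as $1.
pick_pkglist() {
    logdir=${1:-/var/log}
    for f in "$logdir/rpmpkgs" "$logdir"/rpmpkgs.[0-9]*; do
        # -s: the file exists and is larger than zero bytes
        if [ -s "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    echo "no non-empty rpmpkgs list found in $logdir" >&2
    return 1
}
```

Running pick_pkglist with no arguments checks /var/log. If your logrotate setup compresses old logs, or the overwritten file contains error messages instead of being empty, this check won't catch it, so eyeball the result anyway.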

I took that file to a machine that had the full, current RHEL3 package set mounted on a directory. I then created a new, blank RPM database:

rpm --dbpath /home/jbz/newrpmdb/ --initdb

Then I used the mounted copy of the package repository and xargs to do a series of rpm -i --justdb commands, which 'install' rpm packages but only perform the db modification steps:

cat rpmpkgs.1 | xargs --replace={} rpm -ivh --force --nodeps --justdb --dbpath /home/jbz/newrpmdb/ /(path-to-rhel-3as-i386-packagerepo-rpms)/{} > install.log 2>&1

The --force and --nodeps flags were required because I was adding one rpm at a time (via xargs) to a blank database, so every dependency check would fail. I didn't care: the packages were already installed on my machine, and I just needed the DB to reflect that.
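The same rebuild can also be written as a loop instead of xargs, which makes it easy to record the packages that aren't in the repository as it goes, rather than digging them out of install.log afterwards. This is a sketch, not the exact command used above: rebuild_justdb and its arguments are hypothetical names, and setting RPM=echo turns it into a dry run that only prints the rpm commands.

```shell
#!/bin/sh
# Hypothetical helper: rebuild an RPM database with --justdb, one package
# at a time, logging any package file missing from the repository.
#   $1 = package list (one file name per line, e.g. rpmpkgs.1)
#   $2 = repository directory containing the .rpm files
#   $3 = database directory (passed to --dbpath)
#   $4 = file to receive the names of missing packages
# Set RPM=echo for a dry run.
rebuild_justdb() {
    list=$1 repo=$2 dbpath=$3 missing=$4
    rpm_cmd=${RPM:-rpm}
    : > "$missing"                       # start with an empty missing list
    while IFS= read -r pkg; do
        [ -n "$pkg" ] || continue        # skip blank lines
        if [ -f "$repo/$pkg" ]; then
            # --justdb: record the package in the database, touch no files
            # --force/--nodeps: each package goes into a near-blank db alone
            $rpm_cmd -ivh --force --nodeps --justdb --dbpath "$dbpath" \
                "$repo/$pkg" < /dev/null
        else
            printf '%s\n' "$pkg" >> "$missing"
        fi
    done < "$list"
}
```

For example, `rebuild_justdb rpmpkgs.1 "$REPO" /home/jbz/newrpmdb/ missingrpms.txt > install.log 2>&1`, with REPO set to your repository directory. The stdin redirect on rpm is just a precaution so nothing inside the loop can eat the package list.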

Once that was complete, I had a new RPM database in /home/jbz/newrpmdb. I copied its files to the affected machine, put them into /var/lib/rpm, and then ran rpm --rebuilddb just to be safe. That completed without a hitch.

At this point, I had a mostly-complete installation. There were a couple of problems, though. The main one was that several packages had been updated since the machine was last refreshed, so some package file names in the repository didn't match the ones in the rpmpkgs.1 file. As a result, the logfile (install.log) contained a bunch of errors where rpm couldn't find the file it had been asked to install, because the current repository used new file names.

Note: This is apparently because Red Hat (and others?) change the package name when updating packages, by infixing versions. For example, suppose aspell- is updated; the new package will have its true name, and that version might be renamed aspell-2: I presume this is in case there are dependencies that rely on the earlier version of the package; it's kept in the update channels but marked up so that the updaters understand that it isn't the current version. This is a guess, though.

In any case, I had several packages which hadn't been installed, spread throughout this massive logfile. I used the following to get a simple list of which ones got missed:

grep "No such" install.log | cut -c 70- - | cut -d " " -f 1 > missingrpms.txt

Note that the '70-' reflects the length of the repository pathname I was using: cut -c 70- keeps each line from the 70th character onward, which on my system started right at the filename in question. Your mileage will vary. The second cut sets the delimiter to a space and keeps only the first field, discarding everything from the first space onward and leaving only the filename.
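If you'd rather not count characters, the filename can be pulled out by pattern instead. This is a sketch, assuming each "No such" line in install.log contains the package path as a whitespace-separated token ending in .rpm (possibly followed by a colon). The exact wording of rpm's error messages is an assumption here, so check a line or two of your own log first. missing_from_log is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the bare file name of every package that rpm
# reported as "No such file" in the given log, regardless of path length.
# Assumes the path appears as one token ending in ".rpm" or ".rpm:".
missing_from_log() {
    grep "No such" "$1" |
    awk '{
        for (i = 1; i <= NF; i++) {
            t = $i
            sub(/:$/, "", t)                  # tolerate "path.rpm:" forms
            if (t ~ /\.rpm$/) {
                n = split(t, parts, "/")      # keep text after the last /
                print parts[n]
            }
        }
    }'
}
```

Usage would be `missing_from_log install.log > missingrpms.txt`, producing the same kind of list as the grep/cut pipeline without the hard-coded column.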

I then got lazy and, since my list was only a few lines long, manually copied those RPMs over to the affected machine and force-installed them using --justdb. You may need/want to script this update.
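One way to script it: since the missing entries are old file names, match them against the repository by package name and architecture instead of by exact file name. This is a sketch, assuming file names follow the usual name-version-release.arch.rpm pattern (package names may contain hyphens; versions and releases don't). pkg_name, pkg_arch and find_replacements are hypothetical helpers. The script prints every match, so if the repository holds several versions of a package you still have to pick one.

```shell
#!/bin/sh
# Hypothetical helpers for mapping stale package file names to the
# current files in a repository. Assumes name-version-release.arch.rpm.

pkg_name() {  # foo-1.0-2.i386.rpm -> foo
    printf '%s\n' "$1" | sed -E 's/-[^-]+-[^-]+\.[^.]+\.rpm$//'
}

pkg_arch() {  # foo-1.0-2.i386.rpm -> i386
    printf '%s\n' "$1" | sed -E 's/.*\.([^.]+)\.rpm$/\1/'
}

# $1 = file listing missing package file names, $2 = repository directory.
# Prints the path of every repository file with the same name and arch.
find_replacements() {
    while IFS= read -r old; do
        [ -n "$old" ] || continue
        name=$(pkg_name "$old") arch=$(pkg_arch "$old")
        for f in "$2"/*.rpm; do
            [ -f "$f" ] || continue      # glob matched nothing
            new=${f##*/}                 # strip the directory part
            if [ "$(pkg_name "$new")" = "$name" ] &&
               [ "$(pkg_arch "$new")" = "$arch" ]; then
                printf '%s\n' "$f"
            fi
        done
    done < "$1"
}
```

The printed files can then be copied to the affected machine and fed to the same rpm -ivh --force --nodeps --justdb command; without --dbpath, rpm uses its default database in /var/lib/rpm.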

At that point, I ran rpm -aV to verify the installation. I came up with approximately forty lines of variance, mostly permissions variance, which is not at all out of whack for a server with 350 days of uptime that was installed two and a half years ago. No major issues have cropped up. Looking through them, the variances all appear to involve config file changes, permissions on doc files, or (in five or six cases) missing files from packages I have customized. I will mark this 'successful' and count myself lucky.
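To get a quick tally of what rpm -Va is complaining about before reading through it line by line, something like the following may help. It's a rough sketch that assumes the usual output layout: an attribute string such as S.5....T (or the word missing), an optional one-letter file-type marker such as c for config files, and then the path. summarize_verify is a hypothetical name, and the categories are only a first cut.

```shell
#!/bin/sh
# Hypothetical helper: tally `rpm -Va` output read from stdin into
# missing files, config files, mode (permission) changes, and the rest.
# The second character of the attribute string is M when the mode differs.
summarize_verify() {
    awk '
        NF == 0                 { next }
        $1 == "missing"         { missing++; next }
        $2 == "c"               { config++;  next }
        substr($1, 2, 1) == "M" { mode++;    next }
                                { other++ }
        END {
            printf "missing=%d config=%d mode=%d other=%d\n",
                   missing, config, mode, other
        }'
}
```

Usage: `rpm -Va | summarize_verify`. Anything landing in "other" is the part worth reading closely.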

WARNING: Your mileage WILL vary. Messing with your RPM database is NOT FUN and should NOT BE DONE on a running system unless there is NO OTHER OPTION. I had no other option, so that's why I did it. I take no responsibility for any damage you may do to your system or data in trying any of this; I offer it purely as a last-ditch recovery option, just in case you find yourself with nowhere else to go.

Good luck.

Posted by jbz at October 13, 2006 3:50 PM