November 3, 2004

Of Dell, dkms and dimwittery

Dell distributes machines which they support Linux on, and now they support us too. Hooray! However, they haven't quite gotten the whole Thing, yet. Ran into the following today, which caused some eyebrows to rise in the office since we do, after all, write ZenWorks Linux Management (nee Red Carpet).

Dell, you see, has this problem. They sell boxes which have RAID hardware controllers in them which require drivers. These drivers (under Linux) are dependent on the particular version of the kernel which is presently running. This is not surprising, as they are kernel modules. In any case, I spent some time yesterday assembling a system composed mostly of a PowerEdge 2650 with a PERC 3/DC (a badged MegaRAID card) and a PowerVault 2205 RAID shelf. Since we purchased the various components of this system separately, it turned out that we needed to upgrade the drivers for the MegaRAID card after I'd installed Linux on the PowerEdge. No problem, Dell distributes drivers as RPM files... I located the driver files and downloaded the requisite rpms, after assuring their website that I was not a terrorist nor their butler.

I ended up with two rpms: dell-perc-2.10-yada.i386.rpm (okay, no mystery there) and dkms-yada-noarch.rpm. Hah? Well, these came in a tarball amongst other stuff, including a small install script. Said script was nought more than:


rpm -Uvh dkms-yada-noarch.rpm
rpm -Uvh dell-perc-2.10-yada.i386.rpm

Um, okay. Feeling a tad cavalier, I run that. dkms (whatever that is) installs, and then the fun starts. It turns out that dkms stands for Dynamic Kernel Module Support, or some such fuckery. As rpm starts to load the dell-perc drivers, it copies over a bunch of files (fine) and then starts running some strange script. The script informs me that it is checking its prebuilt version against the running kernel; that the check failed; and finally that, because I don't have the current kernel source rpms installed, it can't build and tag a new version of the driver or make a new initrd, and hence it's not recommended that I reboot the machine. It says all this very fast, in that wonderful I'm-scrolling-past-you-isn't-it-fun-reading-at-115200-baud? sort of way (said Pooh). Then it stops. Apparently, the script has exited successfully, because rpm now is convinced that the dell-perc drivers are installed.

However, they're not.

See, I don't have the kernel sources. And Dell apparently relies on this dkms thing to build the drivers at install time (or, upon close examination, even at machine startup maybe...there's this evil looking init script, see...) and then make a new initrd for the machine with said drivers in it. I'm fortunate; the RAID volume I'm trying to see does not contain my system's boot drive. If it did, then any of the prior oopses would have rendered my system unable to boot.

This, of course, is a problem for our system management product, which relies on rpm. See, as far as RPM is concerned, everything is hunky-dory. However, dkms has failed to properly compile/make the drivers and initrd. If this were a system volume, and if this upgrade had been triggered by (let us say) a dependency check due to the installation of a new kernel version by zlm/red carpet, well, then, as soon as the box cycled, pssssht that's it, no more RAID volume.

I'm not entirely sure what the answer is. One possibility is that Dell should have the dell-perc driver actually have an rpm-level dependency on the kernel-sources. The problem with this is that the driver rpm doesn't know in advance which version of the kernel will be running on the machine, and I don't know if you can have a dependency on a package whose identity depends on the running kernel. If so, problem partially solved; if not, this doesn't help.
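Short of a real rpm dependency, one could at least approximate the check outside of rpm, before the driver package ever goes on. What follows is only a sketch: it assumes the common convention that /lib/modules/<version>/build points at the kernel build tree for that version, and the function name and the optional root argument are made up for illustration. It is not anything Dell or dkms ships.

```shell
#!/bin/sh
# Hypothetical preflight check: is there a kernel build tree for a given kernel?
# Assumes the usual /lib/modules/<version>/build convention; names are illustrative.

# have_build_tree KERNEL_VERSION [ROOT]
# Succeeds only if ROOT/lib/modules/KERNEL_VERSION/build is a directory
# (symlinks are followed) that contains a Makefile.
have_build_tree() {
    kver=$1
    root=${2:-}
    tree="$root/lib/modules/$kver/build"
    [ -d "$tree" ] && [ -f "$tree/Makefile" ]
}

# Example use before installing the driver rpm:
#   have_build_tree "$(uname -r)" || echo "no build tree for $(uname -r)" >&2
```

Note that this only answers for the kernel you ask about. It does nothing for a kernel that gets installed later, which is exactly the case that worries me.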

Even if so, however, we're not out of the woods. Just having the kernel sources doesn't guarantee that the additional steps of building the driver and/or initrd will be successful, but as far as I can tell, Dell has put all the additional hoo-hah into the postinstall script in the RPM - which means there's no way for that script to return a fail and have that result reflected in the rpm transaction.
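Since the rpm transaction can't be trusted to report the failure, whatever drives the upgrade would have to go look for itself afterwards. Here is a rough sketch of such a check, under stated assumptions: the module is assumed to be called megaraid, modules are assumed to live under /lib/modules/<version> with a .o or .ko suffix (possibly compressed), and the function name and root argument are invented for the example.

```shell
#!/bin/sh
# Hypothetical post-install check: did a driver module actually land for a kernel?
# rpm's exit status cannot tell us, so we look at the filesystem instead.

# module_present MODULE KERNEL_VERSION [ROOT]
# Succeeds if a file named MODULE.o or MODULE.ko (optionally with a further
# suffix such as .gz) exists anywhere under ROOT/lib/modules/KERNEL_VERSION.
module_present() {
    mod=$1
    kver=$2
    root=${3:-}
    dir="$root/lib/modules/$kver"
    [ -d "$dir" ] || return 1
    found=$(find "$dir" -type f \( -name "$mod.o" -o -name "$mod.o.*" \
        -o -name "$mod.ko" -o -name "$mod.ko.*" \) | head -n 1)
    [ -n "$found" ]
}

# Example use after the rpm transaction, before anyone schedules a reboot:
#   module_present megaraid "$(uname -r)" || echo "driver missing, do NOT reboot" >&2
```

This only looks at the module on disk. It says nothing about whether the initrd was rebuilt with the driver in it, which is the other half of the problem.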

I have nightmares about things like this: tools which are designed to allow me to manage servers remotely, including box OS updates and reboots, ending up hosing something critical like, say, the RAID shelf holding the box OS and data. Things that purport to make the box safer and more reliable end up serving as a point of vulnerability - all because somebody made a dumb design decision as to how to distribute their drivers.

Personally? I would either avoid the entire 'on the fly fuckery' of dkms, or, if that's not viable (and I'm not qualified to say it is or isn't), then force manual driver installation in order to avoid luring the user into upgrade practices which could kill their installation due to remote procedures which aren't safe.

Posted by jbz at November 3, 2004 4:25 AM
