Let me drop this in your lap. It's an extremely serious problem for users of a specific RAID system (you know, the people who are paranoid that anything might go wrong with their data or the availability of their server): it crashes the server and corrupts that data. The evidence (see also the bug report referenced below!) suggests that SANE may be the guilty party. I will simply reproduce the last message I sent to some of the parties involved (with [...] used to omit irrelevant text and to avoid divulging identities). It may be rather verbose, but you never *really* know what is exactly relevant and/or convincing and/or interesting.
Raf Schietekat wrote:
> Note: ***Urgent***: If you (Mandrake and maybe IBM) would like to have me perform specific tests on my system, perhaps with Mandrake 9.2 RC1, it will almost have to be this week, because next week I'd like to bring the server into production. Please use this opportunity!

No reaction, BTW, and now it's too late (I probably should have come here earlier, but I did not know what SANE was, and then my message was blocked for a while before I sent it again), unless it would be to help with a very targeted, convincing, and quick intervention (I would have to invest time in a complete reinstallation, which I am obviously reluctant to do). My workaround will be a cron(-like?) task that disables anything related to scanners every minute or so (the frequency of the existing msec security check), to protect against accidental updates that reinstate the code.

> Brief description: Mandrake 9.1 crashes systems with ServeRAID. Extensive report below, including a reference to a previous bug report, currently marked as needing further information (well, here is the info).
>
> Raf Schietekat wrote:
> [...]
>> For [...], whom I've included in cc, a résumé, in case you want to step in: I've been test-running an IBM xSeries 235 with ServeRAID 5i for several weeks, with Mandrake 9.1 (probably still the most recent version). Yesterday, I inserted two 3Com NICs in bus B, which also carries the ServeRAID 5i card. To test that the latter was still independently running at the full 100 MHz speed given in the documentation, and not dragged down to the NICs' 33 MHz, I did "time tar cf - / | wc -l", which showed about 7.5 MB/s throughput as before (unless it was more like 10 MB/s before; I'm not exactly sure). I then used drakconf to see whether the NICs were identified correctly. I did this from a remote ssh -X session, which froze up. I could not open another ssh connection.
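The cron workaround mentioned above could be sketched roughly as follows. The list of scanner tools and the choice of chmod (rather than uninstalling the packages) are my assumptions, not something from the original report; adjust to what is actually installed:

```shell
#!/bin/sh
# Sketch: strip the execute bit from anything that could trigger the
# SCSI "scanner" probe, so an accidental package update cannot quietly
# reinstate it. An /etc/cron.d entry to run this every minute (matching
# the msec check frequency) could look like:
#   * * * * *  root  /usr/local/sbin/disable-scanners.sh
for tool in /usr/bin/scannerdrake /usr/bin/sane-find-scanner /usr/bin/scanimage
do
    if [ -x "$tool" ]; then
        chmod a-x "$tool"   # leave the file in place but unexecutable
    fi
done
```

This deliberately leaves the files on disk, so undoing the workaround later is a matter of `chmod a+x` rather than a reinstall.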
>> On the console itself, the mouse pointer was still moving, but I could not type anything into the logon screen. The bottom two drives were spinning continuously, while the top one wasn't doing anything, this for a RAID 5 setting involving all three drives. Since nothing seemed to work, I did a reset (the small button; I hope I shouldn't have used, e.g., the power button instead). During reboot, the file system proved to be corrupted and could not be repaired (I will have to find out how to do that, or reinstall everything).
>>
>> After some further research using www.google.com for ["Mandrake 9.1" ServeRAID] (which at first didn't seem necessary, because I had repeatedly and successfully done all these steps before, and the only new thing was the two NICs on bus B, the same bus that carries the ServeRAID 5i card), it appears that I may have been bitten by what's dealt with on the following page:
>>
>> http://qa.mandrakesoft.com/show_bug.cgi?id=3421
>> (this is where I saw Thierry Vignaud's address; I've found [...]'s address in /usr/sbin/scannerdrake on a Mandrake 9.0 installation)

> I've got my system up and running again. This involved the following:
> - /dev/sda8 had disappeared, although its neighbours were still there;
> - I tried MAKEDEV, but this uses /usr/bin/perl, which lives on /dev/sda7, which was not yet mounted;
> - I did "mount /usr";
> - I did ./MAKEDEV;
> - I rebooted, and things seemed fine.
> Then I wanted to try a few things to see whether I could pinpoint the problem. Here is a complete account of what I did, probably erring on the side of giving too much information, but in the hope that it will help you fix Mandrake's configuration managers etc. (I suggest that a probe for ServeRAID precede and disable any probe for a scanner, perhaps with user input, unless the scanner probe can be changed so that it does no damage to the ServeRAID controller card configuration).
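The recovery steps above can also be done without MAKEDEV (and therefore without mounting /usr first) by recreating the node directly. Major 8 is the standard Linux block major for SCSI disks, and partition N of sda is minor N, but do verify that against a surviving neighbour node before trusting it. A sketch, to be run as root:

```shell
# Confirm the numbering scheme against a neighbour that survived:
ls -l /dev/sda7   # expect something like: brw-rw----  1 root disk  8,  7 ...
# Recreate the lost node: block device, major 8 (sda), minor 8 (partition 8)
mknod /dev/sda8 b 8 8
chown root:disk /dev/sda8
chmod 660 /dev/sda8
```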
> The system now has only a(n extra) NIC on bus A, which is separate from bus B, the bus that also carries the ServeRAID controller card. If I do "# scannerdrake" from a remote ssh -X session (I like to work from my laptop; the server is in a little server room), the system wants to install some packages, but I refuse to cooperate. It then says that it is scanning, or something (gone too fast for me to read), and then it says "IBM SERVERAID is not in the scanner database, configure it manually?" (an obvious sign that something is going wrong with the scanner probe). I respond No. It then says "IBM 32P0042a S320 1 is not in the scanner database, configure it manually?". I don't even know what that is. I respond No. Then it does the same for "IBM SERVERAID" again; I respond No. And the same again for the other one; I respond No. Then I get the following panel:
> - title: Scannerdrake
> - text: There are no scanners found which are available on your system.
> - button: Search for new scanners
> - button: Add a scanner manually
> - button: Scanner sharing
> - button: Quit
> I persevered and clicked "Search for new scanners"; well, that's the same as before, from just after the scanning. No crash yet. I did Quit.
> Then I did vi `which harddrake2`, and I tried to add the line that [...] suggested (next if $Ident =~ "SCANNER";), but then vi froze (perhaps some of the file was still in memory from a previous vi session, but then it wanted to access the disk?). The other ssh sessions continued to work, unlike during the previous failure; I tried man perl in another one to try to find an explanation for double quotes ([...]) vs. slashes (harddrake2), but I got the error "-bash: /usr/bin/man: Input/output error", repeatedly.
> I can still open other ssh sessions, and the console itself works, but I see that all 3 drives have an amber status light (not the green activity light; if I remember correctly, the status light is normally off), and that the "System-error LED" is lit on the "Operator information panel" (the only other lit signs are "Power-on LED" and "POST-complete LED"), with also one LED lit in the "diagnostic LED panel" inside the computer, next to a symbol of a disk and the letters "DASD".
> When I look next, the console has gone from graphics to text mode and is filling with messages about "EXT3-fs error", "ext3_reserve_inode_write: IO failure", "ext3_get_inode_lock: unable to get inode block". Meanwhile, the remote ssh sessions are still responsive. I don't try anything on the console, and use a remote ssh session to try "# shutdown -h now" as root, but obviously the command cannot be read from disk (error message "bash: shutdown: command not found"). ctl-alt-del on the console's keyboard: same thing (this causes init to (try to) invoke shutdown). I then did a reset (actually a power cycle; just a reset would have been better).
> The three drives were still marked defunct (status lights on). I used the ServeRAID support CD to boot, and could set two of the physical drives online, but the last one did not have that right-click menu option (I even set the second one defunct again, and was then able to bring the third online, but the option was missing on the second one). So I briefly removed the second drive from its hot-swap bay, and when I inserted it again it started getting rebuilt from the other drives, and (according to the log) completed a little over an hour later (for 30+ GB of disk capacity, of which maybe less than 1 GB in use, if that matters). I tell ServeRAID Manager (?)
to reboot, and then I'm stuck with a garbled Mandrake splash screen and a succession of:
> Boot: linux-secure
> Loading linux-secure
> Error 0x00
> and then a succession of just:
> Boot: linux-securere
> ctl-alt-del works (but brings no salvation).
> Was data lost during the reset/power cycle (hopefully not during the rebuild, because that would defeat the purpose of having a RAID), or as early as the corruption of the ServeRAID controller card that (ultimately?) set the drives to defunct state? Apparently the boot doesn't even get to the stage where it would decide about the clean state of the file systems, so this is not something we can afford on a system in production (evidence that recovery is not a simple matter and may involve data recovery from backup, unless *perhaps* a boot floppy takes the system past this stage, after which ext2/ext3 gets a chance to repair itself; but I have no boot floppy... (will make one now, though, next chance I get)).
> I reboot into diagnostics (PC DOCTOR 2.0, apparently a specific feature of the IBM server), and the SCSI/RAID Controller test category passed.
> Next I will proceed to reinstall the whole system from scratch.

>> I'm not sure yet, though (why hasn't this happened before, and has a conclusion been reached?), which is why I've also cc'ed [...]. It seems strange, however, if this is indeed the problem, that a hardware adapter card should prove so vulnerable to a probing method used for a different device (a scanner), but then again I have no close knowledge of these issues.
>>
>> BTW, the machine is not yet in production (I was going to do that, but I guess I can now wait a few days), and is available for tests.
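On the boot floppy mentioned above: on Mandrake 9.x the stock tool for this is mkbootdisk. This sketch assumes the floppy drive is /dev/fd0 and that the currently running kernel is the one that should be bootable:

```shell
# Write a boot floppy for the running kernel (run as root,
# with a blank floppy inserted; /dev/fd0 is an assumption)
mkbootdisk --device /dev/fd0 "$(uname -r)"
```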
>> I still think it's really unfortunate that there is no list of known *in*compatibilities, because who would suspect, with ServeRAID support, or at least drivers, available for SuSE, TurboLinux, Caldera (SCO Group, the enemy!), and RedHat, that Mandrake would pose a problem? The same goes for Mandrake's site, of course (all of IBM is just "known hardware", and the xSeries 235 and ServeRAID 5i are simply absent).
>>
>> http://www-1.ibm.com/servers/enable/site/xinfo/linux/servraid
>> (this is where I saw the address ipsli...@us.ibm.com)
>>
>> http://www.mandrakelinux.com/en/hardware.php3
>>
>> [...]

Raf Schietekat <raf_schiete...@ieee.org>
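PS: one nit on the "time tar cf - / | wc -l" check quoted above: wc -l counts newline bytes in the tar stream rather than its size, so a byte count plus the elapsed time gives the MB/s figure directly. A small sketch (using /etc instead of / purely to keep the example quick; the arithmetic is integer):

```shell
# Throughput through tar: bytes moved divided by elapsed seconds
start=$(date +%s)
bytes=$(tar cf - /etc 2>/dev/null | wc -c)
end=$(date +%s)
elapsed=$((end - start))
[ "$elapsed" -gt 0 ] || elapsed=1   # /etc may tar in under a second
echo "$((bytes / elapsed / 1048576)) MB/s"
```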