"Peter Fraser" writes:

> I had a disk drive fail while running RAIDframe.
> The system did not survive the failure. Even worse,
> there was data loss.
Ow.

> The system was to be my new web server. The system
> had 1 Gig of memory. I was working, slowly, on
> configuring apache and web pages. Moving to
> a chroot'ed environment was non-trivial.
>
> The disk drive died, the system crashed,

Oh.... so it *wasn't* just a simple case of a drive dying -- the system
crashed too... Well, RAIDframe can't make any guarantees when there's a
system crash -- if buffers haven't been flushed or there's still pending
meta-data to be written, there's not much RAIDframe can do about that...
"those are filesystem issues".

> and the
> system rebooted and came up. Removing the
> dead disk, replacing it with a new disk,
> and re-establishing the RAID was no problem.
>
> But why was there a crash? I would have thought
> that the system should run after a disk failure.

You haven't said what types of disks. I've had IDE disks fail that take
down the entire system. I've had IDE disks fail but the system remains
up and happy. I've had SCSI disks fail that have made the SCSI cards
*very* unhappy (and had the system die shortly after). None of these
things can be solved by RAIDframe -- if the underlying device drivers
can't "deal" in the face of lossage, RAIDframe can't do anything about
that...

You also haven't given any indication as to the nature of the crash, or
what the panic message was (if any). (e.g. was it a null-pointer
dereference, a corrupted filesystem, or something that went wrong in the
network stack?)

> And even more to my surprise, about two days
> of my work disappeared.

Of course, you just went to your backups to get that back, right? :)

> I believe the disk drive died about 2 days before
> the crash. I also believe that RAIDframe did
> not handle the disk drive's failure correctly

Do you have a dmesg related to the drive failure? e.g. something that
shows RAIDframe complaining that something was wrong, and marking the
drive as failed?
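For what it's worth, when RAIDframe does notice a dead component it says so
in the kernel message buffer, so the complaint should still be there in the
logs. A rough sketch of what to look for -- the sample messages below are
illustrative only (exact wording varies by driver and release), not a
transcript from any real machine:

```shell
# Illustrative dmesg excerpt; the "Marking ... as failed" line is the
# kind of complaint RAIDframe emits when it fails a component.
cat > /tmp/dmesg.sample <<'EOF'
wd1a: device timeout writing fsbn 102400 (wd1 bn 102400), retrying
raid0: IO Error.  Marking /dev/wd1a as failed.
EOF

# Pull out any component-failure complaints:
grep -i 'marking .* as failed' /tmp/dmesg.sample
```

On a live system, `raidctl -s raid0` would likewise show the component's
status; if neither the logs nor raidctl ever flagged the drive as failed,
that points at the driver layer rather than at RAIDframe itself.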
> and as a result all file writes to the failed
> drive queued up in memory,

I've never seen that behaviour... I find it hard to believe that you'd
be able to queue up 2 days' worth of writes without a) any reads being
done or b) noticing that the filesystem was completely unresponsive when
a write of associated meta-data never returned... (On the first write of
meta-data that didn't return, pretty much all IO to that filesystem
should grind to a halt.) Sorry... I'm not buying the "it queued up
things for two days"...

> when memory ran out the system crashed.
>
> I don't know enough about OpenBSD internals to
> know if my guess as to what happened is correct,
> but it did worry me about the reliability of
> RAIDframe.

I've been running RAIDframe (albeit not w/ OpenBSD) in both production
and non-production environments for 7+ years now... RAIDframe
reliability is the least of my worries :) (RAIDframe has also saved my
data and others' on various occasions over the years...)

> I am now trying ccd for my web pages and
> ALTROOT in daily for root. I have not had a disk
> fail with ccd yet, so I have not determined whether
> ccd works better.

"Good luck." (See a different thread for my thoughts on using ccd :)

> Neither RAIDframe nor ccd seems to be up to the
> quality of nearly all the other software
> in OpenBSD. This statement is also true of the
> documentation.

My only comment on that is that the version of RAIDframe in OpenBSD is
somewhat dated. You are also encouraged to find and read the latest
versions of the documentation, and to provide feedback to the author on
what you feel is lacking.

Later...

Greg Oster