"Peter Fraser" writes:
> I had a disk drive fail while running RAIDframe.
> The system did not survive the failure. Even worse
> there was data loss.

Ow.  

> The system was to be my new web server. The system
> had 1 Gig of memory.  I was working, slowly, on
> configuring apache and web pages. Moving to
> a chroot'ed environment was non-trivial.
> 
> The disk drive died, the system crashed, 

Oh.... so it *wasn't* just a simple case of a drive dying, but the 
system crashed too...  Well, RAIDframe can't make any guarantees when 
there's a system crash -- if buffers haven't been flushed or there's 
still pending meta-data to be written, there's not much RAIDframe can 
do about that... "those are filesystem issues".

> and the
> system rebooted and came up. Removing the
> dead disk, replacing it with a new disk,
> and re-establishing the RAID was no problem.
> 
> But why was there a crash? I would have thought
> that the system should run after a disk failure.

You haven't said what types of disks.  I've had IDE disks fail that 
take down the entire system.  I've had IDE disks fail but the system 
remains up and happy.  I've had SCSI disks fail that have made the 
SCSI cards *very* unhappy (and had the system die shortly after).  
None of these things can be solved by RAIDframe -- if the underlying 
device drivers can't "deal" in the face of lossage, RAIDframe can't 
do anything about that...

You also haven't given any indication as to the nature of the crash, 
or what the panic message was (if any).  (e.g. was it a null-pointer 
dereference, a corrupted filesystem, or something that went wrong 
in the network stack?)

> And even more to my surprise, about two days
> of my work disappeared.

Of course, you just went to your backups to get that back, right? :)

> I believe the disk drive died about 2 days before
> the crash. I also believe that RAIDframe did
> not handle the disk drive's failure correctly

Do you have a dmesg related to the drive failure?  e.g. something 
that shows RAIDframe complaining that something was wrong, and 
marking the drive as failed?  
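For reference, something like the following is what I'd look for (the 
device name `raid0` is just an example -- substitute whatever your 
RAID set is called):

```shell
# Show the status of the RAID set and its components; a dead drive
# should show up as "failed" in the component list.
raidctl -s raid0

# Check the kernel message buffer for any RAIDframe complaints
# logged around the time the drive died.
dmesg | grep -i raid
```

If `raidctl -s` never showed the component as failed, that would say a 
lot about what actually happened.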

> and as a result all file writes to the failed
> drive queued up in memory,

I've never seen that behaviour...  I find it hard to believe that 
you'd be able to queue up 2 days worth of writes without a) any reads 
being done or b) noticing that the filesystem was completely 
unresponsive when a write of associated meta-data never returned...  
(on the first write of meta-data that didn't return, pretty much all
IO to that filesystem should grind to a halt.  Sorry... I'm not buying 
the "it queued up things for two days"... )

> when memory ran out the system crashed. 
> 
> I don't know enough about OpenBSD internals to
> know if my guess as to what happened is correct,
> but it did worry me about the reliability of
> RAIDframe.

I've been running RAIDframe (albeit not w/ OpenBSD) in both 
production and non-production environments now for 7+ years...  
RAIDframe reliability is the least of my worries :) 
(RAIDframe has also saved mine and others' data on various occasions 
over the years...)
 
> I am now trying ccd for my web pages and 
> ALTROOT in daily for root, I have not had a disk
> fail with ccd yet, so I have not determined whether
> ccd works better.

"Good luck."  (see a different thread for my thoughts on using ccd :)

 
> Neither RAIDframe nor ccd seems to be up to the
> quality of nearly all the other software
> in OpenBSD. This statement is also true of the documentation.

My only comment on that is that the version of RAIDframe in OpenBSD 
is somewhat dated.  You are also encouraged to find and read the 
latest versions of the documentation, and to provide feedback
to the author on what you feel is lacking.

Later...

Greg Oster
