On Monday 21 September 2015 23:02:39 Karel Gardas wrote:
> Hello,
> 
> While working on SR RAID1 checksumming support I have touched SR RAID
> internals (workunit scheduling), so I'd like to test SR RAID5/6
> functionality on a snapshot and on my tree to verify that I haven't
> broken anything while hacking on it. My current problem is that I'm
> unable to come up with a test that does not break RAID5 (I'm starting
> with it) after several hours of execution, even on a plain snapshot.
> My test is basically:
> - on one console, in a loop:
>   mount the RAID volume at /raid
>   rsync /usr/src/ to /raid
>   compute SHA1 sums of all files in /raid
>   umount /raid
>   mount /raid
>   verify the SHA1 sums -- on any mismatch fail the test, otherwise repeat
> - on another console, in a loop:
>   - offline a random drive
>   - wait a random time (up to a minute)
>   - rebuild the RAID onto the offlined drive
>   - wait a random time (up to 2 minutes)
>   - repeat
> 
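
For reference, that procedure could be scripted roughly as below. The
device names, mount point and checksum manifest path are illustrative,
not taken from your setup:

  #!/bin/sh
  # Console 1: write, checksum, remount and verify, in a loop.
  while :; do
      mount /dev/sd4a /raid || exit 1
      rsync -a /usr/src/ /raid/src/
      (cd /raid && find src -type f -exec sha1 {} +) > /tmp/sums
      umount /raid
      mount /dev/sd4a /raid
      (cd /raid && sha1 -c /tmp/sums > /dev/null) || {
          echo "sha1 mismatch -- test failed"
          exit 1
      }
      umount /raid
  done

  #!/bin/sh
  # Console 2: offline a random chunk, then rebuild onto it.
  set -- /dev/sd0d /dev/sd0e /dev/sd0f /dev/sd0g   # illustrative chunks
  while :; do
      eval chunk=\$$(jot -r 1 1 $#)   # pick one chunk at random
      bioctl -O $chunk sd4            # force the chunk offline
      sleep $(jot -r 1 1 60)          # up to a minute
      bioctl -R $chunk sd4            # rebuild onto the same chunk
      sleep $(jot -r 1 1 120)         # up to two minutes
  done
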
> Now, the issue with this is that I get SHA1 mismatches from time to
> time; usually the problematic file contains some garbage. Since I do
> not yet have a machine dedicated to this testing, I'm using a ThinkPad
> T500 with a single drive, on which I created 4 RAID slices inside the
> OpenBSD partition. Last week I was using vndX devices (backed by
> files), but that way I even hit a kernel panic (on a snapshot) like
> this one:
> http://openbsd-archive.7691.n7.nabble.com/panic-ffs-valloc-dup-alloc-td254738.html
> -- so this weekend I started testing with slices; so far no panic,
> but the data corruption remains. The snapshot I'm testing with is
> from last Sunday.
> 
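
For clarity, I'm assuming "4 RAID slices" means four RAID-type
disklabel(8) partitions assembled into a single softraid volume, i.e.
something along these lines (partition names are illustrative):

  # assemble four RAID-type partitions on sd0 into one RAID5 volume
  bioctl -c 5 -l /dev/sd0d,/dev/sd0e,/dev/sd0f,/dev/sd0g softraid0
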
> Let me ask: should SR RAID5 survive such testing, or is, for example,
> rebuilding onto an offlined drive considered an unsupported feature?

RAID5 should work (ignore RAID6 - it is still incomplete) and rebuilding 
should be functional:

 http://undeadly.org/cgi?action=article&sid=20150413071009

When I reenabled RAID5 I tested it as well as I reasonably could, but it 
still needs to be put through its paces. How are you offlining the drive? If 
you're
doing it via bioctl then it will potentially behave differently to a hardware 
failure (top down through the bio(4)/softraid(4) driver, instead of bottom up 
via the I/O path). If you can dependably reproduce the issue then I would 
certainly be interested in tracking down the cause.
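
One variable worth pinning down is whether the corruption correlates
with offlining a chunk while the previous rebuild is still running. A
possible guard for the offline loop, assuming plain "bioctl sd4"
reports the volume state in its status output:

  # wait for the volume to return to a fully Online state before
  # forcing the next chunk offline
  while bioctl sd4 | grep -q -e Rebuild -e Degraded; do
      sleep 10
  done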
