On Monday 21 September 2015 23:02:39 Karel Gardas wrote:
> Hello,
>
> due to work on SR RAID1 checksumming support, where I've touched SR
> RAID internals (workunit scheduling), I'd like to test SR RAID5/6
> functionality on snapshot and on my tree to see that I've not broken
> anything while hacking on it. My current problem is that I'm not able
> to come up with a test which does not break RAID5 (I'm starting with
> it) after several hours of execution while using snapshot. My test is
> basically:
>
> - on one console, in a loop (sketched in the P.S. below):
>   - mount the raid on /raid
>   - rsync /usr/src/ to /raid
>   - compute sha1 sums of all files in /raid
>   - umount /raid
>   - mount /raid
>   - check the sha1 sums -- on failure, fail the test, otherwise repeat
>
> - on another console, in a loop:
>   - offline a random drive
>   - wait a random time (up to a minute)
>   - rebuild the raid with the offlined drive
>   - wait a random time (up to 2 minutes)
>   - repeat
>
> Now, the issue with this is that I get sha1 errors from time to time.
> Usually in such a case the problematic source file contains some
> garbage. Since I do not yet have a machine dedicated to this testing,
> I'm using a Thinkpad T500 with one drive, where I just created 4 RAID
> slices in the OpenBSD partition. Last week I was using vndX devices
> (and files), but that way I even got a kernel panic (on snapshot) like
> this one:
> http://openbsd-archive.7691.n7.nabble.com/panic-ffs-valloc-dup-alloc-td254738.html
> -- so this weekend I started testing with slices and so far there has
> been no panic, but still the data corruption issue. The last snapshot
> I'm using for testing is from last Sunday.
>
> Let me ask: should SR RAID5 survive such testing, or is, for example,
> rebuilding with an off-lined drive considered an unsupported feature?
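>
> P.S. for reference, the loop on the first console is roughly the
> following (just a sketch -- sd5a as the softraid volume partition is an
> example name and the exact rsync/find/sha1 invocations are illustrative
> rather than exact):
>
>   while true; do
>       mount /dev/sd5a /raid
>       rsync -a /usr/src/ /raid/
>       (cd /raid && find . -type f -print0 | xargs -0 sha1 > /tmp/sums)
>       umount /raid
>       mount /dev/sd5a /raid
>       (cd /raid && sha1 -c /tmp/sums) || { echo "sha1 mismatch"; exit 1; }
>       umount /raid
>   done
>
> The second console just repeats the offline / random sleep / rebuild /
> random sleep sequence described above.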
RAID5 should work (ignore RAID6 - it is still incomplete) and rebuilding should be functional:

http://undeadly.org/cgi?action=article&sid=20150413071009

When I re-enabled RAID5 I tested it as thoroughly as I reasonably could, but it still needs to be put through its paces.

How are you offlining the drive? If you're doing it via bioctl then it will potentially behave differently to a hardware failure (top down through the bio(4)/softraid(4) driver, instead of bottom up via the I/O path).

If you can dependably reproduce the issue then I would certainly be interested in tracking down the cause.
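For reference, by "via bioctl" I mean something like the following, assuming a softraid volume sd5 built from a chunk sd0d (names made up for illustration - see bioctl(8) for the exact usage):

    bioctl -O /dev/sd0d sd5    # offline the chunk through the softraid ioctl path
    bioctl -R /dev/sd0d sd5    # rebuild the volume onto that chunk
    bioctl sd5                 # check volume/chunk status while it rebuilds

A failing disk instead shows up as I/O errors coming back from the chunk while the volume is under load, so the two cases do not necessarily exercise the same error handling.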