>Number:         160777
>Category:       kern
>Synopsis:       RAID-Z3 causes fatal hang upon scrub/import on 9.0-BETA2/amd64
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Sat Sep 17 01:30:11 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Charlie
>Release:        9.0-BETA2/amd64
>Organization:
none
>Environment:
FreeBSD  9.0-BETA2 FreeBSD 9.0-BETA2 #0: Wed Aug 31 18:07:44 UTC 2011     
r...@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:
RAID-Z3 causes fatal hang upon scrub/import on 9.0-BETA2/amd64.

By fatal hang, I mean: (1) the hard drive LEDs freeze in a static state of on 
or off (rather than flashing to indicate drive activity) and stay there; (2) 
the console no longer responds to any keypress events such as space bar or 
Control-Alt-F2; (3) the system entirely stops responding to pings.

I first noticed this when I ran "zdb pool" while a "zpool scrub pool" was in 
progress, and the system crashed.  I had thought "zdb pool" would be a 
read-only operation that would just give me some interesting metadata to page 
through.  Rest assured, when I later tried to narrow down what was faulty, I 
didn't touch that command with a ten-foot pole (although in the configuration 
I confirmed as working properly, RAID-Z2, "zdb pool" didn't cause a problem).  
I think "zdb pool" must have consumed too much memory, and so the machine 
crashed.  This happened during the machine's first boot, which was also the 
boot in which I had created the array.

So, the first time I attempted "zpool import pool" after the initial creation, 
I could see all drives being accessed for about a minute (positive activity), 
but then the system stalled fatally, as described above.  I had tried 
"zpool scrub -s pool", and was only able to see the data at all by running 
"zpool export pool && zpool import -o readonly=on pool".  When I then tried 
importing it read-write again, it stalled again.  An unclean dismount of the 
pool is not needed to trigger the problem.  In fact, when I repeated the 
problem with a freshly created zpool (after a proper zpool destroy of the old 
one), I found that the "zpool import" or "zpool scrub" process alone triggered 
the fatal stall.

I sincerely hope this is helpful.  Unfortunately, I've switched to RAID-Z2 for 
now.  Rest assured, I would be able to do much more rigorous testing on ZFS.  
If this problem is confirmed and fixed by 9.0, I can offer to help uncover 
more bugs with a debugging-enabled kernel.  In the meantime I need to move 
forward.
>How-To-Repeat:
zpool create -O checksum=sha256 -O compression=gzip-9 pool raidz3 gpt/foo*.eli

zfs create -o checksum=sha256 -o compression=gzip-9 -o copies=3 pool/pond

zpool scrub pool
# or:
zpool export pool && zpool import pool

(Either of these seems to trigger the fatal stall described above.)

The following conditions may or may not be relevant.  I don't have the 
resources or time to check.  But, (1) the drives are 3TB each; (2) I 
partitioned each drive using GPT, with one large labelled partition allocated 
99% of the capacity; (3) I am using geli on the large partition.  If these 
factors look like the cause, note that when I create a RAID-Z2 pool instead of 
RAID-Z3, there is no problem at all.  I can also confirm that the whole of 
each drive is accessible, since I did a full dd over the entire drive 
(partition sector, metadata and all), so it is not a matter of the kernel not 
seeing the drive size properly.  In any case I would expect a graceful error 
from the kernel instead of this kind of stall.  I haven't attempted to 
investigate the stall condition itself, such as by kernel debugging, but the 
reproducibility of the problem leads me to suspect that might not be necessary.
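
For anyone trying to confirm this, the steps above could be wrapped in a small 
helper so that the RAID-Z2 (works) and RAID-Z3 (hangs) cases differ in exactly 
one word.  This is only a sketch: the names repro_cmds and DEVICES are made up 
for illustration, and the function just prints the commands rather than 
running them.

```sh
#!/bin/sh
# Sketch only: print the reproduction commands from this report for a given
# vdev type.  repro_cmds and DEVICES are hypothetical names, not part of any
# FreeBSD tool.  Nothing is executed; the commands are written to stdout.

# Same provider glob as in the report; override via the environment if needed.
DEVICES=${DEVICES:-'gpt/foo*.eli'}

repro_cmds() {
    vdev=$1            # raidz2 or raidz3
    pool=${2:-pool}    # pool name, defaults to "pool" as in the report
    # Unquoted heredoc: variables expand, but the glob is left as literal text.
    cat <<EOF
zpool create -O checksum=sha256 -O compression=gzip-9 $pool $vdev $DEVICES
zfs create -o checksum=sha256 -o compression=gzip-9 -o copies=3 $pool/pond
zpool scrub $pool
zpool export $pool && zpool import $pool
EOF
}
```

Running "repro_cmds raidz2" and "repro_cmds raidz3" and comparing the output 
shows the two command sequences side by side.  Piping the output to "sh -x" 
would actually run them; on RAID-Z3 I would expect the stall at the scrub 
line, so the export/import line would never be reached in that case.  The 
device glob is expanded by whichever shell finally runs the commands, so the 
working directory matters just as it does for the original command line.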
>Fix:
Unknown.  I can confirm that if I use RAID-Z2 and do many "zpool import" and 
"zpool export" commands back to back as well as "zpool scrub" then there is no 
problem at all.

>Release-Note:
>Audit-Trail:
>Unformatted:
_______________________________________________
freebsd-bugs@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"
