On Wed, Jul 08, 2009 at 08:43:17AM +1200, Ian Collins wrote:
> Ian Collins wrote:
>> Brent Jones wrote:
>>> On Fri, Jul 3, 2009 at 8:31 PM, Ian Collins <i...@ianshome.com> wrote:
>>>
>>>> Ian Collins wrote:
>>>>
>>>>> I was doing an incremental send between pools; the receive side is
>>>>> locked up and no zfs/zpool commands work on that pool.
>>>>>
>>>>> The stacks look different from those reported in the earlier "ZFS
>>>>> snapshot send/recv "hangs" X4540 servers" thread.
>>>>>
>>>>> Here is the process information from scat (other commands hanging on
>>>>> the pool are also in cv_wait):
>>>>>
>>>> Has anyone else seen anything like this? The box wouldn't even
>>>> reboot; it had to be power cycled. It locks up on receive regularly
>>>> now.
>>>
>>> I hit this too:
>>> 6826836
>>>
>>> Fixed in 117
>>>
>>> http://opensolaris.org/jive/thread.jspa?threadID=104852&tstart=120
>>>
>> I don't think this is the same problem (which is why I started a new
>> thread); a single incremental set will eventually lock the pool up,
>> pretty much guaranteed each time.
>>
> One more data point:
>
> This didn't happen when I had a single pool (stripe of mirrors) on the
> server. It started happening when I split the mirrors and created a
> second pool built from 3 8-drive raidz2 vdevs. Sending to the new pool
> (either locally or from another machine) causes the hangs.
And here are my data points:

We were running two X4500s under Nevada 112 but came across this issue on
both of them. On receiving much data through a zfs receive, they would
lock up: any zpool or zfs commands would hang and were unkillable. The
only way to resolve the situation was to reboot without syncing disks. I
reported this in some posts back in April
(http://opensolaris.org/jive/click.jspa?searchID=2021762&messageID=368524).

One of them had an old enough zpool and zfs version to down/up/sidegrade
to Solaris 10 u6, so I made this change. The thumper running Solaris 10 is
now mostly fine - it normally receives an hourly snapshot with no problem.
The thumper running 112 has continued to experience the issues described
by Ian and others.

I've just upgraded to 117 and am having even more issues - I'm unable to
receive or roll back snapshots. Instead I see:

506 r...@thumper1:~> cat snap | zfs receive -vF thumperpool
receiving incremental stream of vlepool/m...@200906182000 into thumperp...@200906182000
cannot receive incremental stream: most recent snapshot of thumperpool
does not match incremental source

511 r...@thumper1:~> zfs rollback -r thumperpool/m...@200906181800
cannot destroy 'thumperpool/m...@200906181900': dataset already exists

As a result, I'm a bit scuppered (a rough sketch of the snapshot checks
this calls for is in the P.S. below). I'm going to try going back to my
112 installation instead to see if that resolves any of my issues.

All of our thumpers have the following disk configuration: 4 x 11-disk
raidz2 arrays with 2 disks as hot spares in a single pool, plus 2 disks in
a mirror for booting. When zpool locks up on the main pool, I'm still able
to get a zpool status on the boot pool, but I can't access any data on the
pool which is locked up.

Andrew

--
Systems Developer

e: andrew.nic...@luns.net.uk
im: a.nic...@jabber.lancs.ac.uk
t: +44 (0)1524 5 10147

Lancaster University Network Services is a limited company registered in
England and Wales. Registered number: 4311892. Registered office:
University House, Lancaster University, Lancaster, LA1 4YW
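P.S. For anyone else who hits the "does not match incremental source"
error above, here is a rough sketch of the snapshot checks that error
calls for. The filesystem and snapshot names ("fs", "older", "newer") are
placeholders, since the real ones are obfuscated in the output above, so
treat this as a sketch rather than our exact commands:

  # On the receiving side, list snapshots oldest-to-newest to see which
  # snapshot the pool really considers most recent:
  zfs list -t snapshot -o name,creation -s creation -r thumperpool

  # Do the same on the sending side, then regenerate the incremental from
  # the newest snapshot that exists on *both* sides:
  zfs send -i vlepool/fs@older vlepool/fs@newer > snap
  cat snap | zfs receive -vF thumperpool

The "dataset already exists" failure from zfs rollback -r may mean one of
the newer snapshots has a dependent dataset (a clone, or the leftovers of
an interrupted receive) that the rollback cannot destroy, though I haven't
confirmed that is what's happening here.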