On Fri, Jun 5, 2009 at 4:20 PM, Tim Haley <tim.ha...@sun.com> wrote:
> Brent Jones wrote:
>>
>> Hello all,
>> I had been running snv_106 for about 3 or 4 months on a pair of X4540's.
>> I would ship snapshots from the primary server to the secondary server
>> nightly, which was working really well.
>>
>> However, I have upgraded to 2009.06, and my replication scripts appear
>> to "hang" when performing zfs send/recv.
>> When one zfs send/recv process hangs, you cannot send any other
>> snapshots from any other filesystem to the remote host.
>> I have about 20 file systems that I snapshot and replicate nightly.
>>
>> The script I use to perform the snapshots is here:
>> http://www.brentrjones.com/wp-content/uploads/2009/03/replicate.ksh
>>
>> On the remote side, I end up with many "hung" processes, like this:
>>
>>  bjones 11676 11661   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>  bjones 11673 11660   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>  bjones 11664 11653   0 01:30:03 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>  bjones 13727 13722   0 14:21:20 ?           0:00 /sbin/zfs recv -vFd pdxfilu02
>>
>> And so on, one for each file system.
>>
>> On the receiving end, 'zfs list' shows one filesystem attempting to
>> receive a snapshot, but I cannot stop it:
>>
>> $ zfs list
>> NAME                                       USED  AVAIL  REFER  MOUNTPOINT
>> pdxfilu02/data/fs01/%20090605-00:30:00  1.74G  27.2T   208G  /pdxfilu02/data/fs01/%20090605-00:30:00
>>
>>
>>
>> On the sending side, I CAN kill the ZFS send process, but the remote
>> side leaves its processes going, and I CANNOT kill -9 them. I also
>> cannot reboot the receiving system: at init 6, the system just
>> hangs trying to unmount the file systems.
>> I have to physically cut power to the server, but a couple days later,
>> this issue will occur again.
>>
>>
> A crash dump from the receiving server with the stuck receives would be
> highly useful, if you can get it.  Running 'reboot -d' would be best, but it
> might just hang. You can try 'savecore -L'.
>
> -tim
>
>> If I boot to my snv_106 BE, everything works fine; this issue has
>> never occurred on that version.
>>
>> Any thoughts?
>>
>
>

Well, I think I found a specific file system that is causing this.
I kicked off a zpool scrub to see if there might be corruption on
either end, but that takes well over 40 hours on these servers.
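
For anyone wanting to check a receiving host for the same symptom, the
stuck receivers in the listing above can be picked out of `ps -ef` output
with something like the following. This is only a minimal sketch:
`list_zfs_recv` is a hypothetical helper name, not part of ZFS or of the
replicate.ksh script, and it assumes the PID is the second column of the
`ps -ef` output, as in the listing quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef`-style lines on stdin and print the PID
# of every process whose command is `zfs recv`.
# Assumption: the PID is the second whitespace-separated field.
# The command is located by scanning fields rather than by a fixed column,
# because the start-time column can contain a space for older processes.
list_zfs_recv() {
    awk '{
        for (i = 1; i < NF; i++) {
            # Match "zfs" or any path ending in "/zfs", followed by "recv".
            if ($i ~ /(^|\/)zfs$/ && $(i + 1) == "recv") {
                print $2
                break
            }
        }
    }'
}

# Usage on the receiving host (not run here):
#   ps -ef | list_zfs_recv
```

This only lists the processes; as noted above, the hung receivers here
did not respond to kill -9.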


-- 
Brent Jones
br...@servuhome.net
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
