> > > Hello Matthew,
> > > Tuesday, September 12, 2006, 7:57:45 PM, you wrote:
> > > MA> Ben Miller wrote:
> > > >> I had a strange ZFS problem this morning.  The entire system would
> > > >> hang when mounting the ZFS filesystems.  After trial and error I
> > > >> determined that the problem was with one of the 2500 ZFS filesystems.
> > > >> When mounting that user's home the system would hang and need to be
> > > >> rebooted.  After I removed the snapshots (9 of them) for that
> > > >> filesystem everything was fine.
> > > >>
> > > >> I don't know how to reproduce this and didn't get a crash dump.  I
> > > >> don't remember seeing anything about this before so I wanted to
> > > >> report it and see if anyone has any ideas.
> > >
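For anyone hitting the same thing: with a couple of thousand filesystems, one
way to narrow down which one causes the hang is to skip "zfs mount -a" and
mount them one at a time, so the last name printed before the box wedges is
the suspect.  A rough sketch - "pool/home" here is just a placeholder for the
real pool/dataset:

  # Mount each filesystem individually; the last name echoed before the
  # hang identifies the problem filesystem.
  zfs list -H -o name -t filesystem -r pool/home | while read fs; do
      echo "mounting $fs"
      zfs mount "$fs"
  done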
> > > MA> Hmm, that sounds pretty bizarre, since I don't think that mounting
> > > MA> a filesystem really interacts with snapshots at all.
> > > MA> Unfortunately, I don't think we'll be able to diagnose this without
> > > MA> a crash dump or reproducibility.  If it happens again, force a crash
> > > MA> dump while the system is hung and we can take a look at it.
> > >
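Judging from the "sync initiated" panic message in the trace below, the dump
was forced from the console: send a break to drop to the OBP ok prompt, then
type "sync" to panic the box and write out the dump.  Roughly, for anyone who
needs to do the same (it's worth checking the dump configuration ahead of
time):

  # dumpadm              (confirm the dump device and savecore directory)
  <send a break: Stop-A on the console keyboard, or ~# over tip>
  ok sync                (panics the machine and writes the crash dump)
  # savecore -v          (after reboot, if savecore doesn't run automatically)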
> > Maybe it wasn't hung after all.  I've seen similar behavior here
> > sometimes.  Were the disks in the pool actually doing any work?
> >
>
> There was lots of activity on the disks (iostat and status LEDs) until it
> got to this one filesystem, and then everything stopped.  'zpool iostat 5'
> stopped running, the shell wouldn't respond, and activity on the disks
> stopped.  This fs is relatively small (175M used of a 512M quota).
>
> > Sometimes it takes a lot of time (30-50 minutes) to mount a file system
> > - it's rare, but it happens.  And while it does, ZFS reads from the
> > disks in the pool.  I did report it here some time ago.
> >
> In my case the system crashed during the evening and was still hung when I
> came in the next morning, so it was hung for a good 9-10 hours.
> 
> The problem happened again last night, but for a different user's
> filesystem.  I took a crash dump with it hung and the back trace looks
> like this:
> > ::status
> debugging crash dump vmcore.0 (64-bit) from hostname
> operating system: 5.11 snv_40 (sun4u)
> panic message: sync initiated
> dump content: kernel pages only
> > ::stack
> 0xf0046a3c(f005a4d8, 2a100047818, 181d010, 18378a8, 1849000, f005a4d8)
> prom_enter_mon+0x24(2, 183c000, 18b7000, 2a100046c61, 1812158, 181b4c8)
> debug_enter+0x110(0, a, a, 180fc00, 0, 183e000)
> abort_seq_softintr+0x8c(180fc00, 18abc00, 180c000, 2a100047d98, 1, 1859800)
> intr_thread+0x170(600019de0e0, 0, 6000d7bfc98, 600019de110, 600019de110, 600019de110)
> zfs_delete_thread_target+8(600019de080, ffffffffffffffff, 0, 600019de080, 6000d791ae8, 60001aed428)
> zfs_delete_thread+0x164(600019de080, 6000d7bfc88, 1, 2a100c4faca, 2a100c4fac8, 600019de0e0)
> thread_start+4(600019de080, 0, 0, 0, 0, 0)
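The output above is from mdb run against the saved dump.  A minimal session,
assuming savecore put the files in the default /var/crash/<hostname>
directory, looks something like:

  # cd /var/crash/hostname
  # mdb unix.0 vmcore.0
  > ::status       (release, panic string, dump contents)
  > ::stack        (stack trace of the panicking thread)
  > ::msgbuf       (console messages leading up to the hang)
  > $q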
> 
> In single user I set the mountpoint for that user to be none and then
> brought the system up fine.  Then I destroyed the snapshots for that user
> and their filesystem mounted fine.  In this case the quota was reached
> with the snapshots counted, and 52% was used without them.
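Spelled out, the workaround amounts to something like this (the dataset and
mountpoint names are placeholders for the affected user's home):

  # Boot single-user ("boot -s" at the ok prompt), then keep the problem
  # filesystem from mounting so the rest of the pool can come up:
  zfs set mountpoint=none pool/home/user

  # Once the system is up, drop its snapshots and put the mountpoint back:
  zfs list -t snapshot -r pool/home/user
  zfs destroy pool/home/user@snapname        # repeat for each snapshot
  zfs set mountpoint=/export/home/user pool/home/user
  zfs mount pool/home/user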
> 
> Ben

Hate to re-open something from a year ago, but we just had this problem happen
again.  We have been running Solaris 10u3 on this system for a while.  I
searched the bug reports, but couldn't find anything on this.  I also think I
understand what happened a little better now.  We take snapshots at noon and
the system hung up during that time.  When trying to reboot, the system would
hang on the ZFS mounts.  After I booted into single user and removed the
snapshot from the filesystem causing the problem, everything was fine.  The
filesystem in question was at 100% of its quota with the snapshots counted.
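A quick way to spot filesystems that are in this state (at or over quota once
snapshot space is counted) before the noon snapshots run - assuming the usual
one-filesystem-per-user layout under something like pool/home:

  # Quota vs. total usage (snapshots included) vs. live data for each home;
  # anything whose "used" has reached its "quota" is a candidate for trouble.
  zfs get -r -H -o name,property,value quota,used,referenced pool/home

  # Per-snapshot breakdown for a suspect filesystem:
  zfs list -t snapshot -r pool/home/user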

Here's the back trace for the system when it was hung:
> ::stack
0xf0046a3c(f005a4d8, 2a10004f828, 0, 181c850, 1848400, f005a4d8)
prom_enter_mon+0x24(0, 0, 183b400, 1, 1812140, 181ae60)
debug_enter+0x118(0, a, a, 180fc00, 0, 183d400)
abort_seq_softintr+0x94(180fc00, 18a9800, 180c000, 2a10004fd98, 1, 1857c00)
intr_thread+0x170(2, 30007b64bc0, 0, c001ed9, 110, 60002400000)
0x985c8(300adca4c40, 0, 0, 0, 0, 30007b64bc0)
dbuf_hold_impl+0x28(60008cd02e8, 0, 0, 0, 7b648d73, 2a105bb57c8)
dbuf_hold_level+0x18(60008cd02e8, 0, 0, 7b648d73, 0, 0)
dmu_tx_check_ioerr+0x20(0, 60008cd02e8, 0, 0, 0, 7b648c00)
dmu_tx_hold_zap+0x84(60011fb2c40, 0, 0, 0, 30049b58008, 400)
zfs_rmnode+0xc8(3002410d210, 2a105bb5cc0, 0, 60011fb2c40, 30007b3ff58, 30007b56ac0)
zfs_delete_thread+0x168(30007b56ac0, 3002410d210, 600009a4778, 30007b56b28, 2a105bb5aca, 2a105bb5ac8)
thread_start+4(30007b56ac0, 0, 0, 489a4800000000, d83a10bf28, 5000000000386)

Has this been fixed in more recent code?  I can make the crash dump available.

Ben
 
 