Hello,

I'd like a sanity check from people more knowledgeable than myself.
I'm managing backups on a production system. Previously I was using another volume manager and filesystem on Solaris, and I've just switched to using ZFS.

My model is -
Production Server A
Test Server B
Mirrored storage arrays (HDS TruCopy if it matters)
Backup software (TSM)

Production server A sees the live volumes.
Test Server B sees the TruCopy mirrors of the live volumes. (It sees the secondary storage array; the production server sees the primary array.)

Production server A shuts down zone C, and exports the zpools for zone C.
Production server A splits the mirror to the secondary storage array, leaving the mirror copy writable.
Production server A re-imports the pools for zone C, and boots zone C.
Test Server B imports the ZFS pools using -R /backup.
Backup software backs up the mounted mirror volumes on Test Server B.
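
For reference, the morning sequence boils down to something like this (a simplified sketch, not the actual script; the zone name zoneC, the pool names, and the TruCopy device group zoneC_dg are placeholders, and the exact pairsplit options depend on the HORCM configuration):

    # On Production Server A, per zone
    zoneadm -z zoneC halt                 # quiesce the zone so its pools are idle
    zpool export zoneC-pool1              # cleanly export both of the zone's pools
    zpool export zoneC-pool2
    pairsplit -g zoneC_dg -rw             # split the TruCopy pair, leaving the S-VOLs writable
    zpool import zoneC-pool1              # bring the pools back in on the primary array
    zpool import zoneC-pool2
    zoneadm -z zoneC boot                 # restart the zone

    # On Test Server B
    zpool import -R /backup zoneC-pool1   # import the split copies under an alternate root
    zpool import -R /backup zoneC-pool2   # TSM then reads the filesystems under /backup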

Later in the day after the backups finish, a script exports the ZFS pools on test server B, and re-establishes the TruCopy mirror between the storage arrays.
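
The evening script on Test Server B amounts to (again a sketch with the same placeholder names):

    # On Test Server B, after TSM finishes
    zpool export zoneC-pool1              # release the split copies
    zpool export zoneC-pool2
    pairresync -g zoneC_dg                # re-establish the TruCopy pair; the S-VOLs go
                                          # read-only again once they are paired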

So... I had this working fine with one zone on server A for a couple of months. This week I added 6 more zones, each with two ZFS pools. The first night went okay. Last night, test server B kernel panicked well after the mirrored zpools were imported, just after the TSM backup started reading all the ZFS pools to push everything to the enterprise backup environment.

Here's the kernel panic message -
Jul  6 03:04:55 riggs ^Mpanic[cpu22]/thread=2a10e81bca0:
Jul  6 03:04:55 riggs unix: [ID 403854 kern.notice] assertion failed: 0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp), file: ../../common/fs/zfs/dmu.c, line: 759
Jul  6 03:04:55 riggs unix: [ID 100000 kern.notice]
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b4f0 genunix:assfail+74 (7af0f8c0, 7af0f910, 2f7, 190d000, 12a1800, 0)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000000000000001 0000000000000001 00000300f20fdf81
Jul  6 03:04:55 riggs %l4-7: 00000000012a1800 0000000000000000 0000000001959400 0000000000000000
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b5a0 zfs:dmu_write+54 (300cbfd5c40, ad, a70, 20, 300b8c02800, 300f83414d0)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 0000000000000038 0000000000000007 000000000194bd40 000000000194bc00
Jul  6 03:04:55 riggs %l4-7: 0000000000000001 0000030071bcb701 0000000000003006 0000000000003000
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b670 zfs:space_map_sync+278 (3009babd130, b, 3009babcfe0, 20, 4, 58)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 0000000000000020 00000300b8c02800 00000300b8c02820 00000300b8c02858
Jul  6 03:04:55 riggs %l4-7: 00007fffffffffff 0000000000007fff 00000000000022d9 0000000000000020
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b760 zfs:metaslab_sync+2b0 (3009babcfc0, 1db7, 300f83414d0, 3009babd408, 300c9724000, 6003e24acc0)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 00000300cbfd5c40 000003009babcff8 000003009babd130 000003009babd2d0
Jul  6 03:04:55 riggs %l4-7: 000003009babcfe0 0000000000000000 000003009babd268 000000000000001a
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b820 zfs:vdev_sync+b8 (6003e24acc0, 1db7, 1db6, 3009babcfc0, 6003e24b000, 17)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 0000000000000090 0000000000000012 000006003e24acc0 00000300c9724000
Jul  6 03:04:55 riggs %l4-7: 0000000000000000 0000000000000000 0000000000000000 00000009041ea000
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b8d0 zfs:spa_sync+484 (300c9724000, 1db7, 3005fec09a8, 300c9724428, 1, 300cbfd5c40)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 00000300c9724280 0000030087c3e940 00000300c6aae700
Jul  6 03:04:55 riggs %l4-7: 0000030080073520 00000300c9724378 00000300c9724300 00000300c9724330
Jul  6 03:04:55 riggs genunix: [ID 723222 kern.notice] 000002a10e81b9a0 zfs:txg_sync_thread+1b8 (30087c3e940, 183f9f0, 707a3130, 0, 2a10e81ba70, 0)
Jul  6 03:04:55 riggs genunix: [ID 179002 kern.notice] %l0-3: 0000000000000000 0000030087c3eb0e 0000030087c3eb08 0000030087c3eb0c
Jul  6 03:04:55 riggs %l4-7: 000000001230fa07 0000030087c3eac8 0000030087c3ead0 0000000000001db7
Jul  6 03:04:55 riggs unix: [ID 100000 kern.notice]

So, I guess my question is: is what I'm doing sane? Or is there something inherent to ZFS that I'm missing that's going to cause this kernel panic to repeat? Best I can guess, it got upset when the pools were being read. I'm wondering if exporting the pools later in the day, before re-syncing the SAN volumes to mirrors, is causing weirdness (since the re-sync makes the mirrored volumes that Test Server B sees read-only until the next split). I wouldn't think so, because the pools are exported before the LUNs go read-only, but I could be wrong.
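
If it helps to see what I mean by the ordering, this is the state I expect on Test Server B just before the re-sync (hypothetical pool/group names as in the sketches above):

    # On Test Server B, just before pairresync
    zpool list zoneC-pool1                # should error out: pool already exported
    pairdisplay -g zoneC_dg               # should still show the pair split (PSUS)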

Anyway, am I off my rocker?  This should work with ZFS, right?

Thanks!
Brian

--
-----------------------------------------------------------------------------------
Brian Wilson, Solaris SE, UW-Madison DoIT
Room 3114 CS&S            608-263-8047
brian.wilson(a)doit.wisc.edu
'I try to save a life a day. Usually it's my own.' - John Crichton
-----------------------------------------------------------------------------------
