> Hello Michael,
>
> On 24.1.2007 at 14:36, Michael Schuster wrote:
>
>>> --------------------------------------------------------------
>>> [EMAIL PROTECTED] # zpool status
>>>   pool: pool0
>>>  state: ONLINE
>>>  scrub: none requested
>>> config:
>>
>> [...]
>>
>>> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
>>> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to lack of DiskSuite state
>>> Jan 23 18:51:38 newponit  database replicas. Fewer than 50% of the total were available,
>>> Jan 23 18:51:38 newponit  so panic to ensure data integrity.
>>
>> this message shows (and the rest of the stack proves) that your panic
>> happened in SVM. It has NOTHING to do with ZFS. So either you pulled the
>> wrong disk, or the disk you pulled also contained SVM volumes (next to
>> ZFS).
>
> I noticed that the panic was in SVM and I'm wondering why the machine
> was hanging. SVM is only running on the internal disks (c0) and I pulled
> a disk from the D1000:

   So the device that was affected had nothing to do with SVM at all.

   Fine ... I have the exact same config here: internal SVM and then
  external ZFS on two disk arrays on two controllers.
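
  Just to rule out any overlap between the two, it is probably worth
confirming exactly which slices hold the SVM state database replicas and
which devices back pool0 .. something along these lines (plain Solaris
commands, adjust the names to the real config):

    metadb -i            # replica locations and status flags
    metastat -p          # devices the SVM metadevices are built on
    zpool status pool0   # devices ZFS is actually using

  If the pulled D1000 disk shows up anywhere in the first two, the quorum
panic makes sense; if it only shows up in the last one, then something else
took out the replicas.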

> Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd50):
> Jan 23 17:24:14 newponit      SCSI transport failed: reason 'incomplete': retrying command
> Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd50):
> Jan 23 17:24:14 newponit      disk not responding to selection
> Jan 23 17:24:18 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd50):
> Jan 23 17:24:18 newponit      disk not responding to selection
>
> This is clearly the disk with ZFS on it: SVM has nothing to do with this
> disk. A minute later, the troubles started with the internal disks:

  Okay .. so are we back to looking at ZFS, or at ZFS and the SVM
components, or at some interaction between these kernel modules?  At this
point I have to be careful not to fall into a pit of blind ignorance as I
grope for the answer.  Perhaps some data would help.  Was there a crash
dump in /var/crash/newponit ?
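
  If there is a dump there, the panic stack and the message buffer just
before the panic would tell us a lot.  Roughly this, assuming savecore
wrote the usual unix.N/vmcore.N pair (substitute the real N):

    cd /var/crash/newponit
    ls -l                     # look for unix.N / vmcore.N
    mdb unix.0 vmcore.0
    > ::status                # panic string and dump details
    > ::stack                 # stack of the panicking thread
    > ::msgbuf                # console messages leading up to the panic
    > $q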

> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
> Jan 23 17:25:26 newponit      Cmd (0x60000a3ed10) dump for Target 0 Lun 0:
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
> Jan 23 17:25:26 newponit              cdb=[ 0x28 0x0 0x0 0x78 0x6 0x30 0x0 0x0 0x10 0x0 ]
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
> Jan 23 17:25:26 newponit      pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
> Jan 23 17:25:26 newponit      pkt_scbp=0x0 cmd_flags=0x860
> Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
> Jan 23 17:25:26 newponit      Disconnected tagged cmd(s) (1) timeout for Target 0.0

   So there's a pile of SCSI noise above .. one would expect that from a
 suddenly missing SCSI device.
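
  It would also help to pin down which physical disk that sd50 and the
glm0 Target 0 really are.  A rough way to do that, just eyeballing the
output rather than anything clever:

    grep ' 50 "sd"' /etc/path_to_inst   # physical path for instance sd50
    ls -l /dev/dsk | grep 'sd@'         # cNtNdN names -> physical paths
    iostat -En                          # per-device error counters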

> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: Disconnected tagged cmd(s) (1) timeout for Target 0.0

  NCR SCSI controllers .. what OS revision is this?  Solaris 10 U3, or
Solaris Nevada snv_55b?
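
  In case it helps to pin that down exactly, the output of these two would
do it:

    cat /etc/release
    uname -a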

> Jan 23 17:25:26 newponit glm: [ID 401478 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6018]
> Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING: /[EMAIL PROTECTED],4000/[EMAIL PROTECTED] (glm0):
> Jan 23 17:25:26 newponit      got SCSI bus reset
> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
>
> SVM and ZFS disks are on separate SCSI buses, so theoretically there
> should be no impact on the SVM disks when I pull out a ZFS disk.

  I still feel that you hit a bug in ZFS somewhere.  Under no circumstances
should a Solaris server panic and crash simply because you pulled out a
single disk that was totally mirrored.  In fact .. I will reproduce those
conditions here and then see what happens for me.
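
  Roughly what I have in mind, with placeholder device names (the cXtYdZ
names obviously need to match the real hardware here):

    # state database replicas on the internal disks only, as on newponit
    metadb -a -f -c 3 c0t0d0s7
    metadb -a -c 3 c0t1d0s7

    # a mirrored pool across the two external arrays / controllers
    zpool create pool0 mirror c1t0d0 c2t0d0

    # then pull one side of the mirror and watch what happens
    zpool status -x
    tail -f /var/adm/messages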

Dennis

