Hi!
I have always wanted to know what the message "Detected action XXX from a
different transition ..." really means.
Does it indicate a programming error in the cluster stack? To me it sounds as
if at least two parties are trying to control the same thing without agreeing
on what is to be done...
Regards,
Ulrich
>>> Mark Nipper <[email protected]> wrote on 05.08.2013 at 19:34 in
message
<[email protected]>:
> One of our DRBD clusters publishes 47 LUNs.
> We're using RHEL 6.4. Here are the package versions
> in use:
> ---
> pacemaker-1.1.7-6.el6.x86_64
> corosync-1.4.1-7.el6.x86_64
> resource-agents-3.9.2-12.el6.x86_64
> scsi-target-utils-1.0.24-2.el6.x86_64
>
> Somewhere past 40 LUNs we started seeing monitor
> failures on the LUNs most recently added to the cluster. Things
> like:
> ---
> Jul 26 23:47:39 [8557] stor01a crmd: info: process_lrm_event: LRM operation lun47_monitor_10000 (call=357, rc=7, cib-update=6790, confirmed=false) not running
> Jul 26 23:47:39 [8557] stor01a crmd: info: process_graph_event: Detected action lun47_monitor_10000 from a different transition: 5737 vs. 5793
> Jul 26 23:47:39 [8557] stor01a crmd: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=lun47_last_failure_0, magic=0:7;192:5737:0:e16c8e9d-87ed-4132-a3b2-724a30b6cc73, cib=0.111.47) : Old event
> Jul 26 23:47:39 [8557] stor01a crmd: warning: update_failcount: Updating failcount for lun47 on stor01a after failed monitor: rc=7 (update=value++, time=1374900459)
> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-lun47 (1)
> Jul 26 23:47:39 [8557] stor01a crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_perform_update: Sent update 438: fail-count-lun47=1
> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-lun47 (1374900459)
> Jul 26 23:47:39 [8557] stor01a crmd: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-stor01a-fail-count-lun47, name=fail-count-lun47, value=1, magic=NA, cib=0.111.48) : Transient attribute: update
> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_perform_update: Sent update 441: last-failure-lun47=1374900459
> ---
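
A small sketch for anyone who wants to pull the transition number out of
such log lines. In the excerpt above, the 5737 from the "different
transition" message also shows up as the second colon-separated field after
the semicolon in the magic value. The function name magic_transition_id is
made up for illustration, and the assumption that this position always holds
the transition number rests only on this one excerpt.

```shell
#!/bin/sh
# Hypothetical helper: print the transition number embedded in a crmd
# "magic" value such as 0:7;192:5737:0:<uuid>.
# Assumption (based only on the log excerpt in this thread): the second
# colon-separated field after the ';' is the transition number.
magic_transition_id() {
    key=${1#*;}         # drop everything up to and including the ';'
    rest=${key#*:}      # drop the first field after the ';'
    echo "${rest%%:*}"  # keep only the next field
}
```

Called as magic_transition_id '0:7;192:5737:0:e16c8e9d-87ed-4132-a3b2-724a30b6cc73'
it prints 5737, which can then be compared with the second number in the
"5737 vs. 5793" message.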
>
> So I decided to modify the resource agent as follows:
> ---
> --- iSCSILogicalUnit.orig 2013-08-05 12:15:03.185879119 -0500
> +++ iSCSILogicalUnit 2013-08-01 11:31:24.768133374 -0500
> @@ -305,12 +305,28 @@
> if [ -z "$TID" ]; then
> # Our target is not configured, thus we're not
> # running.
> + echo "$(date) TID not found: ${TID}." >> /var/log/iscsi-ra.log
> return $OCF_NOT_RUNNING
> fi
> # This only looks for the backing store, but does not test
> # for the correct target ID and LUN.
> - tgtadm --lld iscsi --op show --mode target \
> + tgt_output=$(tgtadm --lld iscsi --op show --mode target)
> + echo "$tgt_output" \
> | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
> + echo "$(date) first LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
> + echo "$tgt_output" >> /var/log/iscsi-ra.log
> + sleep 1
> + tgt_output=$(tgtadm --lld iscsi --op show --mode target)
> + echo "$tgt_output" \
> + | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
> + echo "$(date) second LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
> + echo "$tgt_output" >> /var/log/iscsi-ra.log
> + sleep 1
> + tgt_output=$(tgtadm --lld iscsi --op show --mode target)
> + echo "$tgt_output" \
> + | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
> + echo "$(date) third LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
> + echo "$tgt_output" >> /var/log/iscsi-ra.log
> ;;
> lio)
> configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_${OCF_RESKEY_lun}/${OCF_RESOURCE_INSTANCE}/udev_path"
> ---
>
> Over the weekend the extra logging caught a failure, but only
> the first check failed. The output from iscsi-ra.log:
> ---
> Sun Aug 4 10:54:41 CDT 2013 first LUN failure: /dev/stor01/vm-www01
> Target 1: iqn.2013-04.net.bitgnome:vh-storage01
> System information:
> Driver: iscsi
> State: ready
> I_T nexus information:
> I_T nexus: 17
> Initiator: iqn.1994-05.com.redhat:b8998f3aaa11
> Connection: 0
> IP Address: 172.16.165.18
> I_T nexus: 18
> Initiator: iqn.1994-05.com.redhat:36ad8852a96d
> Connection: 0
> IP Address: 172.16.165.19
> I_T nexus: 19
> Initiator: iqn.1994-05.com.redhat:28d6b194ab
> Connection: 0
> IP Address: 172.16.165.20
> I_T nexus: 20
> Initiator: iqn.1994-05.com.redhat:bc9afc47c4
> Connection: 0
> IP Address: 172.16.165.21
> LUN information:
> LUN: 0
> Type: controller
> SCSI ID: IET 00010000
> SCSI SN: beaf10
> Size: 0 MB, Block size: 1
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: null
> Backing store path: None
> Backing store flags:
> LUN: 1
> Type: disk
> SCSI ID: lun1
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-ldap1
> Backing store flags:
> LUN: 2
> Type: disk
> SCSI ID: lun2
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-arcgis
> Backing store flags:
> LUN: 3
> Type: disk
> SCSI ID: lun3
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-mail1
> Backing store flags:
> LUN: 4
> Type: disk
> SCSI ID: lun4
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-mail2
> Backing store flags:
> LUN: 5
> Type: disk
> SCSI ID: lun5
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-wp2
> Backing store flags:
> LUN: 6
> Type: disk
> SCSI ID: lun6
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-ldap-slave1
> Backing store flags:
> LUN: 7
> Type: disk
> SCSI ID: lun7
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-ldap-slave2
> Backing store flags:
> LUN: 8
> Type: disk
> SCSI ID: lun8
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-ldap-slave3
> Backing store flags:
> LUN: 9
> Type: disk
> SCSI ID: lun9
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-wp1
> Backing store flags:
> LUN: 10
> Type: disk
> SCSI ID: lun10
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-support
> Backing store flags:
> LUN: 11
> Type: disk
> SCSI ID: lun11
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-cache1
> Backing store flags:
> LUN: 12
> Type: disk
> SCSI ID: lun12
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-cache2
> Backing store flags:
> LUN: 13
> Type: disk
> SCSI ID: lun13
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-proxy
> Backing store flags:
> LUN: 14
> Type: disk
> SCSI ID: lun14
> SCSI SN: (stdin)=
> Size: 53687 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-pcspine
> Backing store flags:
> LUN: 15
> Type: disk
> SCSI ID: lun15
> SCSI SN: (stdin)=
> Size: 53687 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-print
> Backing store flags:
> LUN: 16
> Type: disk
> SCSI ID: lun16
> SCSI SN: (stdin)=
> Size: 53687 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-ad
> Backing store flags:
> LUN: 17
> Type: disk
> SCSI ID: lun17
> SCSI SN: (stdin)=
> Size: 53687 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-pcbrain
> Backing store flags:
> LUN: 18
> Type: disk
> SCSI ID: lun18
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-xmpp
> Backing store flags:
> LUN: 19
> Type: disk
> SCSI ID: lun19
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-pma
> Backing store flags:
> LUN: 20
> Type: disk
> SCSI ID: lun20
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-cake
> Backing store flags:
> LUN: 21
> Type: disk
> SCSI ID: lun21
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-ica-file
> Backing store flags:
> LUN: 22
> Type: disk
> SCSI ID: lun22
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-liwc
> Backing store flags:
> LUN: 23
> Type: disk
> SCSI ID: lun23
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-lasso
> Backing store flags:
> LUN: 24
> Type: disk
> SCSI ID: lun24
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-qt
> Backing store flags:
> LUN: 25
> Type: disk
> SCSI ID: lun25
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-public
> Backing store flags:
> LUN: 26
> Type: disk
> SCSI ID: lun26
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-source
> Backing store flags:
> LUN: 27
> Type: disk
> SCSI ID: lun27
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-gmc
> Backing store flags:
> LUN: 28
> Type: disk
> SCSI ID: lun28
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-solr
> Backing store flags:
> LUN: 29
> Type: disk
> SCSI ID: lun29
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-license
> Backing store flags:
> LUN: 30
> Type: disk
> SCSI ID: lun30
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-media
> Backing store flags:
> LUN: 31
> Type: disk
> SCSI ID: lun31
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-opera
> Backing store flags:
> LUN: 32
> Type: disk
> SCSI ID: lun32
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-asl
> Backing store flags:
> LUN: 33
> Type: disk
> SCSI ID: lun33
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-daseupload
> Backing store flags:
> LUN: 34
> Type: disk
> SCSI ID: lun34
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-arcsde
> Backing store flags:
> LUN: 35
> Type: disk
> SCSI ID: lun35
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-switchwitch
> Backing store flags:
> LUN: 36
> Type: disk
> SCSI ID: lun36
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-matlab
> Backing store flags:
> LUN: 37
> Type: disk
> SCSI ID: lun37
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-spintx
> Backing store flags:
> LUN: 38
> Type: disk
> SCSI ID: lun38
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-atlassian
> Backing store flags:
> LUN: 39
> Type: disk
> SCSI ID: lun39
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-test3
> Backing store flags:
> LUN: 40
> Type: disk
> SCSI ID: lun40
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-nfs
> Backing store flags:
> LUN: 41
> Type: disk
> SCSI ID: lun41
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-test4
> Backing store flags:
> LUN: 42
> Type: disk
> SCSI ID: lun42
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-bamboo
> Backing store flags:
> LUN: 43
> Type: disk
> SCSI ID: lun43
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-wowza-test
> Backing store flags:
> LUN: 44
> Type: disk
> SCSI ID: lun44
> SCSI SN: (stdin)=
> Size: 53687 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-abman-dev
> Backing store flags:
> LUN: 45
> Type: disk
> SCSI ID: lun45
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-workflow
> Backing store flags:
> LUN: 46
> Type: disk
> SCSI ID: lun46
> SCSI SN: (stdin)=
> Size: 10737 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent removal: No
> Readonly: No
> Backing store type: rdwr
> Backing store path: /dev/stor01/vm-psyimage
> Backing store flags:
> LUN: 47
> Type: disk
> SCSI ID: lun47
> SCSI SN: (stdin)=
> Size: 21475 MB, Block size: 512
> Online: Yes
> Removable media: No
> Prevent
> ---
>
> So the first tgtadm call clearly returned incomplete output,
> and the second call retrieved all the information before the
> agent would have returned a return code of 7. I found a 2008
> discussion of tgtd crashing with more than 40 LUNs:
> ---
> http://lists.wpkg.org/pipermail/stgt/2008-December/002528.html
>
> But I couldn't find anything else related to this problem
> specifically.
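
One way to tell a truncated listing from a complete one, sketched under the
assumption that every LUN block in the tgtadm output ends with a "Backing
store flags:" line, as it does for LUNs 0 to 46 in the log above. The
function name tgt_output_complete is invented for this example.

```shell
#!/bin/sh
# Hypothetical check: read a "tgtadm --op show --mode target" listing on
# stdin and succeed only if the number of "LUN: <n>" lines equals the
# number of "Backing store flags:" lines.
# Assumption: a complete LUN block always ends with "Backing store flags:".
tgt_output_complete() {
    awk '
        /^[ \t]*LUN: [0-9]+/            { luns++ }
        /^[ \t]*Backing store flags:/   { flags++ }
        END {
            if (luns > 0 && luns == flags) exit 0
            exit 1
        }
    '
}
```

With the listing logged above, LUN 47 has no "Backing store flags:" line, so
the counts differ and the function returns non-zero. A monitor could use that
to distinguish "output was cut short" from "LUN really missing".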
>
> Has anyone else seen weirdness like this from tgtd? I
> assume the "easy" answer is to switch to a newer distribution with
> LIO, or just to keep the multiple checks in place to work around the
> problem.
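
If the multiple checks stay in place, the three copied blocks in the patch
could be folded into one loop. This is only a sketch, not a tested change to
the resource agent: the function name lun_is_exported, its arguments and the
ISCSI_RA_LOG variable are made up here, while the tgtadm call, the grep
pattern and the log file path are taken from the patch.

```shell
#!/bin/sh
# Hypothetical helper: look for a backing store in the tgtadm listing,
# retrying a few times before giving up.
#   $1 = backing store path (e.g. "$OCF_RESKEY_path")
#   $2 = number of attempts (default 3)
#   $3 = delay between attempts in seconds (default 1)
# ISCSI_RA_LOG is an invented override for the log file used in the patch.
lun_is_exported() {
    path=$1
    attempts=${2:-3}
    delay=${3:-1}
    log=${ISCSI_RA_LOG:-/var/log/iscsi-ra.log}
    i=1
    while [ "$i" -le "$attempts" ]; do
        tgt_output=$(tgtadm --lld iscsi --op show --mode target)
        # Same pattern as in the patch above.
        if printf '%s\n' "$tgt_output" \
            | grep -E -q "[[:space:]]+Backing store.*: ${path}"; then
            return 0
        fi
        echo "$(date) LUN check $i/$attempts failed: ${path}" >> "$log"
        printf '%s\n' "$tgt_output" >> "$log"
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

In the monitor branch this would be used roughly as
lun_is_exported "$OCF_RESKEY_path" && return $OCF_SUCCESS, which keeps the
behaviour of the patch (up to three tries, one second apart) in one place.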
>
> --
> Mark Nipper
> [email protected] (XMPP)
> +1 979 575 3193
> -
> In theory there is no difference between theory and practice. In
> practice there is.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems