One of our DRBD clusters publishes 47 LUNs. We're running
RHEL 6.4 with the following package versions:
---
pacemaker-1.1.7-6.el6.x86_64
corosync-1.4.1-7.el6.x86_64
resource-agents-3.9.2-12.el6.x86_64
scsi-target-utils-1.0.24-2.el6.x86_64
---
Somewhere past 40 LUNs we started seeing monitor failures
on the LUNs most recently added to the cluster. Things
like:
---
Jul 26 23:47:39 [8557] stor01a crmd: info: process_lrm_event:
LRM operation lun47_monitor_10000 (call=357, rc=7, cib-update=6790,
confirmed=false) not running
Jul 26 23:47:39 [8557] stor01a crmd: info: process_graph_event:
Detected action lun47_monitor_10000 from a different transition: 5737 vs. 5793
Jul 26 23:47:39 [8557] stor01a crmd: info: abort_transition_graph:
process_graph_event:476 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=lun47_last_failure_0,
magic=0:7;192:5737:0:e16c8e9d-87ed-4132-a3b2-724a30b6cc73, cib=0.111.47) : Old
event
Jul 26 23:47:39 [8557] stor01a crmd: warning: update_failcount:
Updating failcount for lun47 on stor01a after failed monitor: rc=7
(update=value++, time=1374900459)
Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-lun47 (1)
Jul 26 23:47:39 [8557] stor01a crmd: notice: do_state_transition:
State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_perform_update:
Sent update 438: fail-count-lun47=1
Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-lun47 (1374900459)
Jul 26 23:47:39 [8557] stor01a crmd: info: abort_transition_graph:
te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair,
id=status-stor01a-fail-count-lun47, name=fail-count-lun47, value=1, magic=NA,
cib=0.111.48) : Transient attribute: update
Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_perform_update:
Sent update 441: last-failure-lun47=1374900459
---
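To see which LUNs are affected and how often, the rc=7 monitor
results can be tallied per resource from the cluster log. This is
only a rough sketch: the function name count_monitor_failures and the
log path in the usage note are my own assumptions, and it expects
each log entry on a single line (the excerpt above is wrapped by the
mail client).

```shell
#!/bin/sh
# Hypothetical helper (not part of pacemaker): tally "not running" (rc=7)
# monitor results per resource from the crmd log file given as $1.
# Assumes each log entry sits on one line, unlike the wrapped excerpt above.
count_monitor_failures() {
    # Pull the resource name out of "LRM operation <rsc>_monitor_<ms> (call=N, rc=7, ..."
    sed -n 's/.*LRM operation \([^ ]*\)_monitor_[0-9]* (call=[0-9]*, rc=7,.*/\1/p' "$1" \
        | sort | uniq -c | sort -rn \
        | awk '{ print $2, $1 }'    # print "<resource> <count>", most failures first
}
```

Usage would be something like
"count_monitor_failures /var/log/cluster/corosync.log", where the path
depends on how logging is configured on the node.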
So I decided to modify the resource agent as follows:
---
--- iSCSILogicalUnit.orig 2013-08-05 12:15:03.185879119 -0500
+++ iSCSILogicalUnit 2013-08-01 11:31:24.768133374 -0500
@@ -305,12 +305,28 @@
if [ -z "$TID" ]; then
# Our target is not configured, thus we're not
# running.
+ echo "$(date) TID not found: ${TID}." >> /var/log/iscsi-ra.log
return $OCF_NOT_RUNNING
fi
# This only looks for the backing store, but does not test
# for the correct target ID and LUN.
- tgtadm --lld iscsi --op show --mode target \
+ tgt_output=$(tgtadm --lld iscsi --op show --mode target)
+ echo "$tgt_output" \
| grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
+ echo "$(date) first LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
+ echo "$tgt_output" >> /var/log/iscsi-ra.log
+ sleep 1
+ tgt_output=$(tgtadm --lld iscsi --op show --mode target)
+ echo "$tgt_output" \
+ | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
+ echo "$(date) second LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
+ echo "$tgt_output" >> /var/log/iscsi-ra.log
+ sleep 1
+ tgt_output=$(tgtadm --lld iscsi --op show --mode target)
+ echo "$tgt_output" \
+ | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
+ echo "$(date) third LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
+ echo "$tgt_output" >> /var/log/iscsi-ra.log
;;
lio)
configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_${OCF_RESKEY_lun}/${OCF_RESOURCE_INSTANCE}/udev_path"
---
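The patch above repeats the same check, log and sleep three times.
The same logic could be written as a loop, roughly as sketched below.
This is untested against a real tgtd; the function name and the
RA_DEBUG_LOG, RA_RETRY_SLEEP and TGT_SHOW_CMD variables are
hypothetical names of mine, not part of the stock iSCSILogicalUnit
agent. The grep pattern is the one from the patch.

```shell
#!/bin/sh
# Sketch only: the check / log / sleep / retry logic of the patch as a loop.
# All names below are hypothetical, not from the stock resource agent.
: "${RA_DEBUG_LOG:=/var/log/iscsi-ra.log}"   # debug log used in the patch
: "${RA_RETRY_SLEEP:=1}"                     # seconds to wait between attempts
: "${TGT_SHOW_CMD:=tgtadm --lld iscsi --op show --mode target}"

# $1 = backing store path, $2 = maximum number of attempts (default 3).
# Returns 0 as soon as one attempt finds the path, 1 if every attempt fails.
lun_backing_store_present() {
    _path=$1
    _max=${2:-3}
    _try=1
    while [ "$_try" -le "$_max" ]; do
        _out=$($TGT_SHOW_CMD)
        if printf '%s\n' "$_out" \
            | grep -E -q "[[:space:]]+Backing store.*: ${_path}"; then
            return 0
        fi
        # Keep the evidence, as the patch does, before trying again.
        {
            echo "$(date) LUN check $_try of $_max failed: ${_path}"
            printf '%s\n' "$_out"
        } >> "$RA_DEBUG_LOG"
        _try=$((_try + 1))
        if [ "$_try" -le "$_max" ]; then
            sleep "$RA_RETRY_SLEEP"
        fi
    done
    return 1
}
```

Inside the monitor action this would be called as
lun_backing_store_present "$OCF_RESKEY_path" 3 && return $OCF_SUCCESS,
leaving the existing fall-through to the not-running result in place.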
Over the weekend this caught a failure, but only the first
check failed. The output from iscsi-ra.log:
---
Sun Aug 4 10:54:41 CDT 2013 first LUN failure: /dev/stor01/vm-www01
Target 1: iqn.2013-04.net.bitgnome:vh-storage01
System information:
Driver: iscsi
State: ready
I_T nexus information:
I_T nexus: 17
Initiator: iqn.1994-05.com.redhat:b8998f3aaa11
Connection: 0
IP Address: 172.16.165.18
I_T nexus: 18
Initiator: iqn.1994-05.com.redhat:36ad8852a96d
Connection: 0
IP Address: 172.16.165.19
I_T nexus: 19
Initiator: iqn.1994-05.com.redhat:28d6b194ab
Connection: 0
IP Address: 172.16.165.20
I_T nexus: 20
Initiator: iqn.1994-05.com.redhat:bc9afc47c4
Connection: 0
IP Address: 172.16.165.21
LUN information:
LUN: 0
Type: controller
SCSI ID: IET 00010000
SCSI SN: beaf10
Size: 0 MB, Block size: 1
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: null
Backing store path: None
Backing store flags:
LUN: 1
Type: disk
SCSI ID: lun1
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-ldap1
Backing store flags:
LUN: 2
Type: disk
SCSI ID: lun2
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-arcgis
Backing store flags:
LUN: 3
Type: disk
SCSI ID: lun3
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-mail1
Backing store flags:
LUN: 4
Type: disk
SCSI ID: lun4
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-mail2
Backing store flags:
LUN: 5
Type: disk
SCSI ID: lun5
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-wp2
Backing store flags:
LUN: 6
Type: disk
SCSI ID: lun6
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-ldap-slave1
Backing store flags:
LUN: 7
Type: disk
SCSI ID: lun7
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-ldap-slave2
Backing store flags:
LUN: 8
Type: disk
SCSI ID: lun8
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-ldap-slave3
Backing store flags:
LUN: 9
Type: disk
SCSI ID: lun9
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-wp1
Backing store flags:
LUN: 10
Type: disk
SCSI ID: lun10
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-support
Backing store flags:
LUN: 11
Type: disk
SCSI ID: lun11
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-cache1
Backing store flags:
LUN: 12
Type: disk
SCSI ID: lun12
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-cache2
Backing store flags:
LUN: 13
Type: disk
SCSI ID: lun13
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-proxy
Backing store flags:
LUN: 14
Type: disk
SCSI ID: lun14
SCSI SN: (stdin)=
Size: 53687 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-pcspine
Backing store flags:
LUN: 15
Type: disk
SCSI ID: lun15
SCSI SN: (stdin)=
Size: 53687 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-print
Backing store flags:
LUN: 16
Type: disk
SCSI ID: lun16
SCSI SN: (stdin)=
Size: 53687 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-ad
Backing store flags:
LUN: 17
Type: disk
SCSI ID: lun17
SCSI SN: (stdin)=
Size: 53687 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-pcbrain
Backing store flags:
LUN: 18
Type: disk
SCSI ID: lun18
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-xmpp
Backing store flags:
LUN: 19
Type: disk
SCSI ID: lun19
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-pma
Backing store flags:
LUN: 20
Type: disk
SCSI ID: lun20
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-cake
Backing store flags:
LUN: 21
Type: disk
SCSI ID: lun21
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-ica-file
Backing store flags:
LUN: 22
Type: disk
SCSI ID: lun22
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-liwc
Backing store flags:
LUN: 23
Type: disk
SCSI ID: lun23
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-lasso
Backing store flags:
LUN: 24
Type: disk
SCSI ID: lun24
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-qt
Backing store flags:
LUN: 25
Type: disk
SCSI ID: lun25
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-public
Backing store flags:
LUN: 26
Type: disk
SCSI ID: lun26
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-source
Backing store flags:
LUN: 27
Type: disk
SCSI ID: lun27
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-gmc
Backing store flags:
LUN: 28
Type: disk
SCSI ID: lun28
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-solr
Backing store flags:
LUN: 29
Type: disk
SCSI ID: lun29
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-license
Backing store flags:
LUN: 30
Type: disk
SCSI ID: lun30
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-media
Backing store flags:
LUN: 31
Type: disk
SCSI ID: lun31
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-opera
Backing store flags:
LUN: 32
Type: disk
SCSI ID: lun32
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-asl
Backing store flags:
LUN: 33
Type: disk
SCSI ID: lun33
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-daseupload
Backing store flags:
LUN: 34
Type: disk
SCSI ID: lun34
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-arcsde
Backing store flags:
LUN: 35
Type: disk
SCSI ID: lun35
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-switchwitch
Backing store flags:
LUN: 36
Type: disk
SCSI ID: lun36
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-matlab
Backing store flags:
LUN: 37
Type: disk
SCSI ID: lun37
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-spintx
Backing store flags:
LUN: 38
Type: disk
SCSI ID: lun38
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-atlassian
Backing store flags:
LUN: 39
Type: disk
SCSI ID: lun39
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-test3
Backing store flags:
LUN: 40
Type: disk
SCSI ID: lun40
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-nfs
Backing store flags:
LUN: 41
Type: disk
SCSI ID: lun41
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-test4
Backing store flags:
LUN: 42
Type: disk
SCSI ID: lun42
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-bamboo
Backing store flags:
LUN: 43
Type: disk
SCSI ID: lun43
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-wowza-test
Backing store flags:
LUN: 44
Type: disk
SCSI ID: lun44
SCSI SN: (stdin)=
Size: 53687 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-abman-dev
Backing store flags:
LUN: 45
Type: disk
SCSI ID: lun45
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-workflow
Backing store flags:
LUN: 46
Type: disk
SCSI ID: lun46
SCSI SN: (stdin)=
Size: 10737 MB, Block size: 512
Online: Yes
Removable media: No
Prevent removal: No
Readonly: No
Backing store type: rdwr
Backing store path: /dev/stor01/vm-psyimage
Backing store flags:
LUN: 47
Type: disk
SCSI ID: lun47
SCSI SN: (stdin)=
Size: 21475 MB, Block size: 512
Online: Yes
Removable media: No
Prevent
---
So the first call clearly got incomplete output from tgtadm
(the dump stops partway through the LUN 47 entry), and the second
call retrieved everything, so the check succeeded instead of
returning a return code of 7. I found a discussion from 2008 of
tgtd crashing with more than 40 LUNs:
---
http://lists.wpkg.org/pipermail/stgt/2008-December/002528.html
---
But I couldn't find anything else related to this problem
specifically.
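Since the failed dump stops partway through the LUN 47 entry, one
way to tell a truncated listing from a LUN that is really missing
would be to check that every "LUN:" entry is followed by its closing
"Backing store flags:" line. A minimal sketch, assuming the output
layout shown in the log above; tgt_output_complete is a made-up name
and I have not tried this inside the agent:

```shell
#!/bin/sh
# Hypothetical check: read "tgtadm ... --op show --mode target" output on
# stdin and succeed only if every "LUN: N" entry has a matching
# "Backing store flags:" line, i.e. the listing was not cut short.
tgt_output_complete() {
    awk '
        /^[ \t]*LUN: [0-9]+/           { luns++ }
        /^[ \t]*Backing store flags:/  { flags++ }
        END { exit !(luns > 0 && luns == flags) }
    '
}
```

The monitor could then retry when the listing is incomplete instead of
treating it as a failed LUN, for example
echo "$tgt_output" | tgt_output_complete || sleep 1.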
Has anyone else seen weirdness like this from tgtd? I
assume the "easy" answer is to switch to a newer distribution with
LIO, or just to keep the multiple checks in place to work around
the problem.
--
Mark Nipper
[email protected] (XMPP)
+1 979 575 3193
-
In theory there is no difference between theory and practice. In
practice there is.