Launchpad has imported 23 comments from the remote bug at https://bugzilla.redhat.com/show_bug.cgi?id=338511.
If you reply to an imported comment from within Launchpad, your comment will be sent to the remote bug automatically. Read more about Launchpad's inter-bugtracker facilities at https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2007-10-18T17:51:24+00:00 Tom wrote:

Description of problem:
Boot a 4-node cluster. clvmd starts and all volumes are mounted. Then reboot any of the nodes; the node exits the cluster normally. On startup, clvmd fails to start on the rebooted node and the system hangs there. The only way for the rebooted system to start clvmd successfully is to reboot the whole cluster.

Version-Release number of selected component (if applicable):
lvm2-cluster-debuginfo-2.02.21-7.el4
lvm2-cluster-2.02.21-7.el4
lvm2-2.02.21-5.el4

How reproducible:
Every time

Steps to Reproduce:
1. First boot up the cluster (4 nodes); everything starts normally, no problems.
2. Reboot a node; there is no I/O or testing happening.
3. The cluster services (ccs, cman, fence, qdisk) all start.
4. clvmd fails to start and hangs. The node never comes up.

Actual results:
All other nodes report that the node rebooted, but never report it joining the cluster again.

Oct 18 12:41:05 et-virt08 kernel: CMAN: node et-virt10.lab.boston.redhat.com has been removed from the cluster : Missed too many heartbeats
Oct 18 12:41:06 et-virt08 fenced[6664]: et-virt10.lab.boston.redhat.com not a cluster member after 0 sec post_fail_delay
Oct 18 12:41:06 et-virt08 fenced[6664]: fencing node "et-virt10.lab.boston.redhat.com"
Oct 18 12:41:15 et-virt08 fenced[6664]: fence "et-virt10.lab.boston.redhat.com" success

Expected results:
clvmd should start and activate the volumes so GFS can mount them.

Additional info:

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/0

------------------------------------------------------------------------
On 2007-10-18T17:51:24+00:00 Tom wrote:

Created attachment 231321
Console of the node rebooting

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/1

------------------------------------------------------------------------
On 2007-10-23T23:03:09+00:00 Corey wrote:

Lon and I took a look at this cluster, and the first thing we saw was that clvmd was taking up 80 MB on two nodes. We then rebooted the whole cluster and noticed that the root drive on et-virt09 appears to be bad. We started clvmd on all nodes by hand and rebooted et-virt10 afterwards. When it was back up, we reproduced this bug when attempting to start clvmd on et-virt10. We then rebooted everything again, turned on all debugging in the lvm.conf file, and once again attempted to trigger this bug, but unfortunately it didn't happen.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/2
------------------------------------------------------------------------
On 2007-10-24T13:44:58+00:00 Lon wrote:

lvs is stuck talking to clvmd:

(gdb) bt
#0  0x00000036c980b162 in __read_nocancel () from /lib64/tls/libpthread.so.0
#1  0x000000000044efd3 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115
#2  0x000000000044f2c6 in _lock_resource (cmd=Variable "cmd" is not available.) at locking/cluster_locking.c:410
#3  0x000000000043b646 in _lock_vol (cmd=0x684240, resource=0x7fbfffb450 "oradbnas", flags=33) at locking/locking.c:237
#4  0x000000000043b859 in lock_vol (cmd=0x684240, vol=0x69eeb0 "oradbnas", flags=33) at locking/locking.c:270
#5  0x000000000041e87f in process_each_lv (cmd=0x684240, argc=Variable "argc" is not available.) at toollib.c:324
#6  0x000000000041cd7a in _report (cmd=0x684240, argc=0, argv=0x7fbffff970, report_type=Variable "report_type" is not available.) at reporter.c:329
#7  0x0000000000414710 in lvm_run_command (cmd=0x684240, argc=0, argv=0x7fbffff970) at lvmcmdline.c:935
#8  0x0000000000415492 in lvm2_main (argc=1, argv=0x7fbffff968, is_static=Variable "is_static" is not available.) at lvmcmdline.c:1423
#9  0x00000036c911c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#10 0x000000000040c42a in _start ()
#11 0x0000007fbffff958 in ?? ()
#12 0x000000000000001c in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x0000007fbffffb4a in ?? ()
#15 0x0000000000000000 in ?? ()

==================================

clvmd is stuck in kernel mode in the DLM:

clvmd         D 000001013a47f8d8     0  6276      1          7499 (NOTLB)
000001010c9b1da8 0000000000000002 ffffffffa0290457 000001010000006a
0000000000000000 0000000000000001 000001000104da80 000000018013250d
0000010135e007f0 0000000000001ccb
Call Trace:
<ffffffffa0290457>{:dlm:dlm_recoverd+0}
<ffffffff8030c72d>{wait_for_completion+167}
<ffffffff8013416c>{default_wake_function+0}
<ffffffff8013416c>{default_wake_function+0}
<ffffffffa026c3f1>{:cman:kcl_join_service+381}
<ffffffffa0287ddb>{:dlm:dlm_new_lockspace+1418}
<ffffffffa0282878>{:dlm:dlm_write+2060}
<ffffffff8017a4b2>{vfs_write+207}
<ffffffff8017a59a>{sys_write+69}
<ffffffff8011026a>{system_call+126}

... digging deeper.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/3

------------------------------------------------------------------------
On 2007-10-24T13:45:53+00:00 Lon wrote:

Note that on boot, the DLM whines about "connect from non-cluster node". I haven't seen this message before, and I wonder if it's related to a class of bugs in CMAN on RHEL5 where node-0 (i.e. qdisk) is treated as a regular node. I didn't think this existed on RHEL4, but I'll double-check.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/4

------------------------------------------------------------------------
On 2007-10-24T13:52:38+00:00 Lon wrote:

Created attachment 236221
Full dmesg

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/5

------------------------------------------------------------------------
On 2007-10-24T13:52:52+00:00 Lon wrote:

(including stack traces)

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/6

------------------------------------------------------------------------
On 2007-10-24T13:56:47+00:00 Lon wrote:

It's waiting for the clvmd lockspace to finish joining:

DLM Lock Space: "clvmd" 3 3 join S-6,20,3 [4 1 3]

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/7

------------------------------------------------------------------------
On 2007-10-24T15:04:39+00:00 Lon wrote:

CMAN nodes:

crash> p *(struct cluster_node *)0x1010d799c00
$35 = {
  list = {
    next = 0x1010d799880,
    prev = 0x1013959e300
  },
  name = 0x1013b621280 "et-virt08.lab.boston.redhat.com",
  addr_list = {
    next = 0x10005f68d00,
    prev = 0x10005f68d00
  },
  us = 0,
  node_id = 1,
  state = NODESTATE_MEMBER,
  last_seq_recv = 1629,
  last_ackneeded_seq_recv = 603,
  last_seq_acked = 18,
  last_seq_sent = 18,
  votes = 1,
  expected_votes = 3,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4299878162,
  join_time = {
    tv_sec = 1193232725,
    tv_usec = 798085
  }
}

crash> p *(struct cluster_node *)0x1010d799880
$36 = {
  list = {
    next = 0x1010d4c3b00,
    prev = 0x1010d799c00
  },
  name = 0x1013b621520 "et-virt09.lab.boston.redhat.com",
  addr_list = {
    next = 0x10005f5ed80,
    prev = 0x10005f5ed80
  },
  us = 0,
  node_id = 2,
  state = NODESTATE_DEAD,
  last_seq_recv = 0,
  last_ackneeded_seq_recv = 0,
  last_seq_acked = 0,
  last_seq_sent = 0,
  votes = 1,
  expected_votes = 3,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4294749092,
  join_time = {
    tv_sec = 1193232725,
    tv_usec = 798082
  }
}

crash> p *(struct cluster_node *)0x1010d4c3b00
$37 = {
  list = {
    next = 0xffffffffa027ce40,
    prev = 0x1010d799880
  },
  name = 0x1013b6214c0 "et-virt10.lab.boston.redhat.com",
  addr_list = {
    next = 0x10005f68c40,
    prev = 0x10005f68c40
  },
  us = 1,
  node_id = 3,
  state = NODESTATE_MEMBER,
  last_seq_recv = 0,
  last_ackneeded_seq_recv = 0,
  last_seq_acked = 0,
  last_seq_sent = 14,
  votes = 1,
  expected_votes = 3,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4294749093,
  join_time = {
    tv_sec = 1193232723,
    tv_usec = 110389
  }
}

Quorum device node:

crash> p *(struct cluster_node *)0x000001010cee2a80
$38 = {
  list = {
    next = 0x0,
    prev = 0x0
  },
  name = 0x101341a08e0 "/dev/sdh",
  addr_list = {
    next = 0x1010cee2a98,
    prev = 0x1010cee2a98
  },
  us = 0,
  node_id = 0,
  state = NODESTATE_MEMBER,
  last_seq_recv = 0,
  last_ackneeded_seq_recv = 0,
  last_seq_acked = 0,
  last_seq_sent = 0,
  votes = 3,
  expected_votes = 0,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4300014856,
  join_time = {
    tv_sec = 0,
    tv_usec = 0
  }
}

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/8
------------------------------------------------------------------------
On 2007-10-24T17:35:45+00:00 Lon wrote:

crash> set 6280
    PID: 6280
COMMAND: "dlm_recoverd"
   TASK: 1010c9b67f0  [THREAD_INFO: 1010ca08000]
    CPU: 0
  STATE: TASK_INTERRUPTIBLE
crash> bt
PID: 6280   TASK: 1010c9b67f0   CPU: 0   COMMAND: "dlm_recoverd"
 #0 [1010ca09c28] schedule at ffffffff8030c4e9
 #1 [1010ca09d00] dlm_wait_function at ffffffffa028f2f1
 #2 [1010ca09d60] __wake_up at ffffffff80134213
 #3 [1010ca09dc0] rcom_send_message at ffffffffa028ef3e
 #4 [1010ca09e00] dlm_wait_status_low at ffffffffa028f441
 #5 [1010ca09e50] nodes_reconfig_wait at ffffffffa028ac98
 #6 [1010ca09e70] ls_nodes_init at ffffffffa028b0cc
 #7 [1010ca09eb0] dlm_recoverd at ffffffffa0291179
 #8 [1010ca09f20] kthread at ffffffff8014b907
 #9 [1010ca09f50] kernel_thread at ffffffff80110f47

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/9

------------------------------------------------------------------------
On 2007-10-24T20:07:12+00:00 Lon wrote:

Created attachment 236581
cluster.conf

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/10

------------------------------------------------------------------------
On 2007-10-25T20:31:05+00:00 Lon wrote:

=== Output from et-virt08 ===

dlm: connection from 344fa8c0
dlm: connect from non cluster node
dlm: connection from 344fa8c0
dlm: connect from non cluster node
dlm: connection from 364fa8c0
dlm: connect from non cluster node
dlm: connection from 354fa8c0
dlm: connect from non cluster node

344fa8c0 = 192.168.79.52
354fa8c0 = 192.168.79.53
364fa8c0 = 192.168.79.54

The other nodes in the cluster are routing packets out from the wrong IP addresses. The DLM needs to source-route the packets from the correct IPs, or we need to ensure all packets get routed out the right IP address.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/11
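The hex values in those messages are the peer's 32-bit IPv4 address in network byte order, printed as a little-endian integer on this x86_64 box, so the octets read back to front. A minimal C sketch of the decoding (hypothetical helper, not from the DLM source):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical decoder for the IDs in the dlm log above: the value
     * is the network-byte-order IPv4 address read as a little-endian
     * integer, so the octets come out in reverse order. */
    static void decode_conn_id(uint32_t id)
    {
        printf("%08x = %u.%u.%u.%u\n", id,
               id & 0xff, (id >> 8) & 0xff,
               (id >> 16) & 0xff, (id >> 24) & 0xff);
    }

    int main(void)
    {
        decode_conn_id(0x344fa8c0);   /* 192.168.79.52 */
        decode_conn_id(0x354fa8c0);   /* 192.168.79.53 */
        decode_conn_id(0x364fa8c0);   /* 192.168.79.54 */
        return 0;
    }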
------------------------------------------------------------------------
On 2007-10-25T20:32:04+00:00 Lon wrote:

On the test cluster, everything works after a clean boot - this is because Oracle isn't running yet. Once Oracle starts up, it adds several VIPs to the interfaces the cluster is using to communicate, and this causes the routing problems.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/12

------------------------------------------------------------------------
On 2007-10-25T20:33:47+00:00 Lon wrote:

When we reboot a machine, the other nodes are supposed to source their packets from their node IP addresses, but this isn't working correctly. Instead, packets are sourced from another IP address on the interface, which isn't necessarily the node IP address. This causes the DLM to reject the connections, which causes the hang.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/13
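That rejection is the membership check behind the "dlm: connect from non cluster node" messages above: an incoming connection whose source address doesn't match any known cluster node address is dropped. An illustrative sketch of such a check, with hypothetical names (not the actual RHEL4 DLM code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <netinet/in.h>

    /* Hypothetical membership check: accept a connection only if its
     * source address matches a known cluster node.  A packet sourced
     * from an Oracle VIP fails this test, so the connection is
     * rejected as "non cluster node". */
    static bool peer_is_cluster_node(struct in_addr peer,
                                     const struct in_addr *nodes,
                                     size_t n_nodes)
    {
        for (size_t i = 0; i < n_nodes; i++)
            if (nodes[i].s_addr == peer.s_addr)
                return true;
        return false;  /* caller logs the error and drops the connection */
    }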
------------------------------------------------------------------------
On 2007-10-25T21:44:15+00:00 Lon wrote:

Created attachment 238121
Fix, pass 1

Source-routes the connect from the local_addr that we got from CMAN rather than letting the kernel decide where to source the packets from. This eliminates the need to work around the problem using routing or iptables.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/14
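In userspace terms, "source-routing the connect" means binding the socket to the node's cluster address before connecting, which pins the source IP instead of leaving the choice to the kernel's address selection. A minimal sketch of the technique (illustrative only; the actual fix is a kernel-side patch using the in-kernel socket API):

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Bind to a specific local address before connect() so the TCP
     * connection is sourced from that IP rather than whichever address
     * on the interface the kernel would otherwise pick. */
    static int connect_from(const char *local_ip, const char *peer_ip,
                            int port)
    {
        struct sockaddr_in local, peer;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = inet_addr(local_ip);
        local.sin_port = 0;                    /* any source port */
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
            close(fd);
            return -1;
        }

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_addr.s_addr = inet_addr(peer_ip);
        peer.sin_port = htons(port);
        if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }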
------------------------------------------------------------------------
On 2007-10-25T21:59:25+00:00 Lon wrote:

A similar patch could be made for upstream too.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/15

------------------------------------------------------------------------
On 2007-10-26T14:14:43+00:00 Lon wrote:

Patch for upstream submitted to cluster-devel. It only works for TCP; sctp may or may not need a similar patch.

https://www.redhat.com/archives/cluster-devel/2007-October/msg00220.html

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/16

------------------------------------------------------------------------
On 2007-10-26T14:47:13+00:00 Lon wrote:

You can work around this using iptables too. On *all* nodes, do something like:

iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_1_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_2_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
...
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_N_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>

Ex:

[root@et-virt09 ~]# iptables -t nat -A POSTROUTING -s 192.168.76.0/22 \
    -m tcp -p tcp -d 192.168.79.125 \
    -j SNAT --to-source 192.168.79.94

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/17

------------------------------------------------------------------------
On 2007-10-26T20:42:24+00:00 Lon wrote:

Note that this only seems to happen if multiple "non-secondary" addresses show up in 'ip addr list'.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/18

------------------------------------------------------------------------
On 2007-10-30T17:42:35+00:00 Lon wrote:

Patch in CVS.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/21

------------------------------------------------------------------------
On 2007-10-30T17:44:34+00:00 Lon wrote:

Patch in RHEL4 + RHEL46 branches.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/22

------------------------------------------------------------------------
On 2007-11-09T21:24:30+00:00 Lon wrote:

Updated workaround:

You can work around this using iptables too. On *all* nodes, do something like:

iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_1_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_2_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
...
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_N_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>

Ex:

[root@et-virt09 ~]# iptables -t nat -A POSTROUTING -s 192.168.76.0/22 \
    -m tcp -p tcp -d 192.168.79.125 \
    -j SNAT --to-source 192.168.79.94

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/23

------------------------------------------------------------------------
On 2007-11-21T21:56:01+00:00 errata-xmlrpc wrote:

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0995.html

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/28

** Changed in: redhat-cluster-suite (Fedora)
   Importance: Unknown => Medium

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/158288

Title:
  Node hangs at clvm when joining cluster

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs