Launchpad has imported 23 comments from the remote bug at https://bugzilla.redhat.com/show_bug.cgi?id=338511.
If you reply to an imported comment from within Launchpad, your comment will be sent to the remote bug automatically. Read more about Launchpad's inter-bugtracker facilities at https://help.launchpad.net/InterBugTracking.

------------------------------------------------------------------------
On 2007-10-18T17:51:24+00:00 Tom wrote:

Description of problem:
Boot a 4-node cluster. clvmd starts and all volumes are mounted. Then reboot any of the nodes; the node exits the cluster normally. On startup, clvmd fails to start on the rebooted node and the system hangs there. The only way for the rebooted system to start clvmd successfully is to reboot the whole cluster.

Version-Release number of selected component (if applicable):
lvm2-cluster-debuginfo-2.02.21-7.el4
lvm2-cluster-2.02.21-7.el4
lvm2-2.02.21-5.el4

How reproducible:
Every time

Steps to Reproduce:
1. First boot up the cluster (4 nodes); everything starts normally, no problems.
2. Reboot a node; there is no I/O or testing happening.
3. The cluster services (ccs, cman, fence, qdisk) all start.
4. clvmd fails to start and hangs. The node never comes up.

Actual results:
All other nodes report that the node rebooted, but never report it joining the cluster again.

Oct 18 12:41:05 et-virt08 kernel: CMAN: node et-virt10.lab.boston.redhat.com has been removed from the cluster : Missed too many heartbeats
Oct 18 12:41:06 et-virt08 fenced[6664]: et-virt10.lab.boston.redhat.com not a cluster member after 0 sec post_fail_delay
Oct 18 12:41:06 et-virt08 fenced[6664]: fencing node "et-virt10.lab.boston.redhat.com"
Oct 18 12:41:15 et-virt08 fenced[6664]: fence "et-virt10.lab.boston.redhat.com" success

Expected results:
clvmd should start and activate the volumes so GFS can mount them.

Additional info:

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/0

------------------------------------------------------------------------
On 2007-10-18T17:51:24+00:00 Tom wrote:

Created attachment 231321
Console of the node rebooting

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/1

------------------------------------------------------------------------
On 2007-10-23T23:03:09+00:00 Corey wrote:

Lon and I took a look at this cluster, and the first thing we saw was that clvmd was taking up 80 MB on two nodes. We then rebooted the whole cluster and noticed that the root drive on et-virt09 appears to be bad. We started clvmd on all nodes by hand and rebooted et-virt10 afterwards. When it was back up, we reproduced this bug when attempting to start clvmd on et-virt10. We then rebooted everything again, turned on all debugging in the lvm.conf file, and once again attempted to trigger this bug, but unfortunately it didn't happen.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/2
------------------------------------------------------------------------
On 2007-10-24T13:44:58+00:00 Lon wrote:

lvs is stuck talking to clvmd:

(gdb) bt
#0  0x00000036c980b162 in __read_nocancel () from /lib64/tls/libpthread.so.0
#1  0x000000000044efd3 in _lock_for_cluster (cmd=51 '3', flags=Variable "flags" is not available.) at locking/cluster_locking.c:115
#2  0x000000000044f2c6 in _lock_resource (cmd=Variable "cmd" is not available.) at locking/cluster_locking.c:410
#3  0x000000000043b646 in _lock_vol (cmd=0x684240, resource=0x7fbfffb450 "oradbnas", flags=33) at locking/locking.c:237
#4  0x000000000043b859 in lock_vol (cmd=0x684240, vol=0x69eeb0 "oradbnas", flags=33) at locking/locking.c:270
#5  0x000000000041e87f in process_each_lv (cmd=0x684240, argc=Variable "argc" is not available.) at toollib.c:324
#6  0x000000000041cd7a in _report (cmd=0x684240, argc=0, argv=0x7fbffff970, report_type=Variable "report_type" is not available.) at reporter.c:329
#7  0x0000000000414710 in lvm_run_command (cmd=0x684240, argc=0, argv=0x7fbffff970) at lvmcmdline.c:935
#8  0x0000000000415492 in lvm2_main (argc=1, argv=0x7fbffff968, is_static=Variable "is_static" is not available.) at lvmcmdline.c:1423
#9  0x00000036c911c3fb in __libc_start_main () from /lib64/tls/libc.so.6
#10 0x000000000040c42a in _start ()
#11 0x0000007fbffff958 in ?? ()
#12 0x000000000000001c in ?? ()
#13 0x0000000000000001 in ?? ()
#14 0x0000007fbffffb4a in ?? ()
#15 0x0000000000000000 in ?? ()

==================================

clvmd is stuck in kernel mode in the DLM:

clvmd         D 000001013a47f8d8     0  6276      1          7499 (NOTLB)
000001010c9b1da8 0000000000000002 ffffffffa0290457 000001010000006a
0000000000000000 0000000000000001 000001000104da80 000000018013250d
0000010135e007f0 0000000000001ccb
Call Trace:
<ffffffffa0290457>{:dlm:dlm_recoverd+0}
<ffffffff8030c72d>{wait_for_completion+167}
<ffffffff8013416c>{default_wake_function+0}
<ffffffff8013416c>{default_wake_function+0}
<ffffffffa026c3f1>{:cman:kcl_join_service+381}
<ffffffffa0287ddb>{:dlm:dlm_new_lockspace+1418}
<ffffffffa0282878>{:dlm:dlm_write+2060}
<ffffffff8017a4b2>{vfs_write+207}
<ffffffff8017a59a>{sys_write+69}
<ffffffff8011026a>{system_call+126}

... digging deeper.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/3

------------------------------------------------------------------------
On 2007-10-24T13:45:53+00:00 Lon wrote:

Note that on boot, the DLM whines about "connect from non-cluster node". I haven't seen this message before, and I wonder if it's related to a class of bugs in CMAN on RHEL5 where node-0 (i.e. qdisk) is treated as a regular node. I didn't think this existed on RHEL4, but I'll double-check.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/4

------------------------------------------------------------------------
On 2007-10-24T13:52:38+00:00 Lon wrote:

Created attachment 236221
Full dmesg

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/5

------------------------------------------------------------------------
On 2007-10-24T13:52:52+00:00 Lon wrote:

(including stack traces)

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/6

------------------------------------------------------------------------
On 2007-10-24T13:56:47+00:00 Lon wrote:

It's waiting for the clvmd lockspace to finish joining:

DLM Lock Space: "clvmd" 3 3 join S-6,20,3 [4 1 3]

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/7

------------------------------------------------------------------------
On 2007-10-24T15:04:39+00:00 Lon wrote:

CMAN nodes:

crash> p *(struct cluster_node *)0x1010d799c00
$35 = {
  list = {
    next = 0x1010d799880,
    prev = 0x1013959e300
  },
  name = 0x1013b621280 "et-virt08.lab.boston.redhat.com",
  addr_list = {
    next = 0x10005f68d00,
    prev = 0x10005f68d00
  },
  us = 0,
  node_id = 1,
  state = NODESTATE_MEMBER,
  last_seq_recv = 1629,
  last_ackneeded_seq_recv = 603,
  last_seq_acked = 18,
  last_seq_sent = 18,
  votes = 1,
  expected_votes = 3,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4299878162,
  join_time = {
    tv_sec = 1193232725,
    tv_usec = 798085
  }
}

crash> p *(struct cluster_node *)0x1010d799880
$36 = {
  list = {
    next = 0x1010d4c3b00,
    prev = 0x1010d799c00
  },
  name = 0x1013b621520 "et-virt09.lab.boston.redhat.com",
  addr_list = {
    next = 0x10005f5ed80,
    prev = 0x10005f5ed80
  },
  us = 0,
  node_id = 2,
  state = NODESTATE_DEAD,
  last_seq_recv = 0,
  last_ackneeded_seq_recv = 0,
  last_seq_acked = 0,
  last_seq_sent = 0,
  votes = 1,
  expected_votes = 3,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4294749092,
  join_time = {
    tv_sec = 1193232725,
    tv_usec = 798082
  }
}

crash> p *(struct cluster_node *)0x1010d4c3b00
$37 = {
  list = {
    next = 0xffffffffa027ce40,
    prev = 0x1010d799880
  },
  name = 0x1013b6214c0 "et-virt10.lab.boston.redhat.com",
  addr_list = {
    next = 0x10005f68c40,
    prev = 0x10005f68c40
  },
  us = 1,
  node_id = 3,
  state = NODESTATE_MEMBER,
  last_seq_recv = 0,
  last_ackneeded_seq_recv = 0,
  last_seq_acked = 0,
  last_seq_sent = 14,
  votes = 1,
  expected_votes = 3,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4294749093,
  join_time = {
    tv_sec = 1193232723,
    tv_usec = 110389
  }
}

Quorum device node:

crash> p *(struct cluster_node *)0x000001010cee2a80
$38 = {
  list = {
    next = 0x0,
    prev = 0x0
  },
  name = 0x101341a08e0 "/dev/sdh",
  addr_list = {
    next = 0x1010cee2a98,
    prev = 0x1010cee2a98
  },
  us = 0,
  node_id = 0,
  state = NODESTATE_MEMBER,
  last_seq_recv = 0,
  last_ackneeded_seq_recv = 0,
  last_seq_acked = 0,
  last_seq_sent = 0,
  votes = 3,
  expected_votes = 0,
  leave_reason = 0,
  incarnation = 0,
  last_hello = 4300014856,
  join_time = {
    tv_sec = 0,
    tv_usec = 0
  }
}

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/8
------------------------------------------------------------------------
On 2007-10-24T17:35:45+00:00 Lon wrote:

crash> set 6280
    PID: 6280
COMMAND: "dlm_recoverd"
   TASK: 1010c9b67f0  [THREAD_INFO: 1010ca08000]
    CPU: 0
  STATE: TASK_INTERRUPTIBLE
crash> bt
PID: 6280   TASK: 1010c9b67f0   CPU: 0   COMMAND: "dlm_recoverd"
 #0 [1010ca09c28] schedule at ffffffff8030c4e9
 #1 [1010ca09d00] dlm_wait_function at ffffffffa028f2f1
 #2 [1010ca09d60] __wake_up at ffffffff80134213
 #3 [1010ca09dc0] rcom_send_message at ffffffffa028ef3e
 #4 [1010ca09e00] dlm_wait_status_low at ffffffffa028f441
 #5 [1010ca09e50] nodes_reconfig_wait at ffffffffa028ac98
 #6 [1010ca09e70] ls_nodes_init at ffffffffa028b0cc
 #7 [1010ca09eb0] dlm_recoverd at ffffffffa0291179
 #8 [1010ca09f20] kthread at ffffffff8014b907
 #9 [1010ca09f50] kernel_thread at ffffffff80110f47

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/9

------------------------------------------------------------------------
On 2007-10-24T20:07:12+00:00 Lon wrote:

Created attachment 236581
cluster.conf

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/10

------------------------------------------------------------------------
On 2007-10-25T20:31:05+00:00 Lon wrote:

=== Output from et-virt08 ===

dlm: connection from 344fa8c0
dlm: connect from non cluster node
dlm: connection from 344fa8c0
dlm: connect from non cluster node
dlm: connection from 364fa8c0
dlm: connect from non cluster node
dlm: connection from 354fa8c0
dlm: connect from non cluster node

344fa8c0 = 192.168.79.52
354fa8c0 = 192.168.79.53
364fa8c0 = 192.168.79.54

The other nodes in the cluster are routing packets out from the wrong IP addresses. The DLM needs to source-route the packets from the correct IPs, or we need to ensure all packets get routed out the right IP address.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/11
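The hex values in those messages are the peer's 32-bit IPv4 address in network byte order, printed as a little-endian integer on this x86_64 box, so the octets read back to front. A minimal C sketch of the decoding (hypothetical helper, not from the DLM source):

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical decoder for the IDs in the dlm log above: the value
     * is the network-byte-order IPv4 address read as a little-endian
     * integer, so the octets come out in reverse order. */
    static void decode_conn_id(uint32_t id)
    {
        printf("%08x = %u.%u.%u.%u\n", id,
               id & 0xff, (id >> 8) & 0xff,
               (id >> 16) & 0xff, (id >> 24) & 0xff);
    }

    int main(void)
    {
        decode_conn_id(0x344fa8c0);   /* 192.168.79.52 */
        decode_conn_id(0x354fa8c0);   /* 192.168.79.53 */
        decode_conn_id(0x364fa8c0);   /* 192.168.79.54 */
        return 0;
    }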
------------------------------------------------------------------------
On 2007-10-25T20:32:04+00:00 Lon wrote:

On the test cluster, everything works after a clean boot - this is because Oracle isn't running yet. Once Oracle starts up, it adds several VIPs to the interfaces the cluster is using to communicate, and this causes the routing problems.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/12

------------------------------------------------------------------------
On 2007-10-25T20:33:47+00:00 Lon wrote:

When we reboot a machine, the other nodes are supposed to source their packets from their node IP addresses, but this isn't working correctly. Instead, packets are sourced from another IP address on the interface, which isn't necessarily the node IP address. This causes the DLM to reject the connections, which causes the hang.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/13
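That rejection is the membership check behind the "dlm: connect from non cluster node" messages above: an incoming connection whose source address doesn't match any known cluster node address is dropped. An illustrative sketch of such a check, with hypothetical names (not the actual RHEL4 DLM code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <netinet/in.h>

    /* Hypothetical membership check: accept a connection only if its
     * source address matches a known cluster node.  A packet sourced
     * from an Oracle VIP fails this test, so the connection is
     * rejected as "non cluster node". */
    static bool peer_is_cluster_node(struct in_addr peer,
                                     const struct in_addr *nodes,
                                     size_t n_nodes)
    {
        for (size_t i = 0; i < n_nodes; i++)
            if (nodes[i].s_addr == peer.s_addr)
                return true;
        return false;  /* caller logs the error and drops the connection */
    }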
------------------------------------------------------------------------
On 2007-10-25T21:44:15+00:00 Lon wrote:

Created attachment 238121
Fix, pass 1

Source-routes the connect from the local_addr that we got from CMAN rather than letting the kernel decide where to source the packets from. This eliminates the need to work around the problem using routing or iptables.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/14
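In userspace terms, "source-routing the connect" means binding the socket to the node's cluster address before connecting, which pins the source IP instead of leaving the choice to the kernel's address selection. A minimal sketch of the technique (illustrative only; the actual fix is a kernel-side patch using the in-kernel socket API):

    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Bind to a specific local address before connect() so the TCP
     * connection is sourced from that IP rather than whichever address
     * on the interface the kernel would otherwise pick. */
    static int connect_from(const char *local_ip, const char *peer_ip,
                            int port)
    {
        struct sockaddr_in local, peer;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        memset(&local, 0, sizeof(local));
        local.sin_family = AF_INET;
        local.sin_addr.s_addr = inet_addr(local_ip);
        local.sin_port = 0;                    /* any source port */
        if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
            close(fd);
            return -1;
        }

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_addr.s_addr = inet_addr(peer_ip);
        peer.sin_port = htons(port);
        if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }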
------------------------------------------------------------------------
On 2007-10-25T21:59:25+00:00 Lon wrote:

A similar patch could be made for upstream too.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/15

------------------------------------------------------------------------
On 2007-10-26T14:14:43+00:00 Lon wrote:

Patch for upstream submitted to cluster-devel. It only works for TCP; sctp may or may not need a similar patch.

https://www.redhat.com/archives/cluster-devel/2007-October/msg00220.html

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/16

------------------------------------------------------------------------
On 2007-10-26T14:47:13+00:00 Lon wrote:

You can work around this using iptables too. On *all* nodes, do something like:

iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_1_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_2_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
...
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_N_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>

Ex:

[root@et-virt09 ~]# iptables -t nat -A POSTROUTING -s 192.168.76.0/22 \
    -m tcp -p tcp -d 192.168.79.125 \
    -j SNAT --to-source 192.168.79.94

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/17

------------------------------------------------------------------------
On 2007-10-26T20:42:24+00:00 Lon wrote:

Note that this only seems to happen if multiple "non-secondary" addresses show up in 'ip addr list'.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/18

------------------------------------------------------------------------
On 2007-10-30T17:42:35+00:00 Lon wrote:

Patch in CVS.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/21

------------------------------------------------------------------------
On 2007-10-30T17:44:34+00:00 Lon wrote:

Patch in RHEL4 + RHEL46 branches.

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/22

------------------------------------------------------------------------
On 2007-11-09T21:24:30+00:00 Lon wrote:

Updated workaround:

You can work around this using iptables too. On *all* nodes, do something like:

iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_1_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_2_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>
...
iptables -t nat -A POSTROUTING -s <cluster_subnet/mask> \
    -m tcp -p tcp -d <node_N_ip_addr> \
    -j SNAT --to-source <my_cluster_ip>

Ex:

[root@et-virt09 ~]# iptables -t nat -A POSTROUTING -s 192.168.76.0/22 \
    -m tcp -p tcp -d 192.168.79.125 \
    -j SNAT --to-source 192.168.79.94

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/23

------------------------------------------------------------------------
On 2007-11-21T21:56:01+00:00 errata-xmlrpc wrote:

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0995.html

Reply at: https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/comments/28

** Changed in: redhat-cluster-suite (Fedora)
   Importance: Unknown => Medium

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/158288

Title:
  Node hangs at clvm when joining cluster

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/redhat-cluster-suite/+bug/158288/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs