On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani <r.giord...@libero.it> wrote:
> Hello,
> I'll explain what happened after a network blackout.
> I have a cluster with Pacemaker on openSUSE 11.2 64-bit:
> ============
> Last updated: Wed Aug 18 18:13:33 2010
> Current DC: nodo1 (nodo1)
> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
> 3 Nodes configured.
> 11 Resources configured.
> ============
>
> Node: nodo1 (nodo1): online
> Node: nodo3 (nodo3): online
> Node: nodo4 (nodo4): online
>
> Clone Set: dlm-clone
>     dlm:0 (ocf::pacemaker:controld): Started nodo3
>     dlm:1 (ocf::pacemaker:controld): Started nodo1
>     dlm:2 (ocf::pacemaker:controld): Started nodo4
> Clone Set: o2cb-clone
>     o2cb:0 (ocf::ocfs2:o2cb): Started nodo3
>     o2cb:1 (ocf::ocfs2:o2cb): Started nodo1
>     o2cb:2 (ocf::ocfs2:o2cb): Started nodo4
> Clone Set: XencfgFS-Clone
>     XencfgFS:0 (ocf::heartbeat:Filesystem): Started nodo3
>     XencfgFS:1 (ocf::heartbeat:Filesystem): Started nodo1
>     XencfgFS:2 (ocf::heartbeat:Filesystem): Started nodo4
> Clone Set: XenimageFS-Clone
>     XenimageFS:0 (ocf::heartbeat:Filesystem): Started nodo3
>     XenimageFS:1 (ocf::heartbeat:Filesystem): Started nodo1
>     XenimageFS:2 (ocf::heartbeat:Filesystem): Started nodo4
> rsa1-fencing (stonith:external/ibmrsa-telnet): Started nodo4
> rsa2-fencing (stonith:external/ibmrsa-telnet): Started nodo3
> rsa3-fencing (stonith:external/ibmrsa-telnet): Started nodo4
> rsa4-fencing (stonith:external/ibmrsa-telnet): Started nodo3
> mailsrv-rm (ocf::heartbeat:Xen): Started nodo3
> dbsrv-rm (ocf::heartbeat:Xen): Started nodo4
> websrv-rm (ocf::heartbeat:Xen): Started nodo4
>
> After a switch failure, all the nodes and the RSA stonith devices were unreachable.
>
> On the cluster, the following error occurred on one node:
>
> Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored: receive_plocks_stored 1778493632:2 need_plocks 0#012
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] ------------[ cut here ]------------
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: 0000 [#1] SMP
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file: /sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250 i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue]
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]-
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[<ffffffff801331c2>] [<ffffffff801331c2>] iput+0x82/0x90
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:ffff88014ec03c30 EFLAGS: 00010246
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: 0000000000000000 RBX: ffff880148a703c8 RCX: 0000000000000000
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: ffffc90000010000 RSI: ffff880148a70380 RDI: ffff880148a703c8
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272125] RBP: ffff88014ec03c50 R08: b038000000000000 R09: fe99594c51a57607
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272126] R10: ffff880040410270 R11: 0000000000000000 R12: ffff8801713e6e08
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272128] R13: ffff88014ec03d20 R14: 0000000000000000 R15: ffffc9000331d108
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272133] FS: 00007ff4cb11a730(0000) GS:ffffc90000010000(0000) knlGS:0000000000000000
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272135] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272136] CR2: 00007ff4c5c45000 CR3: 0000000135b2a000 CR4: 0000000000002660
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272138] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272140] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272142] Process dlm_send (pid: 8889, threadinfo ffff88014ec02000, task ffff8801381e45c0)
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272143] Stack:
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272144] 0000000000000000 00000000072f0874 ffff880148a70380 ffff880148a70380
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272146] <0> ffff88014ec03c80 ffffffff803add09 ffff88014ec03c80 00000000072f0874
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272147] <0> ffff8801713e6df8 ffff8801713e6e08 ffff88014ec03de0 ffffffffa05661e1
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272150] Call Trace:
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272164] [<ffffffff803add09>] sock_release+0x89/0xa0
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272177] [<ffffffffa05661e1>] tcp_connect_to_sock+0x161/0x2b0 [dlm]
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272206] [<ffffffffa0568764>] process_send_sockets+0x34/0x60 [dlm]
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272222] [<ffffffff800693f3>] run_workqueue+0x83/0x230
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272227] [<ffffffff80069654>] worker_thread+0xb4/0x140
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272231] [<ffffffff8006fac6>] kthread+0xb6/0xc0
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272236] [<ffffffff8000d38a>] child_rip+0xa/0x20
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272240] Code: 42 20 48 c7 c2 b0 4c 13 80 48 85 c0 48 0f 44 c2 48 89 df ff d0 48 8b 45 e8 65 48 33 04 25 28 00 00 00 75 0b 48 83 c4 18 5b c9 c3 <0f> 0b eb fe e8 35 c6 f1 ff 0f 1f 44 00 00 55 48 8d 97 10 02 00
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272256] RIP [<ffffffff801331c2>] iput+0x82/0x90
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272259] RSP <ffff88014ec03c30>
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272264] ---[ end trace 7707d0d92a7f5415 ]---
> Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster node
>
> and after a few log lines, the following was repeated until the node was killed by me:
>
> Aug 18 13:12:31 nodo1 cluster-dlm: start_kernel: start_kernel cg 3 member_count 1#012
> Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member 1778493632#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_configfs_members: set_members rmdir "/sys/kernel/config/dlm/cluster/spaces/0BB443F896254AD3BA8FB960C425B666/nodes/1812048064"#012
> Aug 18 13:12:31 nodo1 cluster-dlm: do_sysfs: write "1" to "/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control"#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
>
> The log file is attached.
>
> Can someone explain the reason for this?
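For reference when reading the cluster-dlm messages above: corosync/openais auto-generates nodeids from the node's ring0 IPv4 address (network byte order, logged here as a little-endian decimal integer). Assuming that default scheme is in use on this cluster, the numeric nodeids in the logs can be decoded back to addresses to see which host the DLM is evicting; a minimal sketch:

```python
import socket
import struct

def nodeid_to_ip(nodeid: int) -> str:
    """Decode a corosync auto-generated nodeid into a dotted-quad IPv4.

    Assumes the default scheme: the nodeid is the ring0 IPv4 address in
    network byte order, printed in the logs as a little-endian integer.
    """
    # Repack the integer little-endian, then read the bytes as a network-order address.
    return socket.inet_ntoa(struct.pack("<I", nodeid))

# The two nodeids appearing in the cluster-dlm messages above:
for nid in (1778493632, 1812048064):
    print(nid, "->", nodeid_to_ip(nid))
# 1778493632 -> 192.168.1.106
# 1812048064 -> 192.168.1.108
```

As a sanity check, the well-known localhost nodeid 16777343 decodes to 127.0.0.1 under the same scheme. If the cluster sets explicit nodeids in corosync.conf instead, this mapping does not apply.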
Perhaps the membership got out of sync...

    Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster node

Maybe lmb or dejan can suggest something... I don't have much to do with ocfs2 anymore.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker