I don't know if this is useful, but we had a crash we had never seen before. Setup:
2x SuSE SLES 10 SP2 (it's old, I know)

Problem description:
1. We had to reboot the ocfs2 master node.
2. During the reboot, umount crashed, leaving the filesystem mounted or maybe still heartbeating (?).
3. The slave node detected that the master was dead.
4. When the slave tried to assume the master role, it rebooted (no crash, no warning, nothing, just as if someone had pressed the reset button).
5. The master hung because it could not unmount the ocfs2 filesystem.

We could not capture many messages from the nodes, just these:

Master node umount crash (from syslog):

Dec 2 14:22:08 soap02 kernel: (19573,5):dlm_empty_lockres:2783 ERROR: lockres M00000000000000164ad60700000000 still has local locks!
Dec 2 14:22:08 soap02 kernel: ----------- [cut here ] --------- [please bite here ] ---------
Dec 2 14:22:08 soap02 kernel: Kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2784
Dec 2 14:22:08 soap02 kernel: invalid opcode: 0000 [1] SMP
Dec 2 14:22:08 soap02 kernel: last sysfs file: /devices/pci0000:00/0000:00:1c.0/0000:04:00.0/0000:05:00.0/power/state
Dec 2 14:22:08 soap02 kernel: CPU 5
Dec 2 14:22:08 soap02 kernel: Modules linked in: af_packet joydev st ocfs2 jbd ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs nfsd exportfs nfs lockd nfs_acl sunrpc ipv6 button battery ac binfmt_misc netconsole xt_comment xt_tcpudp xt_state iptable_filter iptable_mangle iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables apparmor loop sr_mod usbhid usb_storage ide_cd uhci_hcd ehci_hcd usbcore shpchp hw_random cdrom bnx2 pci_hotplug reiserfs ata_piix ahci libata dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd dm_mod fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk ide_core
Dec 2 14:22:08 soap02 kernel: Pid: 19573, comm: umount Tainted: G U 2.6.16.60-0.21-smp #1
Dec 2 14:22:08 soap02 kernel: RIP: 0010:[<ffffffff885a9d6d>] <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec 2 14:22:08 soap02 kernel: RSP: 0018:ffff810356f65c88 EFLAGS: 00010292
Dec 2 14:22:08 soap02 kernel: RAX: 000000000000006a RBX: ffff8101f28f7880 RCX: 0000000000000292
Dec 2 14:22:08 soap02 kernel: RDX: ffffffff80359968 RSI: 0000000000000296 RDI: ffffffff80359960
Dec 2 14:22:08 soap02 kernel: RBP: ffff81025eec7e00 R08: ffffffff80359968 R09: ffff810423f77a80
Dec 2 14:22:08 soap02 kernel: R10: ffff810001071600 R11: 0000000000000070 R12: 0000000000000184
Dec 2 14:22:08 soap02 kernel: R13: ffff8104257a5400 R14: 0000000000000184 R15: ffff8101f28f7880
Dec 2 14:22:08 soap02 kernel: FS: 00002ab1a83db6d0(0000) GS:ffff810430654840(0000) knlGS:0000000000000000
Dec 2 14:22:08 soap02 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 2 14:22:08 soap02 kernel: CR2: 00002aaaaac16000 CR3: 00000001a2c2f000 CR4: 00000000000006e0
Dec 2 14:22:08 soap02 kernel: Process umount (pid: 19573, threadinfo ffff810356f64000, task ffff8102e78997e0)
Dec 2 14:22:08 soap02 kernel: Stack: 00000000ffffffd9 0000000000000000 01ff810400000001 ffff8102e78997e0
Dec 2 14:22:08 soap02 kernel: 0100000000000000 0000000100000003 0000000000000000 ffff8102e78997e0
Dec 2 14:22:08 soap02 kernel: ffffffff80147f3e ffff810356f65cd0
Dec 2 14:22:08 soap02 kernel: Call Trace: <ffffffff80147f3e>{autoremove_wake_function+0}
Dec 2 14:22:08 soap02 kernel: <ffffffff885a30e1>{:ocfs2_dlm:dlm_unregister_domain+479}
Dec 2 14:22:08 soap02 kernel: <ffffffff8012c668>{default_wake_function+0} <ffffffff8860bb5e>{:ocfs2:ocfs2_dlm_shutdown+190}
Dec 2 14:22:08 soap02 kernel: <ffffffff8862fe07>{:ocfs2:ocfs2_dismount_volume+559}
Dec 2 14:22:08 soap02 kernel: <ffffffff886302f7>{:ocfs2:ocfs2_put_super+104} <ffffffff8018bc99>{generic_shutdown_super+148}
Dec 2 14:22:08 soap02 kernel: <ffffffff8018bd6a>{kill_block_super+38} <ffffffff8018be40>{deactivate_super+114}
Dec 2 14:22:08 soap02 kernel: <ffffffff801a078e>{sys_umount+623} <ffffffff8018e4e1>{sys_newstat+25}
Dec 2 14:22:08 soap02 kernel: <ffffffff8010ae42>{system_call+126}
Dec 2 14:22:08 soap02 kernel:
Dec 2 14:22:08 soap02 kernel: Code: 0f 0b 68 95 d0 5b 88 c2 e0 0a 48 f7 05 9e 2c fd ff 00 09 00
Dec 2 14:22:08 soap02 kernel: RIP <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255} RSP <ffff810356f65c88>
Dec 2 14:22:08 soap02 kernel: Badness in do_exit at kernel/exit.c:837
Dec 2 14:22:08 soap02 kernel:
Dec 2 14:22:08 soap02 kernel: Call Trace: <ffffffff80137000>{do_exit+80} <ffffffff802ea8b6>{_spin_unlock_irqrestore+8}
Dec 2 14:22:08 soap02 kernel: <ffffffff8010c820>{kernel_math_error+0} <ffffffff8010cdb5>{do_invalid_op+163}
Dec 2 14:22:09 soap02 kernel: <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec 2 14:22:09 soap02 kernel: <ffffffff8012c10c>{activate_task+204} <ffffffff8012c657>{try_to_wake_up+1106}
Dec 2 14:22:09 soap02 kernel: <ffffffff801349b8>{printk+78} <ffffffff8010bd19>{error_exit+0}
Dec 2 14:22:09 soap02 kernel: <ffffffff885a9d6d>{:ocfs2_dlm:dlm_empty_lockres+5255}
Dec 2 14:22:09 soap02 kernel: <ffffffff80147f3e>{autoremove_wake_function+0} <ffffffff885a30e1>{:ocfs2_dlm:dlm_unregister_domain+479}
Dec 2 14:22:09 soap02 kernel: <ffffffff8012c668>{default_wake_function+0} <ffffffff8860bb5e>{:ocfs2:ocfs2_dlm_shutdown+190}
Dec 2 14:22:09 soap02 kernel: <ffffffff8862fe07>{:ocfs2:ocfs2_dismount_volume+559}
Dec 2 14:22:09 soap02 kernel: <ffffffff886302f7>{:ocfs2:ocfs2_put_super+104} <ffffffff8018bc99>{generic_shutdown_super+148}
Dec 2 14:22:09 soap02 kernel: <ffffffff8018bd6a>{kill_block_super+38} <ffffffff8018be40>{deactivate_super+114}
Dec 2 14:22:09 soap02 kernel: <ffffffff801a078e>{sys_umount+623} <ffffffff8018e4e1>{sys_newstat+25}
Dec 2 14:22:09 soap02 kernel: <ffffffff8010ae42>{system_call+126}

Slave node detecting the master down, before it rebooted:

Dec 2 14:23:14 soap01 kernel: o2net: connection to node soap02 (num 0) at 192.168.0.10:7777 has been idle for 60.0 seconds, shutting it down.
Dec 2 14:23:14 soap01 kernel: (0,0):o2net_idle_timer:1422 here are some times that might help debug the situation: (tmr 1259770934.129785 now 1259770994.132629 dr 1259770934.129779 adv 1259770934.129789:1259770934.129789 func (300d6acb:505) 1259770933.205787:1259770933.205792)
Dec 2 14:23:14 soap01 kernel: o2net: no longer connected to node soap02 (num 0) at 192.168.0.10:7777
Dec 2 14:23:14 soap01 kernel: (7035,1):dlm_do_master_request:1409 ERROR: link to 0 went down!
Dec 2 14:23:14 soap01 kernel: (7039,0):dlm_do_master_request:1409 ERROR: link to 0 went down!
Dec 2 14:23:14 soap01 kernel: (7039,0):dlm_get_lock_resource:986 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7035,1):dlm_get_lock_resource:986 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7043,0):dlm_do_master_request:1409 ERROR: link to 0 went down!
Dec 2 14:23:14 soap01 kernel: (7043,0):dlm_get_lock_resource:986 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7047,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Dec 2 14:23:14 soap01 kernel: (7047,0):dlm_wait_for_node_death:370 F59B45831EEA41F384BADE6C4B7A932B: waiting 5000ms for notification of death of node 0
Dec 2 14:24:14 soap01 kernel: (5283,0):o2net_connect_expired:1583 ERROR: no connection established with node 0 after 60.0 seconds, giving up and returning errors.
Dec 2 14:24:14 soap01 kernel: (7047,0):dlm_send_remote_convert_request:395 ERROR: status = -107
Dec 2 14:24:14 soap01 kernel: (7047,0):dlm_wait_for_node_death:370 F59B45831EEA41F384BADE6C4B7A932B: waiting 5000ms for notification of death of node 0

I hope this information is useful.

Regards,
-- 
Sérgio Surkamp | Network Manager
ser...@gruposinternet.com.br

*Grupos Internet S.A.*
R. Lauro Linhares, 2123 Torre B - Sala 201
Trindade - Florianópolis - SC
+55 48 3234-4109
http://www.gruposinternet.com.br
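
P.S. In case it helps anyone reading the soap01 messages: here is a minimal Python sketch (not something we run on the cluster) that pulls the tmr and now values out of the o2net_idle_timer message and names the two status codes. It assumes that tmr is when the idle timer was last armed and now is when it fired, and that -112 and -107 are negated Linux errno values (EHOSTDOWN and ENOTCONN). The function and variable names are made up for illustration.

```python
import re

# The o2net_idle_timer message from soap01, copied from the syslog excerpt.
LINE = (
    "(0,0):o2net_idle_timer:1422 here are some times that might help debug "
    "the situation: (tmr 1259770934.129785 now 1259770994.132629 "
    "dr 1259770934.129779 adv 1259770934.129789:1259770934.129789 "
    "func (300d6acb:505) 1259770933.205787:1259770933.205792)"
)

# Assumption: the status values are negated Linux errno numbers.
STATUS_NAMES = {112: "EHOSTDOWN", 107: "ENOTCONN"}


def idle_seconds(line):
    """Return now - tmr (in seconds) from an o2net_idle_timer message."""
    match = re.search(r"tmr (\d+\.\d+) now (\d+\.\d+)", line)
    if match is None:
        raise ValueError("no 'tmr ... now ...' pair found")
    tmr, now = (float(value) for value in match.groups())
    return now - tmr


def status_name(status):
    """Map a negative status such as -112 to an errno name, if known."""
    return STATUS_NAMES.get(-status, "unknown")


if __name__ == "__main__":
    print(f"idle for about {idle_seconds(LINE):.3f} s")
    for status in (-112, -107):
        print(status, status_name(status))
```

The difference between the two timestamps comes out at roughly 60 seconds, which agrees with the "has been idle for 60.0 seconds" message logged at the same time.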