Hello List,

Earlier I wrote about my 3-node web cluster running on OCFS2. I 
experimented with several 3.x kernels, including 3.2, 3.13 and 3.16, but so far 
4.1.1 has proved to be the most stable. Nevertheless, it has had major crashes 
again in the last couple of days (while the R/W load was relatively low for 
my setup).

All 3 nodes run as KVM guests on the same server and communicate 
with each other through the libvirt network driver (which, as far as I 
understand, is just a memory copy; no packets are sent out on the wire, so in 
theory this should provide a reliable, low-latency, Gbit/s-class link between 
the VMs). The logs suggest that the nodes sometimes lose their connection to 
one another, but this might not be a network issue; something may be holding 
the CPU instead (the host server, which runs the same 4.1.1 kernel, has more 
than enough resources: 48 CPUs and 256 GB of RAM). The only line in the log 
that is meaningful to me is:

NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!
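
To check whether the lockups and the o2net idle messages line up in time, 
something like the following could be run over the guest's kernel log. It is 
only a sketch: the function name and both regular expressions are my own, 
written against the message formats quoted in this mail, and it assumes each 
log record sits on a single, unwrapped line.

```python
import re

# Both patterns are assumptions derived from the messages quoted in this mail.
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]]+)\]"
)
IDLE_RE = re.compile(
    r"o2net: Connection to node (\S+) \(num (\d+)\) at (\S+) "
    r"has been idle for ([\d.]+) secs"
)


def scan_log(lines):
    """Collect soft-lockup and o2net idle events from an iterable of log lines.

    Returns a (lockups, idles) pair of lists of dicts.
    """
    lockups, idles = [], []
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            lockups.append(
                {"cpu": int(m.group(1)),
                 "seconds": int(m.group(2)),
                 "task": m.group(3)}
            )
            continue
        m = IDLE_RE.search(line)
        if m:
            idles.append(
                {"node": m.group(1),
                 "num": int(m.group(2)),
                 "addr": m.group(3),
                 "seconds": float(m.group(4))}
            )
    return lockups, idles
```

It can be fed an open log file, for example scan_log(open(path)), where path 
is wherever the guest writes its kernel messages.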

Any ideas? Maybe I should use a different I/O scheduler or change something 
else inside the VM? Would upgrading the guest kernel from 4.1.1 to the latest 
stable improve anything?
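
For reference, the scheduler currently active in a guest can be read from 
sysfs, where the selected one is shown in square brackets (for example 
"noop deadline [cfq]"). Below is a small sketch that reports it per block 
device. The helper names are mine, and it assumes the usual 
/sys/block/<dev>/queue/scheduler layout.

```python
import glob
import os
import re


def active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    m = re.search(r"\[([^\]]+)\]", text)
    return m.group(1) if m else None


def list_schedulers(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active I/O scheduler.

    The default glob pattern assumes the standard sysfs layout.
    """
    result = {}
    for path in glob.glob(pattern):
        # .../<dev>/queue/scheduler -> <dev>
        dev = os.path.basename(os.path.dirname(os.path.dirname(path)))
        try:
            with open(path) as fh:
                result[dev] = active_scheduler(fh.read())
        except OSError:
            # Unreadable entry: record it without a scheduler name.
            result[dev] = None
    return result
```

Calling list_schedulers() inside each VM would show whether all three nodes 
use the same scheduler before any of them is changed.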


Nov 14 19:05:17 webserver1 kernel: [2004352.064040] NMI watchdog: BUG: soft 
lockup - CPU#0 stuck for 23s! [kworker/u2:1:16601]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Modules linked in: ocfs2 
quota_tree nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace 
fscache sunrpc ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager 
ocfs2_stackglue configfs loop psmouse pcspkr joydev evdev serio_raw 
acpi_cpufreq i2c_piix4 processor i2c_core button virtio_balloon thermal_sys 
hid_generic usbhid dm_mod ata_generic virtio_net virtio_blk uhci_hcd
 ehci_hcd ata_piix libata usbcore virtio_pci virtio_ring virtio usb_common
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CPU: 0 PID: 16601 Comm: 
kworker/u2:1 Not tainted 4.1.1
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Hardware name: Bochs Bochs, 
BIOS Bochs 01/01/2011
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Workqueue: o2net 
o2net_rx_until_empty [ocfs2_nodemanager]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] task: ffff880033084290 ti: 
ffff88002b498000 task.ti: ffff88002b498000
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RIP: 
0010:[<ffffffffa0193ca1>]  [<ffffffffa0193ca1>] 
__dlm_lookup_lockres_full+0xa3/0xe9 [ocfs2_dlm]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RSP: 0018:ffff88002b49bc28  
EFLAGS: 00000286
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RAX: 00000000ffffffc1 RBX: 
ffffffff81547380 RCX: 0000000000000017
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RDX: 000000000000001e RSI: 
ffff88006e9eb029 RDI: ffff880007c447a0
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] RBP: ffff88006e9eb028 R08: 
0000000000000066 R09: 000000000000002a
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] R10: 000000000000002a R11: 
dead000000200200 R12: 0000000000000246
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] R13: 0000000000000050 R14: 
0000000000000000 R15: 0000000000000000
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] FS:  00007f098602e7a0(0000) 
GS:ffff88007fc00000(0000) knlGS:0000000000000000
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CS:  0010 DS: 0000 ES: 0000 
CR0: 000000008005003b
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] CR2: ffffffffff600400 CR3: 
000000007ca91000 CR4: 00000000000006f0
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Stack:
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  ffff88002b49bcf0 
0000000000000040 ffff880033869000 ffff88006e9eb028
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  000000000000001f 
00000000c6232e1b ffff88006e9eb000 ffffffffa0193d73
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  ffff88007fc00000 
ffff88006ec768e8 0000000000000082 ffff880033869000
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Call Trace:
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0193d73>] ? 
__dlm_lookup_lockres+0x8c/0xd3 [ocfs2_dlm]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0193df9>] ? 
dlm_lookup_lockres+0x3f/0x5c [ocfs2_dlm]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa01ae6e4>] ? 
dlm_unlock_lock_handler+0x2af/0x663 [ocfs2_dlm]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0157315>] ? 
o2net_handler_tree_lookup+0x5b/0xa8 [ocfs2_nodemanager]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffffa0159552>] ? 
o2net_rx_until_empty+0xc2c/0xc7c [ocfs2_nodemanager]
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81001623>] ? 
__switch_to+0x1d4/0x457
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81065260>] ? 
pick_next_task_fair+0x174/0x320
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff8105215e>] ? 
process_one_work+0x179/0x283
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81052445>] ? 
worker_thread+0x1b8/0x292
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff8105228d>] ? 
process_scheduled_works+0x25/0x25
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff81056175>] ? 
kthread+0x99/0xa1
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff810560dc>] ? 
__kthread_parkme+0x58/0x58
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff815d2b52>] ? 
ret_from_fork+0x42/0x70
Nov 14 19:05:17 webserver1 kernel: [2004352.064042]  [<ffffffff810560dc>] ? 
__kthread_parkme+0x58/0x58
Nov 14 19:05:17 webserver1 kernel: [2004352.064042] Code: c0 75 02 0f 0b 48 8b 
7b 10 44 89 ee 31 db e8 8d d4 ff ff 48 8b 00 48 85 c0 74 46 45 
8d 6c 24 ff 4c 8d 75 01 48 89 c3 48 8b 7b 18 <0f> be 45 00 0f b6 17 39 c2 75 23 
44 39 63 14 75 1d 48 ff c7 4c 
Nov 14 19:05:17 webserver1 kernel: [2004356.576077] o2net: Connection to node 
webserver3 (num 2) at 10.0.0.247:7777 has been idle for 30.32 secs.


...
[Tue Nov 17 12:52:30 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:32 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:34 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:36 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:38 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:40 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:42 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:44 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:46 2015] o2net: Connection to node webserver1 (num 0) at 
10.0.0.245:7777 shutdown, state 7
[Tue Nov 17 12:52:46 2015] o2net: Accepted connection from node webserver3 (num 
2) at 10.0.0.247:7777
[Tue Nov 17 12:52:48 2015] o2net: Connected to node webserver1 (num 0) at 
10.0.0.245:7777
[Tue Nov 17 12:52:49 2015] o2dlm: Joining domain 
1CA770B625644665B7677546DCC1211C ( 0 1 2 ) 3 nodes
[Tue Nov 17 12:52:49 2015] ocfs2: Mounting device (252,16) on (node 1, slot 1) 
with writeback data mode.
[Tue Nov 17 12:52:54 2015] o2dlm: Joining domain 
0DE1B15CBA5340F09A7908313FCB3680 ( 0 1 2 ) 3 nodes
[Tue Nov 17 12:52:54 2015] ocfs2: Mounting device (252,32) on (node 1, slot 0) 
with writeback data mode.
[Tue Nov 17 12:52:58 2015] o2dlm: Joining domain 
573D9BC0E98B47BE8EBB8FE7F1CB5281 ( 0 1 2 ) 3 nodes
[Tue Nov 17 12:52:58 2015] ocfs2: Mounting device (252,48) on (node 1, slot 1) 
with writeback data mode.


Thank you!
