Hi all, I have two VirtualBox VMs running on two different physical
hosts. The VMs are interconnected by two gigabit Ethernet links for
DRBD sync and heartbeat.
Suddenly I got this on the master machine:
Feb 9 10:53:24 mail1 kernel: [136200.650336] INFO: task
jbd2/drbd0-8:13739 blocked for more than 120 seconds.
Feb 9 10:53:24 mail1 kernel: [136200.650967] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 10:53:24 mail1 kernel: [136200.651651] jbd2/drbd0-8 D
0000000000000002 0 13739 2 0x00000000
Feb 9 10:53:24 mail1 kernel: [136200.651660] ffff880030365b30
0000000000000046 0000000000015bc0 0000000000015bc0
Feb 9 10:53:24 mail1 kernel: [136200.651668] ffff88003cddb198
ffff880030365fd8 0000000000015bc0 ffff88003cddade0
Feb 9 10:53:24 mail1 kernel: [136200.651676] 0000000000015bc0
ffff880030365fd8 0000000000015bc0 ffff88003cddb198
Feb 9 10:53:24 mail1 kernel: [136200.651684] Call Trace:
Feb 9 10:53:24 mail1 kernel: [136200.651725] [<ffffffff810f3cd0>] ?
sync_page+0x0/0x50
Feb 9 10:53:24 mail1 kernel: [136200.651743] [<ffffffff81559633>]
io_schedule+0x73/0xc0
Feb 9 10:53:24 mail1 kernel: [136200.651751] [<ffffffff810f3d0d>]
sync_page+0x3d/0x50
Feb 9 10:53:24 mail1 kernel: [136200.651759] [<ffffffff81559c7f>]
__wait_on_bit+0x5f/0x90
Feb 9 10:53:24 mail1 kernel: [136200.651766] [<ffffffff810f3ec3>]
wait_on_page_bit+0x73/0x80
Feb 9 10:53:24 mail1 kernel: [136200.651775] [<ffffffff81084440>] ?
wake_bit_function+0x0/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651790] [<ffffffff810fe305>] ?
pagevec_lookup_tag+0x25/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651798] [<ffffffff810f4355>]
wait_on_page_writeback_range+0xf5/0x190
Feb 9 10:53:24 mail1 kernel: [136200.651805] [<ffffffff810f441f>]
filemap_fdatawait+0x2f/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651814] [<ffffffff8121c6d4>]
jbd2_journal_commit_transaction+0x744/0x1280
Feb 9 10:53:24 mail1 kernel: [136200.651822] [<ffffffff81076a59>] ?
try_to_del_timer_sync+0x79/0xd0
Feb 9 10:53:24 mail1 kernel: [136200.651831] [<ffffffff8122378d>]
kjournald2+0xbd/0x220
Feb 9 10:53:24 mail1 kernel: [136200.651838] [<ffffffff81084400>] ?
autoremove_wake_function+0x0/0x40
Feb 9 10:53:24 mail1 kernel: [136200.651846] [<ffffffff812236d0>] ?
kjournald2+0x0/0x220
Feb 9 10:53:24 mail1 kernel: [136200.651853] [<ffffffff81084086>]
kthread+0x96/0xa0
Feb 9 10:53:24 mail1 kernel: [136200.651861] [<ffffffff810131ea>]
child_rip+0xa/0x20
Feb 9 10:53:24 mail1 kernel: [136200.651869] [<ffffffff81083ff0>] ?
kthread+0x0/0xa0
Feb 9 10:53:24 mail1 kernel: [136200.651876] [<ffffffff810131e0>] ?
child_rip+0x0/0x20
From this moment on, many other blocked-task errors appeared
(postfix, pickup and so on). The machine load was more than 25!
Obviously I could not use the machine anymore, and I had to kill it in
order to force the takeover on the slave. Halt didn't work either.
My question is: why did I get this error? What can I do to avoid it?
Thanks
--
Dario Fiumicello - Antek S.r.l.
+3902890380 73