Dear All,

We have a running Lustre file system, version 2.10.7. The MDT server runs Linux kernel 3.0.101, and the MDT uses the ldiskfs backend with a patched kernel.
Today our MDT server crashed and needed a cold reboot; in other words, the Lustre MDT was not cleanly unmounted before the reboot. After rebooting and mounting the MDT partition, we found the "orph_cleanup_ch" process running indefinitely. So far it has run for more than 4 hours, consuming 100% of one CPU core, and it has led to a system lockup with many dmesg messages like the following:

[16240.692491] INFO: rcu_sched_state detected stall on CPU 2 (t=3167100 jiffies)
[16240.692524] Pid: 3949, comm: orph_cleanup_ch Not tainted 3.0.101 #1
[16240.692551] Call Trace:
[16240.692572] <IRQ> [<ffffffff8109aad8>] ? __rcu_pending+0x258/0x460
[16240.692608] [<ffffffff8109b769>] ? rcu_check_callbacks+0x69/0x130
[16240.692637] [<ffffffff8104f046>] ? update_process_times+0x46/0x80
[16240.692668] [<ffffffff8106ddc8>] ? tick_sched_timer+0x58/0xa0
[16240.692697] [<ffffffff81061f6c>] ? __run_hrtimer.isra.34+0x3c/0xd0
[16240.692726] [<ffffffff810625af>] ? hrtimer_interrupt+0xdf/0x230
[16240.692756] [<ffffffff81020ff7>] ? smp_apic_timer_interrupt+0x67/0xa0
[16240.692791] [<ffffffff81443f13>] ? apic_timer_interrupt+0x13/0x20
[16240.692818] <EOI> [<ffffffffa0286801>] ? __ldiskfs_check_dir_entry+0xb1/0x1d0 [ldiskfs]
[16240.692873] [<ffffffffa02871d3>] ? ldiskfs_htree_store_dirent+0x133/0x190 [ldiskfs]
[16240.692920] [<ffffffffa02692a5>] ? htree_dirblock_to_tree+0xc5/0x170 [ldiskfs]
[16240.692966] [<ffffffffa026dd41>] ? ldiskfs_htree_fill_tree+0x171/0x220 [ldiskfs]
[16240.693012] [<ffffffffa0286a77>] ? ldiskfs_readdir+0x157/0x760 [ldiskfs]
[16240.693054] [<ffffffffa0569b3c>] ? top_trans_stop+0x13c/0xaa0 [ptlrpc]
[16240.693084] [<ffffffffa0942c40>] ? osd_it_ea_next+0x190/0x190 [osd_ldiskfs]
[16240.693116] [<ffffffffa02963fe>] ? htree_lock_try+0x3e/0x80 [ldiskfs]
[16240.693146] [<ffffffffa0942842>] ? osd_ldiskfs_it_fill+0xa2/0x220 [osd_ldiskfs]
[16240.693191] [<ffffffffa0942b66>] ? osd_it_ea_next+0xb6/0x190 [osd_ldiskfs]
[16240.693222] [<ffffffffa0b188ac>] ? lod_it_next+0x1c/0x90 [lod]
[16240.693251] [<ffffffffa0b871fa>] ? __mdd_orphan_cleanup+0x33a/0x1770 [mdd]
[16240.693281] [<ffffffff81039b1d>] ? default_wake_function+0xd/0x10
[16240.693310] [<ffffffffa0b86ec0>] ? orph_declare_index_delete+0x6b0/0x6b0 [mdd]
[16240.693354] [<ffffffffa0b86ec0>] ? orph_declare_index_delete+0x6b0/0x6b0 [mdd]
[16240.693398] [<ffffffff8105e039>] ? kthread+0x99/0xa0
[16240.693425] [<ffffffff81444674>] ? kernel_thread_helper+0x4/0x10
[16240.693453] [<ffffffff8105dfa0>] ? kthread_flush_work_fn+0x10/0x10
[16240.693480] [<ffffffff81444670>] ? gs_change+0xb/0xb

We guess that this process is doing a consistency check of the MDT partition, since it was not cleanly unmounted during the cold reboot. Although the whole file system looks normal (i.e., we can mount the clients), we are wondering whether this process will eventually complete its work. Otherwise the MDT's operating system stays locked by this process, which makes everything else misbehave: e.g., the "df" command hangs forever, the systemd process is also stuck in "D" state, and ssh logins and NIS seem abnormal ....

Any suggestions to fix this problem are very much appreciated. Thank you very much.

Best Regards,

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
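P.S. For anyone hitting a similar stall: a minimal sketch of the generic commands we can use to check whether the thread is still making progress (pid 3949 is taken from the dmesg trace above; the script and its guard logic are just a suggestion, not Lustre-specific tooling, and reading /proc/PID/stack typically requires root):

```shell
#!/bin/sh
# Sketch: inspect a spinning kernel thread and the tasks blocked behind it.
# PID defaults to 3949 (orph_cleanup_ch from the trace above); pass another
# pid as the first argument.
PID=${1:-3949}

# List tasks in uninterruptible sleep (D state) -- these would be the hung
# df / systemd / ssh processes waiting behind the MDT.
ps axo pid,stat,comm | awk '$2 ~ /^D/ { print }'

# Sample the kernel stack of the thread twice; if the top frames differ
# between samples, it is still iterating the orphan directory rather than
# being hard-deadlocked.
if [ -r "/proc/$PID/stack" ]; then
    cat "/proc/$PID/stack"
    sleep 5
    cat "/proc/$PID/stack"
else
    echo "cannot read /proc/$PID/stack (thread gone, or need root?)"
fi
```

With SysRq enabled, `echo w > /proc/sysrq-trigger` additionally dumps all blocked tasks to dmesg for a fuller picture.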
