Dear All,

We found an additional error message in dmesg on the MDT server:
LustreError: 3949:0:(lod_lov.c:1066:validate_lod_and_idx()) chome-MDT0000-mdtlov: bad idx: 2 of 32
[20940.114011] LustreError: 3949:0:(lod_lov.c:1066:validate_lod_and_idx()) Skipped 71754441 previous similar messages

I am not sure whether this is what keeps the "orph_cleanup_ch" process on the MDT running indefinitely. Is there any way to fix it? (It has now been running for more than 6.5 hours and is still going.)

Thanks very much.

T.H.Hsieh

On Sat, Aug 08, 2020 at 03:44:18PM +0800, Tung-Han Hsieh wrote:
> Dear All,
>
> We have a running Lustre file system, version 2.10.7. The MDT
> server runs Linux kernel 3.0.101, and the MDT uses the ldiskfs backend
> with a patched Linux kernel.
>
> Today our MDT server crashed and needed a cold reboot; in other words,
> the Lustre MDT was not cleanly unmounted before the reboot. After rebooting
> and mounting the MDT partition, we found an "orph_cleanup_ch"
> process running indefinitely. It has now been running for more than
> 4 hours, takes 100% of one CPU core, and locks up the system,
> producing many dmesg messages like the following:
>
> [16240.692491] INFO: rcu_sched_state detected stall on CPU 2 (t=3167100 jiffies)
> [16240.692524] Pid: 3949, comm: orph_cleanup_ch Not tainted 3.0.101 #1
> [16240.692551] Call Trace:
> [16240.692572] <IRQ> [<ffffffff8109aad8>] ? __rcu_pending+0x258/0x460
> [16240.692608] [<ffffffff8109b769>] ? rcu_check_callbacks+0x69/0x130
> [16240.692637] [<ffffffff8104f046>] ? update_process_times+0x46/0x80
> [16240.692668] [<ffffffff8106ddc8>] ? tick_sched_timer+0x58/0xa0
> [16240.692697] [<ffffffff81061f6c>] ? __run_hrtimer.isra.34+0x3c/0xd0
> [16240.692726] [<ffffffff810625af>] ? hrtimer_interrupt+0xdf/0x230
> [16240.692756] [<ffffffff81020ff7>] ? smp_apic_timer_interrupt+0x67/0xa0
> [16240.692791] [<ffffffff81443f13>] ? apic_timer_interrupt+0x13/0x20
> [16240.692818] <EOI> [<ffffffffa0286801>] ? __ldiskfs_check_dir_entry+0xb1/0x1d0 [ldiskfs]
> [16240.692873] [<ffffffffa02871d3>] ? ldiskfs_htree_store_dirent+0x133/0x190 [ldiskfs]
> [16240.692920] [<ffffffffa02692a5>] ? htree_dirblock_to_tree+0xc5/0x170 [ldiskfs]
> [16240.692966] [<ffffffffa026dd41>] ? ldiskfs_htree_fill_tree+0x171/0x220 [ldiskfs]
> [16240.693012] [<ffffffffa0286a77>] ? ldiskfs_readdir+0x157/0x760 [ldiskfs]
> [16240.693054] [<ffffffffa0569b3c>] ? top_trans_stop+0x13c/0xaa0 [ptlrpc]
> [16240.693084] [<ffffffffa0942c40>] ? osd_it_ea_next+0x190/0x190 [osd_ldiskfs]
> [16240.693116] [<ffffffffa02963fe>] ? htree_lock_try+0x3e/0x80 [ldiskfs]
> [16240.693146] [<ffffffffa0942842>] ? osd_ldiskfs_it_fill+0xa2/0x220 [osd_ldiskfs]
> [16240.693191] [<ffffffffa0942b66>] ? osd_it_ea_next+0xb6/0x190 [osd_ldiskfs]
> [16240.693222] [<ffffffffa0b188ac>] ? lod_it_next+0x1c/0x90 [lod]
> [16240.693251] [<ffffffffa0b871fa>] ? __mdd_orphan_cleanup+0x33a/0x1770 [mdd]
> [16240.693281] [<ffffffff81039b1d>] ? default_wake_function+0xd/0x10
> [16240.693310] [<ffffffffa0b86ec0>] ? orph_declare_index_delete+0x6b0/0x6b0 [mdd]
> [16240.693354] [<ffffffffa0b86ec0>] ? orph_declare_index_delete+0x6b0/0x6b0 [mdd]
> [16240.693398] [<ffffffff8105e039>] ? kthread+0x99/0xa0
> [16240.693425] [<ffffffff81444674>] ? kernel_thread_helper+0x4/0x10
> [16240.693453] [<ffffffff8105dfa0>] ? kthread_flush_work_fn+0x10/0x10
> [16240.693480] [<ffffffff81444670>] ? gs_change+0xb/0xb
>
> We guess that this process is doing a consistency check of the
> MDT partition, since it was not cleanly unmounted during the cold reboot.
> Although the whole file system looks normal, i.e., we can mount the
> clients, we are wondering whether the process will eventually
> complete its work. Otherwise, the operating system of the MDT stays
> locked by this process, which makes everything behave abnormally
> (e.g., the "df" command hangs forever, the systemd process is also
> stuck in "D" state, and ssh login and NIS seem abnormal ...).
>
> Any suggestions for fixing this problem are much appreciated.
>
> Thank you very much.
>
> Best Regards,
>
> T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
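[Editor's note: for anyone triaging the same flood of messages, the reported index and target count can be pulled out of a dmesg line with a short shell pipeline. This is only an illustrative sketch working on the sample line quoted above; it does not touch Lustre itself, and the variable names are arbitrary.]

```shell
# Illustrative only: parse "bad idx: <i> of <n>" out of a LustreError line.
# The sample line is copied verbatim from the dmesg output quoted above.
line='LustreError: 3949:0:(lod_lov.c:1066:validate_lod_and_idx()) chome-MDT0000-mdtlov: bad idx: 2 of 32'

# Capture the offending target index and the total target count.
idx=$(printf '%s\n' "$line" | sed -n 's/.*bad idx: \([0-9]*\) of \([0-9]*\).*/\1/p')
total=$(printf '%s\n' "$line" | sed -n 's/.*bad idx: \([0-9]*\) of \([0-9]*\).*/\2/p')

echo "bad index: $idx (of $total configured targets)"
# prints: bad index: 2 (of 32 configured targets)
```

Running the same pipeline over the full dmesg output (e.g. `dmesg | grep validate_lod_and_idx`) shows which target index the LOD layer is rejecting; the actual repair, of course, has to happen on the Lustre side.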
