Hi Andrew,

The error

  kernel: LustreError: 13921:0:(genops.c:478:class_register_device()) astrofs-OST0000-osc-MDT0001: already exists, won't add

is symptomatic of an llog index mismatch between the MDT and the MGT. I would check whether the llog backup of MDT0001 (kept in CONFIGS, visible when the device is mounted as ldiskfs) matches the copy held on the MGT; the llog indexes should match.
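To compare the two copies, something along these lines should work (a rough sketch only: the mount points are examples, the device paths are taken from your mail, and llog_reader ships with the Lustre utilities):

  # on the MGS node: mount the MGT read-only as ldiskfs and dump the config log
  mount -t ldiskfs -o ro /dev/mapper/MGS /mnt/mgt
  llog_reader /mnt/mgt/CONFIGS/astrofs-MDT0001 > /tmp/mdt0001.mgs.txt

  # on the MDS node: dump the local backup kept on MDT0001 itself
  mount -t ldiskfs -o ro /dev/mapper/MDT0001 /mnt/mdt
  llog_reader /mnt/mdt/CONFIGS/astrofs-MDT0001 > /tmp/mdt0001.local.txt

  # the record indexes in the two dumps should line up
  diff /tmp/mdt0001.mgs.txt /tmp/mdt0001.local.txt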
If your MDT ran out of space or inodes (!), the llog backup may have failed or become corrupted. There are multiple patches in 2.12+ that address various config llog issues (for example, ones triggered by llog_cancel). I don't think lfsck can repair config llog issues. At worst, you could try a writeconf on the whole filesystem.
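If you do go the writeconf route, the usual procedure (double-check the ops manual for your exact version, and note that a writeconf also discards any settings made with lctl conf_param) is roughly the following, with the device paths again just examples:

  # with all clients unmounted and all targets stopped:
  tunefs.lustre --writeconf /dev/mapper/MGS      # MGT first
  tunefs.lustre --writeconf /dev/mapper/MDT0000  # then every MDT
  tunefs.lustre --writeconf /dev/mapper/MDT0001
  tunefs.lustre --writeconf /dev/mapper/OST0000  # ...and every OST, on each OSS

  # then remount in order: MGT first, then MDTs, then OSTs, then clients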
Good luck,
Stéphane

> On Apr 19, 2022, at 2:40 AM, Andrew Elwell via lustre-discuss
> <[email protected]> wrote:
>
> Hi Folks,
>
> One of our filesystems seemed to fail over the holiday weekend - we're
> running DNE and MDT0001 won't mount. At first it looked like we'd run
> out of space (rc = -28) but then we were seeing this
>
> mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists retries left: 0
> mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001 failed: File exists
>
> possibly
> kernel: LustreError: 13921:0:(genops.c:478:class_register_device()) astrofs-OST0000-osc-MDT0001: already exists, won't add
>
> lustre_rmmod wouldn't remove everything cleanly (osc in use) and so
> after a reboot everything *seemed* to start OK
>
> [root@astrofs-mds1 ~]# mount -t lustre
> /dev/mapper/MGS on /lustre/MGS type lustre (ro)
> /dev/mapper/MDT0000 on /lustre/astrofs-MDT0000 type lustre (ro)
> /dev/mapper/MDT0001 on /lustre/astrofs-MDT0001 type lustre (ro)
>
> ... but not for long
>
> kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
> kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
>
> possibly corrupt llog?
>
> I see LU-12674, which looks like our problem, but it was only backported
> to the 2.12 branch (these servers are still on 2.10.8)
>
> Piecing together what *might* have happened is that a user possibly ran
> out of inodes and then did a rm -r before the system stopped responding.
>
> Mounting just now I'm getting:
> [ 1985.078422] LustreError: 10953:0:(llog.c:654:llog_process_thread()) astrofs-OST0001-osc-MDT0001: Local llog found corrupted #0x7ede0:1:0 plain index 35518 count 2
> [ 1985.095129] LustreError: 10959:0:(llog_osd.c:961:llog_osd_next_block()) astrofs-MDT0001-osd: invalid llog tail at log id [0x7ef40:0x1:0x0]:0 offset 577536 bytes 4096
> [ 1985.109892] LustreError: 10959:0:(osp_sync.c:1242:osp_sync_thread()) astrofs-OST0004-osc-MDT0001: llog process with osp_sync_process_queues failed: -22
> [ 1985.126797] LustreError: 10973:0:(llog_cat.c:269:llog_cat_id2handle()) astrofs-OST000b-osc-MDT0001: error opening log id [0x7ef76:0x1:0x0]:0: rc = -2
> [ 1985.140169] LustreError: 10973:0:(llog_cat.c:823:llog_cat_process_cb()) astrofs-OST000b-osc-MDT0001: cannot find handle for llog [0x7ef76:0x1:0x0]: rc = -2
> [ 1985.155321] Lustre: astrofs-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
> [ 1985.169404] Lustre: astrofs-MDT0001: in recovery but waiting for the first client to connect
> [ 1985.177869] Lustre: astrofs-MDT0001: Will be in recovery for at least 2:30, or until 1508 clients reconnect
> [ 1985.187612] Lustre: astrofs-MDT0001: Connection restored to a5e41149-73fc-b60a-30b1-da096a5c2527 (at 1170@gni1)
> [ 2017.251374] Lustre: astrofs-MDT0001: Connection restored to 7a388f58-bc16-6bd7-e0c8-4ffa7c0dd305 (at 400@gni1)
> [ 2017.261374] Lustre: Skipped 1275 previous similar messages
> [ 2081.458117] Lustre: astrofs-MDT0001: Connection restored to 10.10.36.143@o2ib4 (at 10.10.36.143@o2ib4)
> [ 2081.467419] Lustre: Skipped 277 previous similar messages
> [ 2082.324547] Lustre: astrofs-MDT0001: Recovery over after 1:37, of 1508 clients 1508 recovered and 0 were evicted.
>
> Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
> kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
>
> Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
> kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
> [ 2082.392381] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt ) failed:
> [ 2082.401422] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
> [ 2082.408558] Pid: 11082, comm: orph_cleanup_as 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon May 27 03:45:37 UTC 2019
> [ 2082.418891] Call Trace:
> [ 2082.421340] [<ffffffffc0af07cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
> [ 2082.427890] [<ffffffffc0af087c>] lbug_with_loc+0x4c/0xa0 [libcfs]
> [ 2082.434077] [<ffffffffc1694159>] osp_sync_declare_add+0x3a9/0x3e0 [osp]
> [ 2082.440797] [<ffffffffc1683299>] osp_declare_destroy+0xc9/0x1c0 [osp]
> [ 2082.447338] [<ffffffffc15e0c6e>] lod_sub_declare_destroy+0xce/0x2d0 [lod]
> [ 2082.454237] [<ffffffffc15c54a5>] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
> [ 2082.461213] [<ffffffffc15d0ac6>] lod_obj_for_each_stripe+0xb6/0x230 [lod]
> [ 2082.468104] [<ffffffffc15d184b>] lod_declare_destroy+0x43b/0x5c0 [lod]
> [ 2082.474736] [<ffffffffc1648896>] orph_key_test_and_del+0x5f6/0xd30 [mdd]
> [ 2082.481538] [<ffffffffc1649587>] __mdd_orphan_cleanup+0x5b7/0x840 [mdd]
> [ 2082.488250] [<ffffffffa7cc1c31>] kthread+0xd1/0xe0
> [ 2082.493147] [<ffffffffa8374c1d>] ret_from_fork_nospec_begin+0x7/0x21
> [ 2082.499601] [<ffffffffffffffff>] 0xffffffffffffffff
> [ 2082.504585] Kernel panic - not syncing: LBUG
>
> e2fsck when mounted as ldiskfs seems to be clean, but is there a way I
> can get it mounted enough to run lfsck?
>
> Alternatively, can I upgrade the MDSs to 2.12.x while having the OSSs
> still on 2.10? Yes, I know this isn't ideal, but I wasn't planning a
> large upgrade at zero notice to our users (also, we still have a
> legacy system accessing it with a 2.7 client - its replacement
> arrived last Sept but still hasn't been handed over to us yet, so I
> really don't want to get too out of step).
>
> Many thanks
>
> Andrew
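P.S. Since you mention e2fsck: it only checks the ldiskfs metadata, so a clean result says nothing about the contents of the config llog files themselves. For reference, a forced read-only pass with the Lustre-patched e2fsprogs would be something like:

  # -f forces a full check even if the filesystem looks clean;
  # -n answers "no" to every prompt, so nothing is modified
  e2fsck -fn /dev/mapper/MDT0001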
