Hi Lustre users,

When reviewing the configuration logs prior to performing this work, I noticed 
that one of the OSTs in use is not listed in the configuration log for the MDT. 
The logs go from OST001e to OST0020 skipping OST001f with no mention of OST001f 
in the log.
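
A gap like this can also be checked mechanically. The following is a minimal
sketch, not a Lustre tool: it assumes the llog_print output has been saved as
plain text and only looks for target labels of the form OSTxxxx (four hex
digits). The function name missing_ost_indices and the sample text are
hypothetical and for illustration only.

```python
import re

# Matches target labels such as OST001e (four hexadecimal digits).
OST_PATTERN = re.compile(r"OST([0-9a-fA-F]{4})")

def missing_ost_indices(log_text):
    """Return OST labels absent between the lowest and highest index seen."""
    seen = {int(m.group(1), 16) for m in OST_PATTERN.finditer(log_text)}
    if not seen:
        return []
    return [f"OST{i:04x}" for i in range(min(seen), max(seen) + 1)
            if i not in seen]

# Illustrative input only; real llog_print output has more fields per record.
sample = "arc15-OST001e-osc-MDT0000\narc15-OST0020-osc-MDT0000\n"
print(missing_ost_indices(sample))
```

This only reports holes between the lowest and highest index present, so an
OST missing from the end of the range would not be detected.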

The configuration log for that OST looks normal, and the OST was roughly as 
full as any of the other OSTs, so it was getting data stored on it.

This now raises a concern for me: is it likely that one of the OSTs will have 
data we cannot recover if I rewrite these logs? At this point the file system 
cannot mount, so I believe rewriting the logs is necessary in any case.

Jesse


________________________________________
From: lustre-discuss <[email protected]> on behalf of 
Jesse Stroik via lustre-discuss <[email protected]>
Sent: Tuesday, June 10, 2025 12:43 PM
To: [email protected]
Subject: [lustre-discuss] MDT refuses to mount: "no more free slots in catalog" 
"can't initialize llog"

Hi Lustre users,

Recently we had a breaker fail overnight, which affected part of one of our data 
centers, including an older Lustre setup. The Lustre setup was in use at the 
time of the power failure.

This is Lustre 2.15.1 / ZFS 2.1.2 running on Rocky 8 and using ZFS for all the 
backend file systems.

When I attempt to start Lustre on the MDS, it mounts the MGS, then starts 
mounting the MDT and goes into recovery before failing with the following 
messages:

=======
Lustre: 5089:0:(llog_cat.c:101:llog_cat_new_log()) arc15-OST0001-osc-MDT0000: 
there are no more free slots in catalog [0x5:0x1:0x0]:0
LustreError: 5089:0:(osp_sync.c:1553:osp_sync_init()) 
arc15-OST0001-osc-MDT0000: can't initialize llog: rc = -28
LustreError: 5089:0:(obd_config.c:774:class_setup()) setup 
arc15-OST0001-osc-MDT0000 failed (-28)
LustreError: 6519:0:(obd_config.c:2001:class_config_llog_handler()) 
MGC172.16.23.25@o2ib: cfg command failed: rc = -28
Lustre:    cmd=cf003 0:arc15-OST0001-osc-MDT0000  1:arc15-OST0001_UUID  
2:172.16.23.18@o2ib
=======
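
For reading these messages: the rc values appear to be negated Linux errno
codes, and errno 28 is ENOSPC ("no space left"), which would be consistent
with the "no more free slots in catalog" line. A minimal sketch for decoding
such values follows; decode_rc is a hypothetical helper name, not part of
Lustre.

```python
import errno
import os

def decode_rc(rc):
    """Map a negated errno-style return code to its symbolic name and text."""
    code = abs(rc)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

print(decode_rc(-28))
```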

If I attempt to start Lustre again or mount the MDT directly after this first 
attempt, I also see this message in the logs:

=======
 LustreError: 5350:0:(genops.c:522:class_register_device()) 
arc15-OST0001-osc-MDT0000: already exists, won't add
=======

LNET communication looks good, and this error happens whether or not the OSS 
units are powered up and have their OSTs mounted. The system didn't have a 
changelog user registered and wasn't consuming any changelog space.

If I mount just the MGS, I can still see the configuration logs with: "lctl 
--device MGS llog_print <device>" for all devices.

My planned next steps are to regenerate the lustre configuration logs following 
the instructions here:

https://doc.lustre.org/lustre_manual.xhtml#lustremaint.regenerateConfigLogs

I do have snapshots of the MGS and MDT stored on a zpool on another server.

Before I move on with that step, is there anything I should check or am missing?

Thanks,
Jesse
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org