Hi,

We had a crash, with this in the MDS log:

Sep 22 13:45:07 sci-mds01 kernel: LustreError: 258240:0:(osd_handler.c:354:osd_trans_create()) 03781251-MDT0000: someone try to start transaction under readonly mode, should be disabled.
Sep 22 13:45:07 sci-mds01 kernel: CPU: 31 PID: 94594 Comm: mdt_rdpg05_005 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.6.1.el7.x86_64 #1
Sep 22 13:45:07 sci-mds01 kernel: Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
Sep 22 13:45:07 sci-mds01 kernel: Call Trace:
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffff89f81400>] dump_stack+0x19/0x1b
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffffc143e64a>] osd_trans_create+0x3ca/0x410 [osd_zfs]
Sep 22 13:45:07 sci-mds01 kernel: CPU: 10 PID: 258241 Comm: mdt_rdpg05_001 Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.6.1.el7.x86_64 #1
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffffc12d885a>] top_trans_create+0x8a/0x200 [ptlrpc]
Sep 22 13:45:07 sci-mds01 kernel: Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
Sep 22 13:45:07 sci-mds01 kernel: [<ffffffffc16284dc>] lod_trans_create+0x3c/0x50 [lod]
....

This looks similar to:
http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2018-August/015854.html

When restarting, the MGS starts fine, but the one MDT (science-MDT0000) does not:

Sep 23 16:10:17 sci-mds00 kernel: Lustre: MGS: Connection restored to 0dd6cfa0-bdf7-c8ac-7bb9-182f7874e165 (at 0@lo)
Sep 23 16:10:17 sci-mds00 kernel: Lustre: Skipped 1 previous similar message
Sep 23 16:10:19 sci-mds00 kernel: Lustre: 52424:0:(llog_cat.c:93:llog_cat_new_log()) science-OST1100-osc-MDT0000: there are no more free slots in catalog [0x2:0x1:0x0]:0
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 52424:0:(osp_sync.c:1524:osp_sync_init()) science-OST1100-osc-MDT0000: can't initialize llog: rc = -28
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 52424:0:(obd_config.c:559:class_setup()) setup science-OST1100-osc-MDT0000 failed (-28)
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 52424:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.120.10.90@tcp: cfg command failed: rc = -28
Sep 23 16:10:19 sci-mds00 kernel: Lustre: cmd=cf003 0:science-OST1100-osc-MDT0000 1:science-OST1100_UUID 2:10.120.10.110@tcp
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 15c-8: MGC10.120.10.90@tcp: The configuration from log 'science-MDT0000' failed (-28). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. Set.
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 52172:0:(obd_mount_server.c:1397:server_start_targets()) failed to start server science-MDT0000: -28
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 52172:0:(obd_mount_server.c:1992:server_fill_super()) Unable to start targets: -28
Sep 23 16:10:19 sci-mds00 kernel: Lustre: Failing over science-MDT0000
Sep 23 16:10:19 sci-mds00 kernel: Lustre: server umount science-MDT0000 complete
Sep 23 16:10:19 sci-mds00 kernel: LustreError: 52172:0:(obd_mount.c:1608:lustre_fill_super()) Unable to mount (-28)

We have tried running --writeconf on it, but that only moves the problem to this error when mounting an OST:

Sep 23 12:04:16 sci-mds00 kernel: Lustre: MGS: Logs for fs science were removed by user request. All servers must be restarted in order to regenerate the logs: rc = 0
Sep 23 12:04:16 sci-mds00 kernel: Lustre: science-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Sep 23 12:04:38 sci-mds00 kernel: Lustre: MGS: Connection restored to 68b4cd3a-6c73-19c5-2925-935e42bdaf2b (at 10.120.10.111@tcp)
Sep 23 12:04:38 sci-mds00 kernel: Lustre: Skipped 2 previous similar messages
Sep 23 12:04:38 sci-mds00 kernel: Lustre: MGS: Regenerating science-OST1100 log by user request: rc = 0
Sep 23 12:04:45 sci-mds00 kernel: LustreError: 5547:0:(genops.c:556:class_register_device()) science-OST1100-osc-MDT0000: already exists, won't add
Sep 23 12:04:45 sci-mds00 kernel: LustreError: 5547:0:(obd_config.c:1835:class_config_llog_handler()) MGC10.120.10.90@tcp: cfg command failed: rc = -17
Sep 23 12:04:45 sci-mds00 kernel: Lustre: cmd=cf001 0:science-OST1100-osc-MDT0000 1:osp 2:science-MDT0000-mdtlov_UUID
Sep 23 12:04:45 sci-mds00 kernel: LustreError: 1345:0:(mgc_request.c:599:do_requeue()) failed processing log: -17

Any ideas on how to solve this?

Cheers,
Hans Henrik
