Hello again,
Yesterday the MDS server crashed twice (the whole machine).
The first one was before 22:57; the second one was at 00:15 today.
Below you can see the Lustre-related logs. The server was manually
rebooted after the first hang at 22:57 and Lustre started the MDT
recovery. After recovery, the whole system was working properly until
around 23:00, when the data became inaccessible to the clients.
Finally, the server hung at 00:15, but the last Lustre log entry is
from 23:26.
Here I can see a line I have not seen before: "$$$ failed to release
quota space on glimpse 0!=60826269226353608".
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Imperative
Recovery not enabled, recovery window 300-900
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: in recovery
but waiting for the first client to connect
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Will be in
recovery for at least 5:00, or until 125 clients reconnect
Jan 24 22:57:13 srv-lustre11 kernel: LustreError:
3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@000000003892d67b x1812509961058304/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0c20_UUID@10.5.33.243@o2ib1:274/0 lens 336/0
e 0 to 0 dl 1737755839 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:13 srv-lustre11 kernel: LustreError:
3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 1
previous similar message
Jan 24 22:57:20 srv-lustre11 kernel: LustreError:
3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@000000009a279624 x1812509773308160/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0fa7_UUID@10.5.33.244@o2ib1:281/0 lens 336/0
e 0 to 0 dl 1737755846 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:20 srv-lustre11 kernel: LustreError:
3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 9
previous similar messages
Jan 24 22:57:21 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@000000000db38b1b x1812509961083456/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0c1e_UUID@10.5.33.243@o2ib1:282/0 lens 336/0
e 0 to 0 dl 1737755847 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:21 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 12
previous similar messages
Jan 24 22:57:24 srv-lustre11 kernel: LustreError:
3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@0000000034e830d1 x1812509773318336/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0fa1_UUID@10.5.33.244@o2ib1:285/0 lens 336/0
e 0 to 0 dl 1737755850 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:24 srv-lustre11 kernel: LustreError:
3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 8
previous similar messages
Jan 24 22:57:30 srv-lustre11 kernel: LustreError:
3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@00000000e40a36e5 x1812509961108224/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0bbc_UUID@10.5.33.243@o2ib1:291/0 lens 336/0
e 0 to 0 dl 1737755856 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:30 srv-lustre11 kernel: LustreError:
3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 24
previous similar messages
Jan 24 22:57:38 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@000000004a78941b x1812509961124480/t0(0)
o601->LUSTRE-MDT0000-lwp-OST0c1d_UUID@10.5.33.243@o2ib1:299/0 lens 336/0
e 0 to 0 dl 1737755864 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:38 srv-lustre11 kernel: LustreError:
3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 57
previous similar messages
Jan 24 22:57:57 srv-lustre11 kernel: LustreError:
3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not
permitted during recovery req@000000002220d707 x1812509773390720/t0(0)
o601->LUSTRE-MDT0000-lwp-OST139c_UUID@10.5.33.244@o2ib1:318/0 lens 336/0
e 0 to 0 dl 1737755883 ref 1 fl Interpret:/0/ffffffff rc 0/-1
job:'lquota_wb_LUSTR.0'
Jan 24 22:57:57 srv-lustre11 kernel: LustreError:
3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 99
previous similar messages
Jan 24 22:58:15 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Recovery
over after 1:07, of 125 clients 125 recovered and 0 were evicted.
Jan 24 22:58:50 srv-lustre11 kernel: LustreError:
3949159:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0bc4_UUID release: 60826269226353608
granted:66040, total:13781524 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949
enforced:1 hard:62914560 soft:52428800 granted:13781524 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
Jan 24 22:58:50 srv-lustre11 kernel: LustreError:
3949159:0:(qmt_lock.c:425:qmt_lvbo_update()) *$$$ failed to release
quota space on glimpse 0!=60826269226353608* : rc = -22#012
qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 enforced:1 hard:62914560
soft:52428800 granted:13781524 time:0 qunit: 262144 edquot:0 may_rel:0
revoke:0 default:yes
Jan 24 23:08:52 srv-lustre11 kernel: Lustre: LUSTRE-OST1389-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:09:39 srv-lustre11 kernel: Lustre: LUSTRE-OST138b-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:10:24 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST138d-osc-MDT0000: operation ost_connect to node
10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: Skipped 5 previous
similar messages
Jan 24 23:11:18 srv-lustre11 kernel: Lustre: LUSTRE-OST138d-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:11:25 srv-lustre11 kernel: Lustre: LUSTRE-OST138a-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST1390-osc-MDT0000: operation ost_connect to node
10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:12:09 srv-lustre11 kernel: Lustre: LUSTRE-OST138e-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:12:58 srv-lustre11 kernel: Lustre: LUSTRE-OST138f-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: LUSTRE-OST1390-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: Skipped 1 previous similar
message
Jan 24 23:14:35 srv-lustre11 kernel: Lustre: LUSTRE-OST1391-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST1392-osc-MDT0000: operation ost_connect to node
10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node
10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: Skipped 4 previous
similar messages
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: LUSTRE-OST03f3-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: Skipped 1 previous similar
message
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: LUSTRE-OST03f6-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: Skipped 2 previous similar
messages
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node
10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: Skipped 3 previous
similar messages
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: LUSTRE-OST03fc-osc-MDT0000:
Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: Skipped 5 previous similar
messages
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: 11-0:
LUSTRE-OST13f0-osc-MDT0000: operation ost_connect to node
10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: Skipped 5 previous
similar messages
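For reference, the quota state for the project id that shows up in
those messages (id:2949) could be checked with something like the
commands below. This is only a minimal sketch: /lustre is a
placeholder mount point and the exact lctl parameter names may differ
slightly between releases.

  # On a client: usage and limits for project id 2949
  lfs quota -p 2949 /lustre

  # On the MDS: the quota master's global index for project quotas
  lctl get_param qmt.*.dt-0x0.glb-prj

  # On the MDS/OSS targets: per-target quota slave state
  lctl get_param osd-ldiskfs.*.quota_slave.info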
Thanks.
Jose.
On 21/01/2025 at 10:34, Jose Manuel Martínez García wrote:
Hello everybody.
I am dealing with an issue on a relatively new Lustre installation.
The Metadata Server (MDS) hangs randomly; it can take anywhere from
30 minutes to 30 days, but it always ends up hanging, and I have not
found any consistent pattern. The logs don't show anything unusual at
the time of the failure. The only thing I continuously see are these
messages:
[lun ene 20 14:17:10 2025] LustreError:
7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:17:10 2025] LustreError:
7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
[lun ene 20 14:21:52 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160
granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582
enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError:
1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331
granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError:
1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous
similar messages
[lun ene 20 14:21:52 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous
similar messages
[lun ene 20 14:27:24 2025] LustreError:
7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:27:24 2025] LustreError:
7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
[lun ene 20 14:31:52 2025] LustreError:
1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688
granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586
enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:31:52 2025] LustreError:
1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous
similar messages
[lun ene 20 14:37:39 2025] LustreError:
7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:37:39 2025] LustreError:
7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
[lun ene 20 14:41:54 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much!
uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169
granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325
enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit:
262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:41:54 2025] LustreError:
1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous
similar messages
[lun ene 20 14:47:53 2025] LustreError:
7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with
-22, flags:0x4 qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1
granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636
qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:47:53 2025] LustreError:
7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous
similar messages
I have ruled out hardware failure since the MDS service has been moved
between different servers, and it happens with all of them.
Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri
Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Lustre release: lustre-2.15.5-1.el8.x86_64
Not using ZFS.
Any ideas on where to continue investigating?
Is the error appearing in dmesg a bug, or is it corruption in the
quota database?
The known quota-related bugs that might be relevant seem to have
already been fixed in version 2.15.
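In case it turns out to be corruption of the on-disk quota
accounting, my understanding is that on ldiskfs targets it can be
regenerated by clearing and re-enabling the quota feature with the
target unmounted. The lines below are only a rough sketch of that
procedure (the device path is a placeholder, and I would double-check
it against the Lustre manual before touching a production MDT):

  # With the MDT stopped and unmounted (placeholder device path)
  tune2fs -O ^quota /dev/mapper/mdt0                # clear the quota feature (drops the accounting files)
  tune2fs -O quota /dev/mapper/mdt0                 # re-enable it; tune2fs rebuilds user/group accounting
  tune2fs -O project -Q prjquota /dev/mapper/mdt0   # re-enable project quota tracking as well
  e2fsck -f /dev/mapper/mdt0                        # verify before remounting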
Thanks in advance.
--
Jose Manuel Martínez García
Systems Coordinator
Supercomputación de Castilla y León
Tel: 987 293 174
Edificio CRAI-TIC, Campus de Vegazana, s/n, Universidad de León - 24071
León, Spain
<https://www.scayle.es/>
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org