Hello again,

Yesterday the MDS server crashed twice (the whole machine).

The first one was before 22:57, and the second one was at 00:15 today.

Below are the Lustre-related logs. The server was manually rebooted after the first hang, and at 22:57 Lustre started the MDT recovery. After recovery, the whole system was working properly until around 23:00, when the data became inaccessible to the clients. Finally, the server hung again at 00:15, but the last Lustre log entry is from 23:26.


Here I can see a line I have not seen before: "$$$ failed to release quota space on glimpse 0!=60826269226353608".




Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Imperative Recovery not enabled, recovery window 300-900
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: in recovery but waiting for the first client to connect
Jan 24 22:57:08 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Will be in recovery for at least 5:00, or until 125 clients reconnect
Jan 24 22:57:13 srv-lustre11 kernel: LustreError: 3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@000000003892d67b x1812509961058304/t0(0) o601->LUSTRE-MDT0000-lwp-OST0c20_UUID@10.5.33.243@o2ib1:274/0 lens 336/0 e 0 to 0 dl 1737755839 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:13 srv-lustre11 kernel: LustreError: 3949134:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 1 previous similar message
Jan 24 22:57:20 srv-lustre11 kernel: LustreError: 3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@000000009a279624 x1812509773308160/t0(0) o601->LUSTRE-MDT0000-lwp-OST0fa7_UUID@10.5.33.244@o2ib1:281/0 lens 336/0 e 0 to 0 dl 1737755846 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:20 srv-lustre11 kernel: LustreError: 3949407:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 9 previous similar messages
Jan 24 22:57:21 srv-lustre11 kernel: LustreError: 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@000000000db38b1b x1812509961083456/t0(0) o601->LUSTRE-MDT0000-lwp-OST0c1e_UUID@10.5.33.243@o2ib1:282/0 lens 336/0 e 0 to 0 dl 1737755847 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:21 srv-lustre11 kernel: LustreError: 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 12 previous similar messages
Jan 24 22:57:24 srv-lustre11 kernel: LustreError: 3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@0000000034e830d1 x1812509773318336/t0(0) o601->LUSTRE-MDT0000-lwp-OST0fa1_UUID@10.5.33.244@o2ib1:285/0 lens 336/0 e 0 to 0 dl 1737755850 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:24 srv-lustre11 kernel: LustreError: 3949411:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 8 previous similar messages
Jan 24 22:57:30 srv-lustre11 kernel: LustreError: 3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@00000000e40a36e5 x1812509961108224/t0(0) o601->LUSTRE-MDT0000-lwp-OST0bbc_UUID@10.5.33.243@o2ib1:291/0 lens 336/0 e 0 to 0 dl 1737755856 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:30 srv-lustre11 kernel: LustreError: 3949406:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 24 previous similar messages
Jan 24 22:57:38 srv-lustre11 kernel: LustreError: 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@000000004a78941b x1812509961124480/t0(0) o601->LUSTRE-MDT0000-lwp-OST0c1d_UUID@10.5.33.243@o2ib1:299/0 lens 336/0 e 0 to 0 dl 1737755864 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:38 srv-lustre11 kernel: LustreError: 3949413:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 57 previous similar messages
Jan 24 22:57:57 srv-lustre11 kernel: LustreError: 3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) @@@ not permitted during recovery  req@000000002220d707 x1812509773390720/t0(0) o601->LUSTRE-MDT0000-lwp-OST139c_UUID@10.5.33.244@o2ib1:318/0 lens 336/0 e 0 to 0 dl 1737755883 ref 1 fl Interpret:/0/ffffffff rc 0/-1 job:'lquota_wb_LUSTR.0'
Jan 24 22:57:57 srv-lustre11 kernel: LustreError: 3949482:0:(tgt_handler.c:539:tgt_filter_recovery_request()) Skipped 99 previous similar messages
Jan 24 22:58:15 srv-lustre11 kernel: Lustre: LUSTRE-MDT0000: Recovery over after 1:07, of 125 clients 125 recovered and 0 were evicted.
Jan 24 22:58:50 srv-lustre11 kernel: LustreError: 3949159:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0bc4_UUID release: 60826269226353608 granted:66040, total:13781524  qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 enforced:1 hard:62914560 soft:52428800 granted:13781524 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
Jan 24 22:58:50 srv-lustre11 kernel: LustreError: 3949159:0:(qmt_lock.c:425:qmt_lvbo_update()) $$$ failed to release quota space on glimpse 0!=60826269226353608 : rc = -22#012  qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2949 enforced:1 hard:62914560 soft:52428800 granted:13781524 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
Jan 24 23:08:52 srv-lustre11 kernel: Lustre: LUSTRE-OST1389-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:09:39 srv-lustre11 kernel: Lustre: LUSTRE-OST138b-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:10:24 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST138d-osc-MDT0000: operation ost_connect to node 10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node 10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:10:32 srv-lustre11 kernel: LustreError: Skipped 5 previous similar messages
Jan 24 23:11:18 srv-lustre11 kernel: Lustre: LUSTRE-OST138d-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:11:25 srv-lustre11 kernel: Lustre: LUSTRE-OST138a-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST1390-osc-MDT0000: operation ost_connect to node 10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:12:09 srv-lustre11 kernel: LustreError: Skipped 3 previous similar messages
Jan 24 23:12:09 srv-lustre11 kernel: Lustre: LUSTRE-OST138e-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node 10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:12:23 srv-lustre11 kernel: LustreError: Skipped 3 previous similar messages
Jan 24 23:12:58 srv-lustre11 kernel: Lustre: LUSTRE-OST138f-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: LUSTRE-OST1390-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:13:46 srv-lustre11 kernel: Lustre: Skipped 1 previous similar message
Jan 24 23:14:35 srv-lustre11 kernel: Lustre: LUSTRE-OST1391-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST1392-osc-MDT0000: operation ost_connect to node 10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:14:36 srv-lustre11 kernel: LustreError: Skipped 3 previous similar messages
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13ef-osc-MDT0000: operation ost_connect to node 10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:16:48 srv-lustre11 kernel: LustreError: Skipped 4 previous similar messages
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: LUSTRE-OST03f3-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:17:02 srv-lustre11 kernel: Lustre: Skipped 1 previous similar message
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: LUSTRE-OST03f6-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:19:33 srv-lustre11 kernel: Lustre: Skipped 2 previous similar messages
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node 10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:19:41 srv-lustre11 kernel: LustreError: Skipped 3 previous similar messages
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node 10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:22:11 srv-lustre11 kernel: LustreError: Skipped 3 previous similar messages
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13ed-osc-MDT0000: operation ost_connect to node 10.5.33.245@o2ib1 failed: rc = -19
Jan 24 23:23:59 srv-lustre11 kernel: LustreError: Skipped 3 previous similar messages
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: LUSTRE-OST03fc-osc-MDT0000: Connection restored to 10.5.33.245@o2ib1 (at 10.5.33.245@o2ib1)
Jan 24 23:24:29 srv-lustre11 kernel: Lustre: Skipped 5 previous similar messages
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: 11-0: LUSTRE-OST13f0-osc-MDT0000: operation ost_connect to node 10.5.33.244@o2ib1 failed: rc = -19
Jan 24 23:26:33 srv-lustre11 kernel: LustreError: Skipped 5 previous similar messages
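
For reference, this is roughly how I am checking the quota state for the project id that appears in that message (id 2949). The mount point /mnt/lustre is just our setup, and the parameter names are what I understand them to be in 2.15, so please take this as a sketch rather than a recipe:

# On a client: usage and limits for the project as the clients see it
lfs quota -p 2949 /mnt/lustre

# On the MDS (quota master): dump the global project-quota index for the data pool
lctl get_param qmt.LUSTRE-QMT0000.dt-0x0.glb-prj

# On an OSS (quota slave): state of the quota slave for the data pool
lctl get_param osd-*.*.quota_slave.info

So far the client-side numbers look sane to me; it is only the kernel messages above that show the huge "release" values.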



Thanks.

Jose.


On 21/01/2025 at 10:34, Jose Manuel Martínez García wrote:

Hello everybody.


I am dealing with an issue on a relatively new Lustre installation. The Metadata Server (MDS) hangs randomly: it can take anywhere from 30 minutes to 30 days, but it always ends up hanging, and I have not been able to find a consistent pattern. The logs don't show anything unusual at the time of the failure. The only thing I see continuously are these messages:

[lun ene 20 14:17:10 2025] LustreError: 7068:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:17:10 2025] LustreError: 7068:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
[lun ene 20 14:21:52 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0c1f_UUID release: 15476132855418716160 granted:262144, total:14257500 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2582 enforced:1 hard:62914560 soft:52428800 granted:14257500 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError: 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0fb2_UUID release: 13809297465413342331 granted:66568, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:21:52 2025] LustreError: 1947381:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar messages
[lun ene 20 14:21:52 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 802 previous similar messages
[lun ene 20 14:27:24 2025] LustreError: 7047:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:27:24 2025] LustreError: 7047:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
[lun ene 20 14:31:52 2025] LustreError: 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST1399_UUID release: 12882711387029922688 granted:66116, total:14078012 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2586 enforced:1 hard:62914560 soft:52428800 granted:14078012 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:31:52 2025] LustreError: 1844354:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 785 previous similar messages
[lun ene 20 14:37:39 2025] LustreError: 7054:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:37:39 2025] LustreError: 7054:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
[lun ene 20 14:41:54 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) $$$ Release too much! uuid:LUSTRE-MDT0000-lwp-OST0faa_UUID release: 13811459193234480169 granted:65632, total:14179564 qmt:LUSTRE-QMT0000 pool:dt-0x0 id:2325 enforced:1 hard:62914560 soft:52428800 granted:14179564 time:0 qunit: 262144 edquot:0 may_rel:0 revoke:0 default:yes
[lun ene 20 14:41:54 2025] LustreError: 1895328:0:(qmt_handler.c:798:qmt_dqacq0()) Skipped 798 previous similar messages
[lun ene 20 14:47:53 2025] LustreError: 7052:0:(qsd_handler.c:340:qsd_req_completion()) $$$ DQACQ failed with -22, flags:0x4  qsd:LUSTRE-OST138f qtype:prj id:2325 enforced:1 granted: 16304159618662232032 pending:0 waiting:0 req:1 usage: 114636 qunit:262144 qtune:65536 edquot:0 default:yes
[lun ene 20 14:47:53 2025] LustreError: 7052:0:(qsd_handler.c:340:qsd_req_completion()) Skipped 39 previous similar messages
I have ruled out a hardware failure, since the MDS service has been moved between different servers and the problem occurs on all of them.

Linux distribution: AlmaLinux release 8.10 (Cerulean Leopard)
Kernel: Linux srv-lustre15 4.18.0-553.5.1.el8_lustre.x86_64 #1 SMP Fri Jun 28 18:44:24 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Lustre release: lustre-2.15.5-1.el8.x86_64
Not using ZFS.

Any ideas on where to continue investigating?
Is the error that appears in dmesg a bug, or does it indicate corruption in the quota database?

The quota-related bugs I have found that might explain this appear to have been fixed by version 2.15.
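
If it does turn out to be corrupted quota accounting rather than a bug, would regenerating the quota files on the ldiskfs targets be the right fix? My understanding (still to be confirmed, and the device path below is only an example from our setup) is that it would be something like:

# with the target stopped/unmounted
tune2fs -O ^quota /dev/mapper/mdt0   # drop the quota feature and its accounting files
tune2fs -O quota /dev/mapper/mdt0    # re-enable it so the accounting is rebuilt from current usage
# then mount the target again and re-check with: lfs quota -p <projid> <mountpoint>

I am not sure whether the project accounting needs anything extra (e.g. tune2fs -Q prjquota) or whether this is safe on an MDT at all, so any guidance here would be appreciated.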


Thanks in advance.

--

Jose Manuel Martínez García

Systems Coordinator

Supercomputación de Castilla y León

Tel: 987 293 174

Edificio CRAI-TIC, Campus de Vegazana, s/n Universidad de León - 24071 León, España

<https://www.scayle.es/>



_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
