what it's worth here's the chunk of debug that seemed suspect. not sure if helps anything
00000008:00000020:30.0:1721247906.483639:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) using 4096 grant credits for brw 000000005265749c page 00000000a0744e3b 00000008:00000020:30.0:1721247906.483640:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) extent 0000000064254af2@{[1466368 -> 1470384/1470463], [2|1|-|active|wi|00000000005ec049], [16478208|4016|+|-|000000000ffe1304|4096|0000000000000000]} expanded for 1470384. 00000008:00000002:30.0:1721247906.483641:0:285997:0:(osc_io.c:278:osc_page_touch_at()) stripe KMS increasing 6022692864->6022696960 6022692864 00000008:00000020:30.0:1721247906.483642:0:285997:0:(osc_object.c:182:osc_attr_update()) set kms from 6022692864to 6022696960 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 on OST[15]: s=6022696960 m=1721247906 a=0 c=1721247906 b=11730952 00020000:00000002:30.0:1721247906.483644:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) COMP ID #-1: s=6022696960 m=1721247906 a=0 c=1721247906 b=11730952 00000008:00000020:30.0:1721247906.483645:0:285997:0:(osc_cache.c:2355:osc_queue_async_io()) obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 000000001625d339 page 000000002b0d2b57 added for cmd 2 00000008:00000020:30.0:1721247906.483646:0:285997:0:(osc_cache.c:1510:osc_enter_cache_try()) snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4017/512000 dirty_pages: 12055/66011400 dropped: 0 avail: 1863200767, dirty_grant: 16478208, reserved: 0, flight: 0 } lru {in list: 1679370, left: 4019, waiters: 0 }need:28672 00000008:00000020:30.0:1721247906.483647:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) using 4096 grant credits for brw 000000004670e8ec page 000000002b0d2b57 00000008:00000020:30.0:1721247906.483648:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) extent 0000000064254af2@{[1466368 -> 1470385/1470463], [2|1|-|active|wi|00000000005ec049], [16482304|4017|+|-|000000000ffe1304|4096|0000000000000000]} expanded for 1470385. 00000008:00000002:30.0:1721247906.483650:0:285997:0:(osc_io.c:278:osc_page_touch_at()) stripe KMS increasing 6022696960->6022701056 6022696960 00000008:00000020:30.0:1721247906.483650:0:285997:0:(osc_object.c:182:osc_attr_update()) set kms from 6022696960to 6022701056 00020000:00000002:30.0:1721247906.483651:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0 00020000:00000002:30.0:1721247906.483652:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 on OST[15]: s=6022701056 m=1721247906 a=0 c=1721247906 b=11730952 00020000:00000002:30.0:1721247906.483653:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) COMP ID #-1: s=6022701056 m=1721247906 a=0 c=1721247906 b=11730952 00000008:00000020:30.0:1721247906.483654:0:285997:0:(osc_cache.c:2355:osc_queue_async_io()) obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 0000000044245bf2 page 00000000ff11bc07 added for cmd 2 00000008:00000020:30.0:1721247906.483655:0:285997:0:(osc_cache.c:1510:osc_enter_cache_try()) snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4018/512000 dirty_pages: 12056/66011400 dropped: 0 avail: 1863196671, dirty_grant: 16482304, reserved: 0, flight: 0 } lru {in list: 1679370, left: 4019, waiters: 0 }need:28672 --- this part above repeats for awhile 00000008:00000020:30.0:1721247906.483656:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) using 4096 grant credits for brw 000000006d9a7b11 page 00000000ff11bc07 00000008:00000020:30.0:1721247906.483657:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) extent 0000000064254af2@{[1466368 -> 1470386/1470463], [2|1|-|active|wi|00000000005ec049], [16486400|4018|+|-|000000000ffe1304|4096|0000000000000000]} expanded for 1470386. 00000008:00000002:30.0:1721247906.483658:0:285997:0:(osc_io.c:278:osc_page_touch_at()) stripe KMS increasing 6022701056->6022701376 6022701056 00000008:00000020:30.0:1721247906.483659:0:285997:0:(osc_object.c:182:osc_attr_update()) set kms from 6022701056to 6022701376 00000008:00000020:30.0:1721247906.483660:0:285997:0:(osc_request.c:763:osc_update_next_shrink()) next time 608550 to shrink grant 00000080:00200000:30.0:1721247906.483664:0:285997:0:(vvp_io.c:1200:vvp_io_write_commit()) Committed 14993 pages 61407552 bytes, tot: 396953728 00020000:00000002:30.0:1721247906.483665:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0 00020000:00000002:30.0:1721247906.483666:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 on OST[15]: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952 00020000:00000002:30.0:1721247906.483667:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) COMP ID #-1: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952 00000080:00200000:30.0:1721247906.483668:0:285997:0:(file.c:1490:ll_merge_attr()) [0x24000a878:0x21:0x0] updating i_size 6022701376 00000080:00200000:30.0:1721247906.483671:0:285997:0:(vvp_io.c:1364:vvp_io_write_start()) bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: write nob 0, result: 396953728 00000008:00000020:30.0:1721247906.483674:0:285997:0:(osc_cache.c:1709:osc_update_pending()) obj 00000000005ec049 ready 0|-|- wr 4019|-|- rd 0|- update pending cmd 2 delta 4019. 00000080:00200000:30.0:1721247906.483687:0:285997:0:(vvp_io.c:537:vvp_io_advance()) advancing 396953728 bytes 00000080:00200000:30.0:1721247906.483690:0:285997:0:(file.c:1768:ll_file_io_generic()) Range unlock [1373473, 1470386] 00000008:00000020:30.0:1721247906.483656:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) using 4096 grant credits for brw 000000006d9a7b11 page 00000000ff11bc07 00000008:00000020:30.0:1721247906.483657:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) extent 0000000064254af2@{[1466368 -> 1470386/1470463], [2|1|-|active|wi|00000000005ec049], [16486400|4018|+|-|000000000ffe1304|4096|0000000000000000]} expanded for 1470386. 00000008:00000002:30.0:1721247906.483658:0:285997:0:(osc_io.c:278:osc_page_touch_at()) stripe KMS increasing 6022701056->6022701376 6022701056 00000008:00000020:30.0:1721247906.483659:0:285997:0:(osc_object.c:182:osc_attr_update()) set kms from 6022701056to 6022701376 00000008:00000020:30.0:1721247906.483660:0:285997:0:(osc_request.c:763:osc_update_next_shrink()) next time 608550 to shrink grant 00000080:00200000:30.0:1721247906.483664:0:285997:0:(vvp_io.c:1200:vvp_io_write_commit()) Committed 14993 pages 61407552 bytes, tot: 396953728 00020000:00000002:30.0:1721247906.483665:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 a=9223372036854775808 c=9223372036854775808 b=0 00020000:00000002:30.0:1721247906.483666:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) MDT ID 0x21:1073784952 on OST[15]: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952 00020000:00000002:30.0:1721247906.483667:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) COMP ID #-1: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952 00000080:00200000:30.0:1721247906.483668:0:285997:0:(file.c:1490:ll_merge_attr()) [0x24000a878:0x21:0x0] updating i_size 6022701376 00000080:00200000:30.0:1721247906.483671:0:285997:0:(vvp_io.c:1364:vvp_io_write_start()) bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: write nob 0, result: 396953728 00000008:00000020:30.0:1721247906.483674:0:285997:0:(osc_cache.c:1709:osc_update_pending()) obj 00000000005ec049 ready 0|-|- wr 4019|-|- rd 0|- update pending cmd 2 delta 4019. 00000080:00200000:30.0:1721247906.483687:0:285997:0:(vvp_io.c:537:vvp_io_advance()) advancing 396953728 bytes 00000080:00200000:30.0:1721247906.483690:0:285997:0:(file.c:1768:ll_file_io_generic()) Range unlock [1373473, 1470386] 00000080:00200000:30.0:1721247906.483693:0:285997:0:(vvp_io.c:307:vvp_io_fini()) [0x24000a878:0x21:0x0] ignore/verify layout 0/0, layout version 0 need write layout 0, restore needed 0 00000080:00200000:30.0:1721247906.483694:0:285997:0:(file.c:1809:ll_file_io_generic()) bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: 2 io complete with rc: 0, result: 2007566464, restart: 0 00000080:00200000:30.0:1721247906.483697:0:285997:0:(file.c:1874:ll_file_io_generic()) iot: 2, result: 2007566464 00000080:00200008:30.0:1721247906.487347:0:285997:0:(file.c:2167:ll_file_write_iter()) file bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt:[0x24000a878:0x21:0x0], ppos: 6022701376, count: 1200 00000080:00200000:30.0:1721247906.487354:0:285997:0:(rw26.c:731:ll_write_begin()) Writing 1470386 of 320 to 1200 bytes 00000080:00200000:30.0:1721247906.487358:0:285997:0:(rw26.c:917:ll_write_end()) pos 6022701376, len 1200, copied 1200 00000008:00000002:30.0:1721247906.487364:0:285997:0:(osc_io.c:278:osc_page_touch_at()) stripe KMS increasing 6022701376->6022702576 6022701376 00000008:00000020:30.0:1721247906.487366:0:285997:0:(osc_object.c:182:osc_attr_update()) set kms from 6022701376to 6022702576 00000080:00200000:30.0:1721247906.487370:0:285997:0:(file.c:2146:ll_do_tiny_write()) result: 1200, original count 1200 00020000:00000002:30.0:1721247906.487378:0:285997:0:(lov_object.c:1182:lov_conf_freeze()) To take share lov(00000000e2618e16) owner 0000000000000000/0000000065ef6d4c 00020000:00000002:30.0:1721247906.487380:0:285997:0:(lov_object.c:1190:lov_conf_thaw()) To release share lov(00000000e2618e16) owner 0000000000000000/0000000065ef6d4c 00000080:00200000:30.0:1721247906.487384:0:285997:0:(file.c:405:ll_file_release()) VFS Op:inode=[0x24000a878:0x21:0x0](000000005f8eb1b5) 00020000:00000002:30.0:1721247906.487385:0:285997:0:(lov_object.c:1182:lov_conf_freeze()) To take share lov(00000000e2618e16) owner 0000000000000000/0000000065ef6d4c 00020000:00000002:30.0:1721247906.487386:0:285997:0:(lov_object.c:1190:lov_conf_thaw()) To release share lov(00000000e2618e16) owner 0000000000000000/0000000065ef6d4c 00800000:00000002:30.0:1721247906.487389:0:285997:0:(lmv_obd.c:3572:lmv_lock_match()) Lock match for [0x24000a878:0x21:0x0] 00800000:00000002:30.0:1721247906.487393:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup()) FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0] 00800000:00000002:30.0:1721247906.487408:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup()) FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0] 00800000:00000002:30.0:1721247906.487409:0:285997:0:(lmv_obd.c:1457:lmv_close()) CLOSE [0x24000a878:0x21:0x0] 00000002:00000002:30.0:1721247906.487411:0:285997:0:(mdc_request.c:902:mdc_close()) snx11255-MDT0001-mdc-ffff891b019c8000: [0x24000a878:0x21:0x0] file closed with intent: 200 00000002:00100000:30.0:1721247906.487418:0:285997:0:(mdc_request.c:944:mdc_close()) @@@ matched open req@00000000da251270 x1804223383477312/t99088005029(99088005029) o101->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:12/10 lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RPQU/4/ffffffff rc 0/-1 job:'' 00000020:00100000:30.0:1721247906.487431:0:285997:0:(genops.c:2299:obd_get_mod_rpc_slot()) snx11255-MDT0001-mdc-ffff891b019c8000: modify RPC slot 1 is allocated opc 35, max 7 00000100:00100000:30.0:1721247906.487434:0:285997:0:(client.c:742:ptlrpc_reassign_next_xid()) @@@ reassign xid req@000000007e3dc9a5 x1804223383853824/t0(0) o35->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:23/10 lens 392/1104 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1 job:'' 00000100:00100000:30.0:1721247906.487443:0:285997:0:(client.c:1727:ptlrpc_send_new_req()) Sending RPC req@000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job pt_main_thread:409f80a0-cdfe-466d-9a49-014ed62a4391:285997:1804223383853824:172.18.20.21@o2ib100:35: 00000100:00100000:30.0:1721247906.487486:0:285997:0:(client.c:2491:ptlrpc_set_wait()) set 000000000180e080 going to sleep for 11 seconds 00000100:00100000:30.0:1721247906.487761:0:285997:0:(client.c:2207:ptlrpc_check_set()) Completed RPC req@000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job pt_main_thread:409f80a0-cdfe-466d-9a49-014ed62a4391:285997:1804223383853824:172.18.20.21@o2ib100:35: 00800000:00000002:30.0:1721247906.487766:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup()) FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0] 00000002:00100000:30.0:1721247906.487768:0:285997:0:(mdc_request.c:852:mdc_free_open()) @@@ free open request, rq_replay=0 req@00000000da251270 x1804223383477312/t99088005029(99088005029) o101->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:12/10 lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RQU/4 /ffffffff rc 0/-1 job:'' --- this part repeats with no other output 00000080:00200000:8.0:1721247906.528444:0:72:0:(vvp_io.c:1788:vvp_io_init()) [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0 00020000:00000002:8.0:1721247906.528446:0:72:0:(lov_object.c:1486:lov_io_init()) [0x24000a876:0x26:0x0]io 000000001dcefeeb type 6 ignore/verify layout 1/0 00020000:00000002:8.0:1721247906.528447:0:72:0:(lov_object.c:1182:lov_conf_freeze()) To take share lov(0000000018a6920a) owner 0000000000000000/0000000096d0da59 00020000:00000002:8.0:1721247906.528448:0:72:0:(lov_object.c:1190:lov_conf_thaw()) To release share lov(0000000018a6920a) owner 0000000000000000/0000000096d0da59 00020000:00200000:8.0:1721247906.528449:0:72:0:(lov_io.c:824:lov_io_iter_init()) component[0] flags 0x10 00020000:00000002:8.0:1721247906.528450:0:72:0:(lov_offset.c:266:lov_stripe_intersects()) [0->-1] -> [(0) 0->-1 (0)] 00000008:00000020:8.0:1721247906.528452:0:72:0:(osc_cache.c:3049:osc_cache_writeback_range()) obj 00000000c9987be6 ready 0|-|- wr 0|-|- rd 0|- pageout [0, 18446744073709551615], 0. 00000080:00200000:8.0:1721247906.528454:0:72:0:(vvp_io.c:307:vvp_io_fini()) [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0 00000080:00200000:8.0:1721247906.528455:0:72:0:(vvp_io.c:1788:vvp_io_init()) [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0 00020000:00000002:8.0:1721247906.528456:0:72:0:(lov_object.c:1486:lov_io_init()) [0x24000a876:0x2b:0x0]io 000000001dcefeeb type 6 ignore/verify layout 1/0 00020000:00000002:8.0:1721247906.528457:0:72:0:(lov_object.c:1182:lov_conf_freeze()) To take share lov(0000000020610153) owner 0000000000000000/0000000096d0da59 00020000:00000002:8.0:1721247906.528458:0:72:0:(lov_object.c:1190:lov_conf_thaw()) To release share lov(0000000020610153) owner 0000000000000000/0000000096d0da59 00020000:00200000:8.0:1721247906.528458:0:72:0:(lov_io.c:824:lov_io_iter_init()) component[0] flags 0x10 00020000:00000002:8.0:1721247906.528459:0:72:0:(lov_offset.c:266:lov_stripe_intersects()) [0->-1] -> [(0) 0->-1 (0)] 00000008:00000020:8.0:1721247906.528460:0:72:0:(osc_cache.c:3049:osc_cache_writeback_range()) obj 000000005da60c8b ready 0|-|- wr 0|-|- rd 0|- pageout [0, 18446744073709551615], 0. 00000080:00200000:8.0:1721247906.528462:0:72:0:(vvp_io.c:307:vvp_io_fini()) [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0 need write layout 0, restore needed 0 00000080:00200000:8.0:1721247906.528463:0:72:0:(vvp_io.c:1788:vvp_io_init()) [0x24000a876:0x29:0x0] ignore/verify layout 1/0, layout version 0 restore needed 0 00020000:00000002:8.0:1721247906.528464:0:72:0:(lov_object.c:1486:lov_io_init()) [0x24000a876:0x29:0x0]io 000000001dcefeeb type 6 ignore/verify layout 1/0 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1182:lov_conf_freeze()) To take share lov(00000000c3ae5165) owner 0000000000000000/0000000096d0da59 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1190:lov_conf_thaw()) To release share lov(00000000c3ae5165) owner 0000000000000000/0000000096d0da59 On Fri, Jul 19, 2024 at 8:58 AM Michael DiDomenico <mdidomeni...@gmail.com> wrote: > > interestingly enough i ran > > cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c > > and while the timeline doesn't show a big gap, the number of records > per time interval does. i can clearly see the slope of the job and > exactly where it seems to have fallen off > > i might be able to get the small chunk of the log exported, but i'll > paraphrase the log a little below > > starting with many entries repeat that repeat > > using 4096 grant cretis for brw > osc grant > stripe kms increasing > set kms from > mdt id initial value > mdt id on ost > comp id #-1 > obj ready > osc grant > > then it stops and this comes out > > ll_mert_attr update i_sze > bf16_zero_pp_rank_7_mp_rank_01_optim_Stats.pt > obj ready update pending cmd > advancing bytes > range unlock > vvp_io_fini() > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt > writing x of x to x bytes > pos len copied > stripe kms increasing > set kms from > ll_do_tiny_write() > to take share lv owner > to release share lov owner > vfs op:onoe > to take share_lov > to_release share lov > lock match for > FLD lookup got mds 1 > fld lookup got mds > CLOSE > mdc file closed with intent 200 > @@@ matched open mdc > mdc modify rpc slot 1 > @@@ reassign xid > sending RPC > set going to sleep for 11 seconds > completed rpc > fld lookup got mds 1 > @@@ free open request > > after the last command, it then spirals out repeating until the end of the log > > vvp_io_init > lov_io_init > lov_conf_freeze to take share > lov_conf_thaw to release share > > On Thu, Jul 18, 2024 at 3:24 PM Oleg Drokin <gr...@whamcloud.com> wrote: > > > > On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote: > > > there's nothing 'grep -i evict' in the lustre_debug logs or from the > > > storage console logs > > > > hm, ok. > > > > > > perhaps see continuity of the timestamps and what happened right > > > > before > > > > and right after the gap if there is one in the times? > > > > > > > > > > i pulled a counter from the logs of the functions calls, maybe one of > > > these looks off (this is just the ones over 100k), please excuse > > > typos > > > > > > $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) | > > > sort > > > > uniq -c | sort -n > > > 105034 lov_io_init > > > 105034 vvp_io_init > > > 105035 lov_io_iter_init > > > 105035 lob_strip_intersects > > > 105035 osc_cache_writeback_range > > > 105043 vvp_io_fini > > > 105050 lov_conf_freeze > > > 105050 lov_conf_thaw > > > 294806 osc_attr-update > > > 294806 osc_page_touch_at > > > 294814 osc_consume_write_grant > > > 294815 lov_attr_get_composite > > > 294816 osc_enter_cache_try > > > 351044 ll_write_end > > > 589549 osc_queueu_async_io > > > 589617 lov_merge_lvm_kms > > > > that's not really going to do anything useful, there's a timestamp in > > unix time as the fourth field (separated with colons), see if there are > > gaps there. > > I imagine there's going to be real dense (time-wise) activity) then an > > RPC is prepared and send (Sending RPC ....) and then a lot sparser > > activity perhaps with multi-second pauses) and then eventually it'll > > pick up after gettign a server response for example? > > Though none of that explains why lctl would hang I guess, but still > > _______________________________________________ lustre-discuss mailing list lustre-discuss@lists.lustre.org http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org