is pytorch using HDF5? i wasn't aware that it did. we are using 'flock' on all the client mounts
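for anyone else following along, roughly what i mean on the mount side, plus feng's hdf5 workaround from further down the thread in case pytorch really is going through h5py/netcdf4 somewhere under the hood. this is an untested sketch; the mgs nid, fsname, and mount point are just placeholders:

  # lustre client mount with flock enabled (what we already do on all clients)
  mount -t lustre -o flock mgs@o2ib:/fsname /lustre

  # hdf5 1.10+ workaround from feng's note; set before the python job starts
  export HDF5_USE_FILE_LOCKING=FALSE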
On Wed, Jul 24, 2024 at 10:26 PM Feng Zhang <prod.f...@gmail.com> wrote:
>
> Not sure if it is related, but I dug this out of my notes:
>
> "Looks like the Lustre storage does not support the HDF5 lock. The
> local hard drive ($TMPDIR), or NFS, works fine. If you need to read HDF5
> files on /lustre, just add:
>
> export HDF5_USE_FILE_LOCKING=FALSE
> "
> (Older versions of netcdf4 and hdf5 work fine; hdf5 1.10 and later have
> this issue.) Since you are using pytorch, it might be related.
>
> The Lustre client mount also supports a flag to enable this kind of
> locking... you will need to specify it explicitly when mounting (I
> don't remember which one).
>
> Best,
>
> Feng
>
> On Wed, Jul 24, 2024 at 3:53 PM Feng Zhang <prod.f...@gmail.com> wrote:
> >
> > I have a vague memory from many years ago of an issue with a file
> > lock on lustre that caused a python app to get stuck (or fail).
> >
> > Best,
> >
> > Feng
> >
> > On Wed, Jul 24, 2024 at 2:40 PM Michael DiDomenico via lustre-discuss
> > <lustre-discuss@lists.lustre.org> wrote:
> > >
> > > for what it's worth, here's the chunk of debug that seemed suspect. not
> > > sure if it helps anything
> > >
> > > 00000008:00000020:30.0:1721247906.483639:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) > > > using 4096 grant credits for brw 000000005265749c page > > > 00000000a0744e3b > > > 00000008:00000020:30.0:1721247906.483640:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) > > > extent 0000000064254af2@{[1466368 -> 1470384/1470463], > > > [2|1|-|active|wi|00000000005ec049], > > > [16478208|4016|+|-|000000000ffe1304|4096|0000000000000000]} expanded > > > for 1470384. > > > 00000008:00000002:30.0:1721247906.483641:0:285997:0:(osc_io.c:278:osc_page_touch_at()) > > > stripe KMS increasing 6022692864->6022696960 6022692864 > > > 00000008:00000020:30.0:1721247906.483642:0:285997:0:(osc_object.c:182:osc_attr_update()) > > > set kms from 6022692864to 6022696960 > > > 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 > > > a=9223372036854775808 c=9223372036854775808 b=0 > > > 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 on OST[15]: s=6022696960 m=1721247906 a=0 > > > c=1721247906 b=11730952 > > > 00020000:00000002:30.0:1721247906.483644:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) > > > COMP ID #-1: s=6022696960 m=1721247906 a=0 c=1721247906 b=11730952 > > > 00000008:00000020:30.0:1721247906.483645:0:285997:0:(osc_cache.c:2355:osc_queue_async_io()) > > > obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 000000001625d339 > > > page 000000002b0d2b57 added for cmd 2 > > > 00000008:00000020:30.0:1721247906.483646:0:285997:0:(osc_cache.c:1510:osc_enter_cache_try()) > > > snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4017/512000 > > > dirty_pages: 12055/66011400 dropped: 0 avail: 1863200767, dirty_grant: > > > 16478208, reserved: 0, flight: 0 } lru {in > > > list: 1679370, left: 4019, waiters: 0 }need:28672 > > > 00000008:00000020:30.0:1721247906.483647:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) > > > using 4096 grant credits for brw 000000004670e8ec page > > > 000000002b0d2b57 > > > 00000008:00000020:30.0:1721247906.483648:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) > > > extent 0000000064254af2@{[1466368 -> 1470385/1470463], > > > [2|1|-|active|wi|00000000005ec049], > > >
[16482304|4017|+|-|000000000ffe1304|4096|0000000000000000]} expanded > > > for 1470385. > > > 00000008:00000002:30.0:1721247906.483650:0:285997:0:(osc_io.c:278:osc_page_touch_at()) > > > stripe KMS increasing 6022696960->6022701056 6022696960 > > > 00000008:00000020:30.0:1721247906.483650:0:285997:0:(osc_object.c:182:osc_attr_update()) > > > set kms from 6022696960to 6022701056 > > > 00020000:00000002:30.0:1721247906.483651:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 > > > a=9223372036854775808 c=9223372036854775808 b=0 > > > 00020000:00000002:30.0:1721247906.483652:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 on OST[15]: s=6022701056 m=1721247906 a=0 > > > c=1721247906 b=11730952 > > > 00020000:00000002:30.0:1721247906.483653:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) > > > COMP ID #-1: s=6022701056 m=1721247906 a=0 c=1721247906 b=11730952 > > > 00000008:00000020:30.0:1721247906.483654:0:285997:0:(osc_cache.c:2355:osc_queue_async_io()) > > > obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 0000000044245bf2 > > > page 00000000ff11bc07 added for cmd 2 > > > 00000008:00000020:30.0:1721247906.483655:0:285997:0:(osc_cache.c:1510:osc_enter_cache_try()) > > > snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4018/512000 > > > dirty_pages: 12056/66011400 dropped: 0 avail: 1863196671, dirty_grant: > > > 16482304, reserved: 0, flight: 0 } lru {in > > > list: 1679370, left: 4019, waiters: 0 }need:28672 > > > > > > --- this part above repeats for awhile > > > > > > 00000008:00000020:30.0:1721247906.483656:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) > > > using 4096 grant credits for brw 000000006d9a7b11 page > > > 00000000ff11bc07 > > > 00000008:00000020:30.0:1721247906.483657:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) > > > extent 0000000064254af2@{[1466368 -> 1470386/1470463], > > > [2|1|-|active|wi|00000000005ec049], > > > [16486400|4018|+|-|000000000ffe1304|4096|0000000000000000]} expanded > > > for 1470386. 
> > > 00000008:00000002:30.0:1721247906.483658:0:285997:0:(osc_io.c:278:osc_page_touch_at()) > > > stripe KMS increasing 6022701056->6022701376 6022701056 > > > 00000008:00000020:30.0:1721247906.483659:0:285997:0:(osc_object.c:182:osc_attr_update()) > > > set kms from 6022701056to 6022701376 > > > 00000008:00000020:30.0:1721247906.483660:0:285997:0:(osc_request.c:763:osc_update_next_shrink()) > > > next time 608550 to shrink grant > > > 00000080:00200000:30.0:1721247906.483664:0:285997:0:(vvp_io.c:1200:vvp_io_write_commit()) > > > Committed 14993 pages 61407552 bytes, tot: 396953728 > > > 00020000:00000002:30.0:1721247906.483665:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 > > > a=9223372036854775808 c=9223372036854775808 b=0 > > > 00020000:00000002:30.0:1721247906.483666:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 on OST[15]: s=6022701376 m=1721247906 a=0 > > > c=1721247906 b=11730952 > > > 00020000:00000002:30.0:1721247906.483667:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) > > > COMP ID #-1: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952 > > > 00000080:00200000:30.0:1721247906.483668:0:285997:0:(file.c:1490:ll_merge_attr()) > > > [0x24000a878:0x21:0x0] updating i_size 6022701376 > > > 00000080:00200000:30.0:1721247906.483671:0:285997:0:(vvp_io.c:1364:vvp_io_write_start()) > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: write nob 0, result: > > > 396953728 > > > 00000008:00000020:30.0:1721247906.483674:0:285997:0:(osc_cache.c:1709:osc_update_pending()) > > > obj 00000000005ec049 ready 0|-|- wr 4019|-|- rd 0|- update pending cmd > > > 2 delta 4019. > > > 00000080:00200000:30.0:1721247906.483687:0:285997:0:(vvp_io.c:537:vvp_io_advance()) > > > advancing 396953728 bytes > > > 00000080:00200000:30.0:1721247906.483690:0:285997:0:(file.c:1768:ll_file_io_generic()) > > > Range unlock [1373473, 1470386] > > > 00000008:00000020:30.0:1721247906.483656:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant()) > > > using 4096 grant credits for brw 000000006d9a7b11 page > > > 00000000ff11bc07 > > > 00000008:00000020:30.0:1721247906.483657:0:285997:0:(osc_cache.c:2388:osc_queue_async_io()) > > > extent 0000000064254af2@{[1466368 -> 1470386/1470463], > > > [2|1|-|active|wi|00000000005ec049], > > > [16486400|4018|+|-|000000000ffe1304|4096|0000000000000000]} expanded > > > for 1470386. 
> > > 00000008:00000002:30.0:1721247906.483658:0:285997:0:(osc_io.c:278:osc_page_touch_at()) > > > stripe KMS increasing 6022701056->6022701376 6022701056 > > > 00000008:00000020:30.0:1721247906.483659:0:285997:0:(osc_object.c:182:osc_attr_update()) > > > set kms from 6022701056to 6022701376 > > > 00000008:00000020:30.0:1721247906.483660:0:285997:0:(osc_request.c:763:osc_update_next_shrink()) > > > next time 608550 to shrink grant > > > 00000080:00200000:30.0:1721247906.483664:0:285997:0:(vvp_io.c:1200:vvp_io_write_commit()) > > > Committed 14993 pages 61407552 bytes, tot: 396953728 > > > 00020000:00000002:30.0:1721247906.483665:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808 > > > a=9223372036854775808 c=9223372036854775808 b=0 > > > 00020000:00000002:30.0:1721247906.483666:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms()) > > > MDT ID 0x21:1073784952 on OST[15]: s=6022701376 m=1721247906 a=0 > > > c=1721247906 b=11730952 > > > 00020000:00000002:30.0:1721247906.483667:0:285997:0:(lov_object.c:1050:lov_attr_get_composite()) > > > COMP ID #-1: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952 > > > 00000080:00200000:30.0:1721247906.483668:0:285997:0:(file.c:1490:ll_merge_attr()) > > > [0x24000a878:0x21:0x0] updating i_size 6022701376 > > > 00000080:00200000:30.0:1721247906.483671:0:285997:0:(vvp_io.c:1364:vvp_io_write_start()) > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: write nob 0, result: > > > 396953728 > > > 00000008:00000020:30.0:1721247906.483674:0:285997:0:(osc_cache.c:1709:osc_update_pending()) > > > obj 00000000005ec049 ready 0|-|- wr 4019|-|- rd 0|- update pending cmd > > > 2 delta 4019. > > > 00000080:00200000:30.0:1721247906.483687:0:285997:0:(vvp_io.c:537:vvp_io_advance()) > > > advancing 396953728 bytes > > > 00000080:00200000:30.0:1721247906.483690:0:285997:0:(file.c:1768:ll_file_io_generic()) > > > Range unlock [1373473, 1470386] > > > 00000080:00200000:30.0:1721247906.483693:0:285997:0:(vvp_io.c:307:vvp_io_fini()) > > > [0x24000a878:0x21:0x0] ignore/verify layout 0/0, layout version 0 need > > > write layout 0, restore needed 0 > > > 00000080:00200000:30.0:1721247906.483694:0:285997:0:(file.c:1809:ll_file_io_generic()) > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: 2 io complete with rc: > > > 0, result: 2007566464, restart: 0 > > > 00000080:00200000:30.0:1721247906.483697:0:285997:0:(file.c:1874:ll_file_io_generic()) > > > iot: 2, result: 2007566464 > > > 00000080:00200008:30.0:1721247906.487347:0:285997:0:(file.c:2167:ll_file_write_iter()) > > > file > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt:[0x24000a878:0x21:0x0], > > > ppos: 6022701376, count: 1200 > > > 00000080:00200000:30.0:1721247906.487354:0:285997:0:(rw26.c:731:ll_write_begin()) > > > Writing 1470386 of 320 to 1200 bytes > > > 00000080:00200000:30.0:1721247906.487358:0:285997:0:(rw26.c:917:ll_write_end()) > > > pos 6022701376, len 1200, copied 1200 > > > 00000008:00000002:30.0:1721247906.487364:0:285997:0:(osc_io.c:278:osc_page_touch_at()) > > > stripe KMS increasing 6022701376->6022702576 6022701376 > > > 00000008:00000020:30.0:1721247906.487366:0:285997:0:(osc_object.c:182:osc_attr_update()) > > > set kms from 6022701376to 6022702576 > > > 00000080:00200000:30.0:1721247906.487370:0:285997:0:(file.c:2146:ll_do_tiny_write()) > > > result: 1200, original count 1200 > > > 00020000:00000002:30.0:1721247906.487378:0:285997:0:(lov_object.c:1182:lov_conf_freeze()) > > > To take share lov(00000000e2618e16) owner > > > 
0000000000000000/0000000065ef6d4c > > > 00020000:00000002:30.0:1721247906.487380:0:285997:0:(lov_object.c:1190:lov_conf_thaw()) > > > To release share lov(00000000e2618e16) owner > > > 0000000000000000/0000000065ef6d4c > > > 00000080:00200000:30.0:1721247906.487384:0:285997:0:(file.c:405:ll_file_release()) > > > VFS Op:inode=[0x24000a878:0x21:0x0](000000005f8eb1b5) > > > 00020000:00000002:30.0:1721247906.487385:0:285997:0:(lov_object.c:1182:lov_conf_freeze()) > > > To take share lov(00000000e2618e16) owner > > > 0000000000000000/0000000065ef6d4c > > > 00020000:00000002:30.0:1721247906.487386:0:285997:0:(lov_object.c:1190:lov_conf_thaw()) > > > To release share lov(00000000e2618e16) owner > > > 0000000000000000/0000000065ef6d4c > > > 00800000:00000002:30.0:1721247906.487389:0:285997:0:(lmv_obd.c:3572:lmv_lock_match()) > > > Lock match for [0x24000a878:0x21:0x0] > > > 00800000:00000002:30.0:1721247906.487393:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup()) > > > FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0] > > > 00800000:00000002:30.0:1721247906.487408:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup()) > > > FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0] > > > 00800000:00000002:30.0:1721247906.487409:0:285997:0:(lmv_obd.c:1457:lmv_close()) > > > CLOSE [0x24000a878:0x21:0x0] > > > 00000002:00000002:30.0:1721247906.487411:0:285997:0:(mdc_request.c:902:mdc_close()) > > > snx11255-MDT0001-mdc-ffff891b019c8000: [0x24000a878:0x21:0x0] file > > > closed with intent: 200 > > > 00000002:00100000:30.0:1721247906.487418:0:285997:0:(mdc_request.c:944:mdc_close()) > > > @@@ matched open req@00000000da251270 > > > x1804223383477312/t99088005029(99088005029) > > > o101->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:12/10 > > > lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RPQU/4/ffffffff > > > rc 0/-1 job:'' > > > 00000020:00100000:30.0:1721247906.487431:0:285997:0:(genops.c:2299:obd_get_mod_rpc_slot()) > > > snx11255-MDT0001-mdc-ffff891b019c8000: modify RPC slot 1 is allocated > > > opc 35, max 7 > > > 00000100:00100000:30.0:1721247906.487434:0:285997:0:(client.c:742:ptlrpc_reassign_next_xid()) > > > @@@ reassign xid req@000000007e3dc9a5 x1804223383853824/t0(0) > > > o35->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:23/10 > > > lens 392/1104 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1 job:'' > > > 00000100:00100000:30.0:1721247906.487443:0:285997:0:(client.c:1727:ptlrpc_send_new_req()) > > > Sending RPC req@000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job > > > pt_main_thread:409f80a0-cdfe-466d-9a49-014ed62a4391:285997:1804223383853824:172.18.20.21@o2ib100:35: > > > 00000100:00100000:30.0:1721247906.487486:0:285997:0:(client.c:2491:ptlrpc_set_wait()) > > > set 000000000180e080 going to sleep for 11 seconds > > > 00000100:00100000:30.0:1721247906.487761:0:285997:0:(client.c:2207:ptlrpc_check_set()) > > > Completed RPC req@000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job > > > pt_main_thread:409f80a0-cdfe-466d-9a49-014ed62a4391:285997:1804223383853824:172.18.20.21@o2ib100:35: > > > 00800000:00000002:30.0:1721247906.487766:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup()) > > > FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0] > > > 00000002:00100000:30.0:1721247906.487768:0:285997:0:(mdc_request.c:852:mdc_free_open()) > > > @@@ free open request, rq_replay=0 req@00000000da251270 > > > x1804223383477312/t99088005029(99088005029) > > > o101->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:12/10 > > > lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RQU/4 > 
> > /ffffffff rc 0/-1 job:'' > > > > > > --- this part repeats with no other output > > > > > > 00000080:00200000:8.0:1721247906.528444:0:72:0:(vvp_io.c:1788:vvp_io_init()) > > > [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0 > > > restore needed 0 > > > 00020000:00000002:8.0:1721247906.528446:0:72:0:(lov_object.c:1486:lov_io_init()) > > > [0x24000a876:0x26:0x0]io 000000001dcefeeb type 6 ignore/verify layout > > > 1/0 > > > 00020000:00000002:8.0:1721247906.528447:0:72:0:(lov_object.c:1182:lov_conf_freeze()) > > > To take share lov(0000000018a6920a) owner > > > 0000000000000000/0000000096d0da59 > > > 00020000:00000002:8.0:1721247906.528448:0:72:0:(lov_object.c:1190:lov_conf_thaw()) > > > To release share lov(0000000018a6920a) owner > > > 0000000000000000/0000000096d0da59 > > > 00020000:00200000:8.0:1721247906.528449:0:72:0:(lov_io.c:824:lov_io_iter_init()) > > > component[0] flags 0x10 > > > 00020000:00000002:8.0:1721247906.528450:0:72:0:(lov_offset.c:266:lov_stripe_intersects()) > > > [0->-1] -> [(0) 0->-1 (0)] > > > 00000008:00000020:8.0:1721247906.528452:0:72:0:(osc_cache.c:3049:osc_cache_writeback_range()) > > > obj 00000000c9987be6 ready 0|-|- wr 0|-|- rd 0|- pageout [0, > > > 18446744073709551615], 0. > > > 00000080:00200000:8.0:1721247906.528454:0:72:0:(vvp_io.c:307:vvp_io_fini()) > > > [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0 need > > > write layout 0, restore needed 0 > > > 00000080:00200000:8.0:1721247906.528455:0:72:0:(vvp_io.c:1788:vvp_io_init()) > > > [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0 > > > restore needed 0 > > > 00020000:00000002:8.0:1721247906.528456:0:72:0:(lov_object.c:1486:lov_io_init()) > > > [0x24000a876:0x2b:0x0]io 000000001dcefeeb type 6 ignore/verify layout > > > 1/0 > > > 00020000:00000002:8.0:1721247906.528457:0:72:0:(lov_object.c:1182:lov_conf_freeze()) > > > To take share lov(0000000020610153) owner > > > 0000000000000000/0000000096d0da59 > > > 00020000:00000002:8.0:1721247906.528458:0:72:0:(lov_object.c:1190:lov_conf_thaw()) > > > To release share lov(0000000020610153) owner > > > 0000000000000000/0000000096d0da59 > > > 00020000:00200000:8.0:1721247906.528458:0:72:0:(lov_io.c:824:lov_io_iter_init()) > > > component[0] flags 0x10 > > > 00020000:00000002:8.0:1721247906.528459:0:72:0:(lov_offset.c:266:lov_stripe_intersects()) > > > [0->-1] -> [(0) 0->-1 (0)] > > > 00000008:00000020:8.0:1721247906.528460:0:72:0:(osc_cache.c:3049:osc_cache_writeback_range()) > > > obj 000000005da60c8b ready 0|-|- wr 0|-|- rd 0|- pageout [0, > > > 18446744073709551615], 0. 
> > > 00000080:00200000:8.0:1721247906.528462:0:72:0:(vvp_io.c:307:vvp_io_fini()) > > > [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0 need > > > write layout 0, restore needed 0 > > > 00000080:00200000:8.0:1721247906.528463:0:72:0:(vvp_io.c:1788:vvp_io_init()) > > > [0x24000a876:0x29:0x0] ignore/verify layout 1/0, layout version 0 > > > restore needed 0 > > > 00020000:00000002:8.0:1721247906.528464:0:72:0:(lov_object.c:1486:lov_io_init()) > > > [0x24000a876:0x29:0x0]io 000000001dcefeeb type 6 ignore/verify layout > > > 1/0 > > > 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1182:lov_conf_freeze()) > > > To take share lov(00000000c3ae5165) owner > > > 0000000000000000/0000000096d0da59 > > > 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1190:lov_conf_thaw()) > > > To release share lov(00000000c3ae5165) owner > > > 0000000000000000/0000000096d0da59
> > >
> > > On Fri, Jul 19, 2024 at 8:58 AM Michael DiDomenico
> > > <mdidomeni...@gmail.com> wrote:
> > > >
> > > > interestingly enough i ran
> > > >
> > > > cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c
> > > >
> > > > and while the timeline doesn't show a big gap, the number of records
> > > > per time interval does. i can clearly see the slope of the job and
> > > > exactly where it seems to have fallen off
> > > >
> > > > i might be able to get a small chunk of the log exported, but i'll
> > > > paraphrase the log a little below
> > > >
> > > > starting with many entries that repeat
> > > >
> > > > using 4096 grant credits for brw
> > > > osc grant
> > > > stripe kms increasing
> > > > set kms from
> > > > mdt id initial value
> > > > mdt id on ost
> > > > comp id #-1
> > > > obj ready
> > > > osc grant
> > > >
> > > > then it stops and this comes out
> > > >
> > > > ll_merge_attr updating i_size
> > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
> > > > obj ready update pending cmd
> > > > advancing bytes
> > > > range unlock
> > > > vvp_io_fini()
> > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
> > > > writing x of x to x bytes
> > > > pos len copied
> > > > stripe kms increasing
> > > > set kms from
> > > > ll_do_tiny_write()
> > > > to take share lov owner
> > > > to release share lov owner
> > > > vfs op:inode
> > > > to take share lov
> > > > to release share lov
> > > > lock match for
> > > > FLD lookup got mds 1
> > > > fld lookup got mds
> > > > CLOSE
> > > > mdc file closed with intent 200
> > > > @@@ matched open mdc
> > > > mdc modify rpc slot 1
> > > > @@@ reassign xid
> > > > sending RPC
> > > > set going to sleep for 11 seconds
> > > > completed rpc
> > > > fld lookup got mds 1
> > > > @@@ free open request
> > > >
> > > > after the last command, it then spirals out repeating until the end of
> > > > the log
> > > >
> > > > vvp_io_init
> > > > lov_io_init
> > > > lov_conf_freeze to take share
> > > > lov_conf_thaw to release share
> > > >
> > > > On Thu, Jul 18, 2024 at 3:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> > > > >
> > > > > On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote:
> > > > > > there's nothing from 'grep -i evict' in the lustre_debug logs or
> > > > > > from the storage console logs
> > > > >
> > > > > hm, ok.
> > > > >
> > > > > > > perhaps see continuity of the timestamps and what happened right
> > > > > > > before and right after the gap if there is one in the times?
> > > > > > >
> > > > > > i pulled a counter from the logs of the function calls, maybe one of
> > > > > > these looks off (this is just the ones over 100k), please excuse typos
> > > > > >
> > > > > > $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) | sort | uniq -c | sort -n
> > > > > > 105034 lov_io_init
> > > > > > 105034 vvp_io_init
> > > > > > 105035 lov_io_iter_init
> > > > > > 105035 lov_stripe_intersects
> > > > > > 105035 osc_cache_writeback_range
> > > > > > 105043 vvp_io_fini
> > > > > > 105050 lov_conf_freeze
> > > > > > 105050 lov_conf_thaw
> > > > > > 294806 osc_attr_update
> > > > > > 294806 osc_page_touch_at
> > > > > > 294814 osc_consume_write_grant
> > > > > > 294815 lov_attr_get_composite
> > > > > > 294816 osc_enter_cache_try
> > > > > > 351044 ll_write_end
> > > > > > 589549 osc_queue_async_io
> > > > > > 589617 lov_merge_lvb_kms
> > > > >
> > > > > that's not really going to do anything useful, there's a timestamp in
> > > > > unix time as the fourth field (separated with colons), see if there
> > > > > are gaps there.
> > > > > I imagine there's going to be real dense (time-wise) activity, then an
> > > > > RPC is prepared and sent (Sending RPC ....), and then a lot sparser
> > > > > activity (perhaps with multi-second pauses), and then eventually it'll
> > > > > pick up after getting a server response, for example?
> > > > > Though none of that explains why lctl would hang I guess, but still
> > > >
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
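as a follow-up to oleg's note above about the fourth colon-separated field being a unix timestamp, a rough, untested awk sketch for spotting multi-second gaps in a dumped debug log. the 5 second threshold and the lustre_debug.log filename are just placeholders, and it assumes one debug entry per line in the dump:

  # field 4 of each debug line is the unix timestamp (e.g. 1721247906.483639);
  # skip lines where it doesn't look like one, and report jumps bigger than 5s
  awk -F: '$4 ~ /^[0-9]+\.[0-9]+$/ {
             if (prev != "" && $4 - prev > 5)
               printf "gap of %.1fs before ts %s\n", $4 - prev, $4;
             prev = $4
           }' lustre_debug.log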