is pytorch using HDF5?  i wasn't aware that it did.  we are using
'flock' on all the client mounts
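
for reference, this is roughly what that looks like on our side (the mgs
nid, fsname, and mount point below are just placeholders; 'flock' should
show up in the options in /proc/mounts):

mount -t lustre -o flock mgs@o2ib:/snx11255 /lustre
grep lustre /proc/mounts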

On Wed, Jul 24, 2024 at 10:26 PM Feng Zhang <prod.f...@gmail.com> wrote:
>
> Not sure if it is related, but I dug this out from my notes:
>
> "Looks like the Lustre storage does not support the HDF5 lock. The
> local hard drive($TMPDIR), or NFS works fine. If you need to read HDF5
> files on /lustre, just add:
>
> export HDF5_USE_FILE_LOCKING=FALSE
> "
> (Older versions of netcdf4 and hdf5 work fine; hdf5 1.10 and later have
> this issue.) Since you are using pytorch, it might be related.
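>
> For pytorch/h5py jobs, exporting it in the job script before the process
> starts should be enough; a rough example (the script name and data path
> below are just placeholders):
>
> export HDF5_USE_FILE_LOCKING=FALSE
> python train.py --data /lustre/dataset.h5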
>
> The Lustre client mount also supports a flag for file locking... You will
> need to specify it explicitly when mounting (I don't remember which
> one).
>
> Best,
>
> Feng
>
> On Wed, Jul 24, 2024 at 3:53 PM Feng Zhang <prod.f...@gmail.com> wrote:
> >
> > I have a vague memory from many years ago of an issue with file
> > locking on lustre that caused a python app to get stuck (or fail).
> >
> > Best,
> >
> > Feng
> >
> > On Wed, Jul 24, 2024 at 2:40 PM Michael DiDomenico via lustre-discuss
> > <lustre-discuss@lists.lustre.org> wrote:
> > >
> > > for what it's worth, here's the chunk of debug output that seemed
> > > suspect.  not sure if it helps anything
> > >
> > > 00000008:00000020:30.0:1721247906.483639:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant())
> > > using 4096 grant credits for brw 000000005265749c page
> > > 00000000a0744e3b
> > > 00000008:00000020:30.0:1721247906.483640:0:285997:0:(osc_cache.c:2388:osc_queue_async_io())
> > > extent 0000000064254af2@{[1466368 -> 1470384/1470463],
> > > [2|1|-|active|wi|00000000005ec049],
> > > [16478208|4016|+|-|000000000ffe1304|4096|0000000000000000]} expanded
> > > for 1470384.
> > > 00000008:00000002:30.0:1721247906.483641:0:285997:0:(osc_io.c:278:osc_page_touch_at())
> > > stripe KMS increasing 6022692864->6022696960 6022692864
> > > 00000008:00000020:30.0:1721247906.483642:0:285997:0:(osc_object.c:182:osc_attr_update())
> > > set kms from 6022692864to 6022696960
> > > 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms())
> > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808
> > > a=9223372036854775808 c=9223372036854775808 b=0
> > > 00020000:00000002:30.0:1721247906.483643:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms())
> > > MDT ID 0x21:1073784952 on OST[15]: s=6022696960 m=1721247906 a=0
> > > c=1721247906 b=11730952
> > > 00020000:00000002:30.0:1721247906.483644:0:285997:0:(lov_object.c:1050:lov_attr_get_composite())
> > > COMP ID #-1: s=6022696960 m=1721247906 a=0 c=1721247906 b=11730952
> > > 00000008:00000020:30.0:1721247906.483645:0:285997:0:(osc_cache.c:2355:osc_queue_async_io())
> > > obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 000000001625d339
> > > page 000000002b0d2b57 added for cmd 2
> > > 00000008:00000020:30.0:1721247906.483646:0:285997:0:(osc_cache.c:1510:osc_enter_cache_try())
> > > snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4017/512000
> > > dirty_pages: 12055/66011400 dropped: 0 avail: 1863200767, dirty_grant:
> > > 16478208, reserved: 0, flight: 0 } lru {in
> > > list: 1679370, left: 4019, waiters: 0 }need:28672
> > > 00000008:00000020:30.0:1721247906.483647:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant())
> > > using 4096 grant credits for brw 000000004670e8ec page
> > > 000000002b0d2b57
> > > 00000008:00000020:30.0:1721247906.483648:0:285997:0:(osc_cache.c:2388:osc_queue_async_io())
> > > extent 0000000064254af2@{[1466368 -> 1470385/1470463],
> > > [2|1|-|active|wi|00000000005ec049],
> > > [16482304|4017|+|-|000000000ffe1304|4096|0000000000000000]} expanded
> > > for 1470385.
> > > 00000008:00000002:30.0:1721247906.483650:0:285997:0:(osc_io.c:278:osc_page_touch_at())
> > > stripe KMS increasing 6022696960->6022701056 6022696960
> > > 00000008:00000020:30.0:1721247906.483650:0:285997:0:(osc_object.c:182:osc_attr_update())
> > > set kms from 6022696960to 6022701056
> > > 00020000:00000002:30.0:1721247906.483651:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms())
> > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808
> > > a=9223372036854775808 c=9223372036854775808 b=0
> > > 00020000:00000002:30.0:1721247906.483652:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms())
> > > MDT ID 0x21:1073784952 on OST[15]: s=6022701056 m=1721247906 a=0
> > > c=1721247906 b=11730952
> > > 00020000:00000002:30.0:1721247906.483653:0:285997:0:(lov_object.c:1050:lov_attr_get_composite())
> > > COMP ID #-1: s=6022701056 m=1721247906 a=0 c=1721247906 b=11730952
> > > 00000008:00000020:30.0:1721247906.483654:0:285997:0:(osc_cache.c:2355:osc_queue_async_io())
> > > obj 00000000005ec049 ready 0|-|- wr 0|-|- rd 0|- oap 0000000044245bf2
> > > page 00000000ff11bc07 added for cmd 2
> > > 00000008:00000020:30.0:1721247906.483655:0:285997:0:(osc_cache.c:1510:osc_enter_cache_try())
> > > snx11255-OST000f-osc-ffff891b019c8000: grant { dirty: 4018/512000
> > > dirty_pages: 12056/66011400 dropped: 0 avail: 1863196671, dirty_grant:
> > > 16482304, reserved: 0, flight: 0 } lru {in
> > > list: 1679370, left: 4019, waiters: 0 }need:28672
> > >
> > > --- this part above repeats for a while
> > >
> > > 00000008:00000020:30.0:1721247906.483656:0:285997:0:(osc_cache.c:1374:osc_consume_write_grant())
> > > using 4096 grant credits for brw 000000006d9a7b11 page
> > > 00000000ff11bc07
> > > 00000008:00000020:30.0:1721247906.483657:0:285997:0:(osc_cache.c:2388:osc_queue_async_io())
> > > extent 0000000064254af2@{[1466368 -> 1470386/1470463],
> > > [2|1|-|active|wi|00000000005ec049],
> > > [16486400|4018|+|-|000000000ffe1304|4096|0000000000000000]} expanded
> > > for 1470386.
> > > 00000008:00000002:30.0:1721247906.483658:0:285997:0:(osc_io.c:278:osc_page_touch_at())
> > > stripe KMS increasing 6022701056->6022701376 6022701056
> > > 00000008:00000020:30.0:1721247906.483659:0:285997:0:(osc_object.c:182:osc_attr_update())
> > > set kms from 6022701056to 6022701376
> > > 00000008:00000020:30.0:1721247906.483660:0:285997:0:(osc_request.c:763:osc_update_next_shrink())
> > > next time 608550 to shrink grant
> > > 00000080:00200000:30.0:1721247906.483664:0:285997:0:(vvp_io.c:1200:vvp_io_write_commit())
> > > Committed 14993 pages 61407552 bytes, tot: 396953728
> > > 00020000:00000002:30.0:1721247906.483665:0:285997:0:(lov_merge.c:60:lov_merge_lvb_kms())
> > > MDT ID 0x21:1073784952 initial value: s=0 m=9223372036854775808
> > > a=9223372036854775808 c=9223372036854775808 b=0
> > > 00020000:00000002:30.0:1721247906.483666:0:285997:0:(lov_merge.c:94:lov_merge_lvb_kms())
> > > MDT ID 0x21:1073784952 on OST[15]: s=6022701376 m=1721247906 a=0
> > > c=1721247906 b=11730952
> > > 00020000:00000002:30.0:1721247906.483667:0:285997:0:(lov_object.c:1050:lov_attr_get_composite())
> > > COMP ID #-1: s=6022701376 m=1721247906 a=0 c=1721247906 b=11730952
> > > 00000080:00200000:30.0:1721247906.483668:0:285997:0:(file.c:1490:ll_merge_attr())
> > > [0x24000a878:0x21:0x0] updating i_size 6022701376
> > > 00000080:00200000:30.0:1721247906.483671:0:285997:0:(vvp_io.c:1364:vvp_io_write_start())
> > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: write nob 0, result:
> > > 396953728
> > > 00000008:00000020:30.0:1721247906.483674:0:285997:0:(osc_cache.c:1709:osc_update_pending())
> > > obj 00000000005ec049 ready 0|-|- wr 4019|-|- rd 0|- update pending cmd
> > > 2 delta 4019.
> > > 00000080:00200000:30.0:1721247906.483687:0:285997:0:(vvp_io.c:537:vvp_io_advance())
> > > advancing 396953728 bytes
> > > 00000080:00200000:30.0:1721247906.483690:0:285997:0:(file.c:1768:ll_file_io_generic())
> > > Range unlock [1373473, 1470386]
> > > 00000080:00200000:30.0:1721247906.483693:0:285997:0:(vvp_io.c:307:vvp_io_fini())
> > > [0x24000a878:0x21:0x0] ignore/verify layout 0/0, layout version 0 need
> > > write layout 0, restore needed 0
> > > 00000080:00200000:30.0:1721247906.483694:0:285997:0:(file.c:1809:ll_file_io_generic())
> > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt: 2 io complete with rc:
> > > 0, result: 2007566464, restart: 0
> > > 00000080:00200000:30.0:1721247906.483697:0:285997:0:(file.c:1874:ll_file_io_generic())
> > > iot: 2, result: 2007566464
> > > 00000080:00200008:30.0:1721247906.487347:0:285997:0:(file.c:2167:ll_file_write_iter())
> > > file 
> > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt:[0x24000a878:0x21:0x0],
> > > ppos: 6022701376, count: 1200
> > > 00000080:00200000:30.0:1721247906.487354:0:285997:0:(rw26.c:731:ll_write_begin())
> > > Writing 1470386 of 320 to 1200 bytes
> > > 00000080:00200000:30.0:1721247906.487358:0:285997:0:(rw26.c:917:ll_write_end())
> > > pos 6022701376, len 1200, copied 1200
> > > 00000008:00000002:30.0:1721247906.487364:0:285997:0:(osc_io.c:278:osc_page_touch_at())
> > > stripe KMS increasing 6022701376->6022702576 6022701376
> > > 00000008:00000020:30.0:1721247906.487366:0:285997:0:(osc_object.c:182:osc_attr_update())
> > > set kms from 6022701376to 6022702576
> > > 00000080:00200000:30.0:1721247906.487370:0:285997:0:(file.c:2146:ll_do_tiny_write())
> > > result: 1200, original count 1200
> > > 00020000:00000002:30.0:1721247906.487378:0:285997:0:(lov_object.c:1182:lov_conf_freeze())
> > > To take share lov(00000000e2618e16) owner
> > > 0000000000000000/0000000065ef6d4c
> > > 00020000:00000002:30.0:1721247906.487380:0:285997:0:(lov_object.c:1190:lov_conf_thaw())
> > > To release share lov(00000000e2618e16) owner
> > > 0000000000000000/0000000065ef6d4c
> > > 00000080:00200000:30.0:1721247906.487384:0:285997:0:(file.c:405:ll_file_release())
> > > VFS Op:inode=[0x24000a878:0x21:0x0](000000005f8eb1b5)
> > > 00020000:00000002:30.0:1721247906.487385:0:285997:0:(lov_object.c:1182:lov_conf_freeze())
> > > To take share lov(00000000e2618e16) owner
> > > 0000000000000000/0000000065ef6d4c
> > > 00020000:00000002:30.0:1721247906.487386:0:285997:0:(lov_object.c:1190:lov_conf_thaw())
> > > To release share lov(00000000e2618e16) owner
> > > 0000000000000000/0000000065ef6d4c
> > > 00800000:00000002:30.0:1721247906.487389:0:285997:0:(lmv_obd.c:3572:lmv_lock_match())
> > > Lock match for [0x24000a878:0x21:0x0]
> > > 00800000:00000002:30.0:1721247906.487393:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup())
> > > FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0]
> > > 00800000:00000002:30.0:1721247906.487408:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup())
> > > FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0]
> > > 00800000:00000002:30.0:1721247906.487409:0:285997:0:(lmv_obd.c:1457:lmv_close())
> > > CLOSE [0x24000a878:0x21:0x0]
> > > 00000002:00000002:30.0:1721247906.487411:0:285997:0:(mdc_request.c:902:mdc_close())
> > > snx11255-MDT0001-mdc-ffff891b019c8000: [0x24000a878:0x21:0x0] file
> > > closed with intent: 200
> > > 00000002:00100000:30.0:1721247906.487418:0:285997:0:(mdc_request.c:944:mdc_close())
> > > @@@ matched open  req@00000000da251270
> > > x1804223383477312/t99088005029(99088005029)
> > > o101->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:12/10
> > > lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RPQU/4/ffffffff
> > > rc 0/-1 job:''
> > > 00000020:00100000:30.0:1721247906.487431:0:285997:0:(genops.c:2299:obd_get_mod_rpc_slot())
> > > snx11255-MDT0001-mdc-ffff891b019c8000: modify RPC slot 1 is allocated
> > > opc 35, max 7
> > > 00000100:00100000:30.0:1721247906.487434:0:285997:0:(client.c:742:ptlrpc_reassign_next_xid())
> > > @@@ reassign xid  req@000000007e3dc9a5 x1804223383853824/t0(0)
> > > o35->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:23/10
> > > lens 392/1104 e 0 to 0 dl 0 ref 1 fl New:QU/0/ffffffff rc 0/-1 job:''
> > > 00000100:00100000:30.0:1721247906.487443:0:285997:0:(client.c:1727:ptlrpc_send_new_req())
> > > Sending RPC req@000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job
> > > pt_main_thread:409f80a0-cdfe-466d-9a49-014ed62a4391:285997:1804223383853824:172.18.20.21@o2ib100:35:
> > > 00000100:00100000:30.0:1721247906.487486:0:285997:0:(client.c:2491:ptlrpc_set_wait())
> > > set 000000000180e080 going to sleep for 11 seconds
> > > 00000100:00100000:30.0:1721247906.487761:0:285997:0:(client.c:2207:ptlrpc_check_set())
> > > Completed RPC req@000000007e3dc9a5 pname:cluuid:pid:xid:nid:opc:job
> > > pt_main_thread:409f80a0-cdfe-466d-9a49-014ed62a4391:285997:1804223383853824:172.18.20.21@o2ib100:35:
> > > 00800000:00000002:30.0:1721247906.487766:0:285997:0:(lmv_fld.c:77:lmv_fld_lookup())
> > > FLD lookup got mds #1 for fid=[0x24000a878:0x21:0x0]
> > > 00000002:00100000:30.0:1721247906.487768:0:285997:0:(mdc_request.c:852:mdc_free_open())
> > > @@@ free open request, rq_replay=0  req@00000000da251270
> > > x1804223383477312/t99088005029(99088005029)
> > > o101->snx11255-MDT0001-mdc-ffff891b019c8000@172.18.20.21@o2ib100:12/10
> > > lens 616/656 e 0 to 0 dl 1721247862 ref 1 fl Complete:RQU/4
> > > /ffffffff rc 0/-1 job:''
> > >
> > > --- this part repeats with no other output
> > >
> > > 00000080:00200000:8.0:1721247906.528444:0:72:0:(vvp_io.c:1788:vvp_io_init())
> > > [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0
> > > restore needed 0
> > > 00020000:00000002:8.0:1721247906.528446:0:72:0:(lov_object.c:1486:lov_io_init())
> > > [0x24000a876:0x26:0x0]io 000000001dcefeeb type 6 ignore/verify layout
> > > 1/0
> > > 00020000:00000002:8.0:1721247906.528447:0:72:0:(lov_object.c:1182:lov_conf_freeze())
> > > To take share lov(0000000018a6920a) owner
> > > 0000000000000000/0000000096d0da59
> > > 00020000:00000002:8.0:1721247906.528448:0:72:0:(lov_object.c:1190:lov_conf_thaw())
> > > To release share lov(0000000018a6920a) owner
> > > 0000000000000000/0000000096d0da59
> > > 00020000:00200000:8.0:1721247906.528449:0:72:0:(lov_io.c:824:lov_io_iter_init())
> > > component[0] flags 0x10
> > > 00020000:00000002:8.0:1721247906.528450:0:72:0:(lov_offset.c:266:lov_stripe_intersects())
> > > [0->-1] -> [(0) 0->-1 (0)]
> > > 00000008:00000020:8.0:1721247906.528452:0:72:0:(osc_cache.c:3049:osc_cache_writeback_range())
> > > obj 00000000c9987be6 ready 0|-|- wr 0|-|- rd 0|- pageout [0,
> > > 18446744073709551615], 0.
> > > 00000080:00200000:8.0:1721247906.528454:0:72:0:(vvp_io.c:307:vvp_io_fini())
> > > [0x24000a876:0x26:0x0] ignore/verify layout 1/0, layout version 0 need
> > > write layout 0, restore needed 0
> > > 00000080:00200000:8.0:1721247906.528455:0:72:0:(vvp_io.c:1788:vvp_io_init())
> > > [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0
> > > restore needed 0
> > > 00020000:00000002:8.0:1721247906.528456:0:72:0:(lov_object.c:1486:lov_io_init())
> > > [0x24000a876:0x2b:0x0]io 000000001dcefeeb type 6 ignore/verify layout
> > > 1/0
> > > 00020000:00000002:8.0:1721247906.528457:0:72:0:(lov_object.c:1182:lov_conf_freeze())
> > > To take share lov(0000000020610153) owner
> > > 0000000000000000/0000000096d0da59
> > > 00020000:00000002:8.0:1721247906.528458:0:72:0:(lov_object.c:1190:lov_conf_thaw())
> > > To release share lov(0000000020610153) owner
> > > 0000000000000000/0000000096d0da59
> > > 00020000:00200000:8.0:1721247906.528458:0:72:0:(lov_io.c:824:lov_io_iter_init())
> > > component[0] flags 0x10
> > > 00020000:00000002:8.0:1721247906.528459:0:72:0:(lov_offset.c:266:lov_stripe_intersects())
> > > [0->-1] -> [(0) 0->-1 (0)]
> > > 00000008:00000020:8.0:1721247906.528460:0:72:0:(osc_cache.c:3049:osc_cache_writeback_range())
> > > obj 000000005da60c8b ready 0|-|- wr 0|-|- rd 0|- pageout [0,
> > > 18446744073709551615], 0.
> > > 00000080:00200000:8.0:1721247906.528462:0:72:0:(vvp_io.c:307:vvp_io_fini())
> > > [0x24000a876:0x2b:0x0] ignore/verify layout 1/0, layout version 0 need
> > > write layout 0, restore needed 0
> > > 00000080:00200000:8.0:1721247906.528463:0:72:0:(vvp_io.c:1788:vvp_io_init())
> > > [0x24000a876:0x29:0x0] ignore/verify layout 1/0, layout version 0
> > > restore needed 0
> > > 00020000:00000002:8.0:1721247906.528464:0:72:0:(lov_object.c:1486:lov_io_init())
> > > [0x24000a876:0x29:0x0]io 000000001dcefeeb type 6 ignore/verify layout
> > > 1/0
> > > 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1182:lov_conf_freeze())
> > > To take share lov(00000000c3ae5165) owner
> > > 0000000000000000/0000000096d0da59
> > > 00020000:00000002:8.0:1721247906.528465:0:72:0:(lov_object.c:1190:lov_conf_thaw())
> > > To release share lov(00000000c3ae5165) owner
> > > 0000000000000000/0000000096d0da59
> > >
> > > On Fri, Jul 19, 2024 at 8:58 AM Michael DiDomenico
> > > <mdidomeni...@gmail.com> wrote:
> > > >
> > > > interestingly enough i ran
> > > >
> > > > cut -f4 -d: lustre_debug* | cut -f1 -d. | sort | uniq -c
> > > >
> > > > and while the timeline doesn't show a big gap, the number of records
> > > > per time interval does show one.  i can clearly see the slope of the
> > > > job and exactly where it seems to have fallen off
> > > >
> > > > i might be able to get a small chunk of the log exported, but i'll
> > > > paraphrase it a little below
> > > >
> > > > starting with many entries that repeat
> > > >
> > > > using 4096 grant credits for brw
> > > > osc grant
> > > > stripe kms increasing
> > > > set kms from
> > > > mdt id initial value
> > > > mdt id on ost
> > > > comp id #-1
> > > > obj ready
> > > > osc grant
> > > >
> > > > then it stops and this comes out
> > > >
> > > > ll_merge_attr updating i_size
> > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
> > > > obj ready update pending cmd
> > > > advancing bytes
> > > > range unlock
> > > > vvp_io_fini()
> > > > bf16_zero_pp_rank_7_mp_rank_01_optim_states.pt
> > > > writing x of x to x bytes
> > > > pos len copied
> > > > stripe kms increasing
> > > > set kms from
> > > > ll_do_tiny_write()
> > > > to take share lov owner
> > > > to release share lov owner
> > > > vfs op:inode
> > > > to take share lov
> > > > to release share lov
> > > > lock match for
> > > > FLD lookup got mds 1
> > > > fld lookup got mds
> > > > CLOSE
> > > > mdc file closed with intent 200
> > > > @@@ matched open mdc
> > > > mdc modify rpc slot 1
> > > > @@@ reassign xid
> > > > sending RPC
> > > > set going to sleep for 11 seconds
> > > > completed rpc
> > > > fld lookup got mds 1
> > > > @@@ free open request
> > > >
> > > > after the last command, it then spirals out repeating until the end of 
> > > > the log
> > > >
> > > > vvp_io_init
> > > > lov_io_init
> > > > lov_conf_freeze to take share
> > > > lov_conf_thaw to release share
> > > >
> > > > On Thu, Jul 18, 2024 at 3:24 PM Oleg Drokin <gr...@whamcloud.com> wrote:
> > > > >
> > > > > On Thu, 2024-07-18 at 13:26 -0400, Michael DiDomenico wrote:
> > > > > > there's nothing 'grep -i evict' in the lustre_debug logs or from the
> > > > > > storage console logs
> > > > >
> > > > > hm, ok.
> > > > >
> > > > > > > perhaps see continuity of the timestamps and what happened right
> > > > > > > before
> > > > > > > and right after the gap if there is one in the times?
> > > > > > >
> > > > > >
> > > > > > i pulled a count of the function calls from the logs, maybe one of
> > > > > > these looks off (this is just the ones over 100k),  please excuse
> > > > > > typos
> > > > > >
> > > > > > $ grep -vh "^$" lustre_debug*.log | cut -f10 -d: | cut -f1 -d\) |
> > > > > > sort | uniq -c | sort -n
> > > > > > 105034 lov_io_init
> > > > > > 105034 vvp_io_init
> > > > > > 105035 lov_io_iter_init
> > > > > > 105035 lov_stripe_intersects
> > > > > > 105035 osc_cache_writeback_range
> > > > > > 105043 vvp_io_fini
> > > > > > 105050 lov_conf_freeze
> > > > > > 105050 lov_conf_thaw
> > > > > > 294806 osc_attr_update
> > > > > > 294806 osc_page_touch_at
> > > > > > 294814 osc_consume_write_grant
> > > > > > 294815 lov_attr_get_composite
> > > > > > 294816 osc_enter_cache_try
> > > > > > 351044 ll_write_end
> > > > > > 589549 osc_queue_async_io
> > > > > > 589617 lov_merge_lvb_kms
> > > > >
> > > > > that's not really going to do anything useful, there's a timestamp in
> > > > > unix time as the fourth field (separated with colons), see if there
> > > > > are gaps there.
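> > > > > something like this rough awk sketch would flag the pauses (the file
> > > > > glob and the 1-second threshold are just examples):
> > > > >
> > > > > awk -F: '$4 ~ /^[0-9]+\.[0-9]+$/ { if (prev != "" && $4 - prev > 1)
> > > > >   printf "%.3fs gap before: %s\n", $4 - prev, $0; prev = $4 }' lustre_debug*.log
> > > > >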
> > > > > I imagine there's going to be real dense (time-wise) activity, then an
> > > > > RPC is prepared and sent (Sending RPC ....), and then a lot sparser
> > > > > activity (perhaps with multi-second pauses), and then eventually it'll
> > > > > pick up after getting a server response, for example?
> > > > > Though none of that explains why lctl would hang I guess, but still
> > > > >
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
