Re: [OMPI users] Deadlock in netcdf tests
Orion,

It might be a good idea. This bug is triggered from the fcoll/two_phase
component (and having spent just two minutes looking at it, I have a
suspicion about what triggers it, namely an int vs. long conversion issue;
a small illustrative sketch is appended at the end of this message), so it
is probably unrelated to the other one. I need to add running the netcdf
test cases to my list of standard testsuites, but we didn't use to have any
problems with them :-( Thanks for the report, we will be working on them!

Edgar

> -Original Message-
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Orion
> Poplawski via users
> Sent: Friday, October 25, 2019 10:21 PM
> To: Open MPI Users
> Cc: Orion Poplawski
> Subject: Re: [OMPI users] Deadlock in netcdf tests
>
> Thanks for the response, the workaround helps.
>
> With that out of the way I see:
>
> + mpiexec -n 4 ./tst_parallel4
> Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >= num_aggregators(1)fd_size=461172966257152 off=4156705856
> Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >= num_aggregators(1)fd_size=4611731477435006976 off=4157193280
>
> Should I file issues for both of these?
>
> On 10/25/19 2:29 AM, Gilles Gouaillardet via users wrote:
> > Orion,
> >
> > thanks for the report.
> >
> > I can confirm this is indeed an Open MPI bug.
> >
> > FWIW, a workaround is to disable the fcoll/vulcan component.
> >
> > That can be achieved by
> >
> > mpirun --mca fcoll ^vulcan ...
> >
> > or
> >
> > OMPI_MCA_fcoll=^vulcan mpirun ...
> >
> > I also noted the tst_parallel3 program crashes with the ROMIO component.
> >
> > Cheers,
> >
> > Gilles
> >
> > On 10/25/2019 12:55 PM, Orion Poplawski via users wrote:
> >> On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
> >>> Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are
> >>> seeing a test hang with openmpi 4.0.2.
> >>> Backtrace:
> >>>
> >>> (gdb) bt
> >>> #0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
> >>> #1  0x7f90c1ac8a05 in ompi_request_default_wait () from
> >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
> >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #3  0x7f90c1b2bb73 in ompi_coll_base_allreduce_intra_recursivedoubling () from
> >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
> >>>     /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
> >>> #5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from
> >>>     /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
> >>> #6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from
> >>>     /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
> >>> #7  0x7f90c1af033f in PMPI_File_write_at_all () from
> >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> >>> #8  0x7f90c1627d7b in H5FD_mpio_write () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #9  0x7f90c14636ee in H5FD_write () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #10 0x7f90c1442eb3 in H5F__accum_write () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #11 0x7f90c1543729 in H5PB_write () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #12 0x7f90c144d69c in H5F_block_write () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #13 0x7f90c161cd10 in H5C_apply_candidate_list () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #14 0x7f90c161ad02 in H5AC__run_sync_point () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #15 0x7f90c161bd4f in H5AC__flush_entries () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #16 0x7f90c13b154d in H5AC_flush () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #18 0x7f90c1448e64 in H5F__flush () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #20 0x7f90c144f171 in H5F_flush_mounts () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #21 0x7f90c143e3a5 in H5Fflush () from
> >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> >>> #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
> >>>     ../../libhdf5/hdf5file.c:222
> >>> #23 0x7f90c1c1816e in NC4_enddef (ncid=<optimized out>) at
> >>>     ../../libhdf5/hdf5file.c:544
> >>> #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at
> >>>     ../../libdispatch/dfile.c:1004
> >>> #25 0x56527d0def27 in test_pio (flag=0) at
> >>>     ../../nc_test4/tst_parallel3.c:206
> >>> #26 0x56527d0de62c in main (argc=<optimized out>, argv=<optimized
> >>>     out>) at ../../nc_test4/tst_parallel3.c:91
> >>>
> >>> The processes are running full out.
> >>>
> >>> Suggestions for debugging this would be greatly appreciated.
> >>
> >> Some more info - I think now it is more dependent on the openmpi
> >> version than on netcdf itself:
> >>
> >> - last successful build was with netcdf 4.7
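
The sketch mentioned above, to illustrate the int vs. long suspicion: this
is plain C and is NOT the Open MPI fcoll code. The offset is simply taken
from the error output quoted above, and the 64 MiB per-aggregator chunk
size is a made-up illustrative value.

/* Illustrative only: NOT the Open MPI fcoll code.  Shows how an int vs.
 * long (MPI_Offset) mix-up can turn a large file offset into a negative
 * aggregator index.  The offset comes from the error output above; the
 * 64 MiB "file domain" chunk size is an arbitrary, made-up value. */
#include <stdio.h>

int main(void)
{
    long long off = 4157193280LL;   /* a valid file offset, but > 2^31   */
    int off_as_int = (int) off;     /* silent truncation; on common
                                       two's-complement platforms this
                                       wraps to a negative value         */
    int fd_chunk = 64 * 1024 * 1024;

    printf("off        = %lld\n", off);
    printf("off_as_int = %d\n", off_as_int);
    printf("rank_index = %d (correct 64-bit math: %lld)\n",
           off_as_int / fd_chunk, off / (long long) fd_chunk);
    return 0;
}

With these particular (arbitrary) numbers the truncated arithmetic yields a
rank index of -2 while the 64-bit arithmetic yields 61; if the suspicion is
right, the fix would be to keep the offset arithmetic in a 64-bit type
(MPI_Offset / long) throughout.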
Re: [OMPI users] Deadlock in netcdf tests
Okay, I've filed:

https://github.com/open-mpi/ompi/issues/7109 - deadlock
https://github.com/open-mpi/ompi/issues/7110 - ompio error

I've found the hdf5 and netcdf testsuites quite adept at finding issues
with openmpi over the years.

Thanks again for the help.

On 10/26/19 6:01 AM, Gabriel, Edgar wrote:
> Orion,
>
> It might be a good idea. This bug is triggered from the fcoll/two_phase
> component (and having spent just two minutes looking at it, I have a
> suspicion about what triggers it, namely an int vs. long conversion
> issue), so it is probably unrelated to the other one. I need to add
> running the netcdf test cases to my list of standard testsuites, but we
> didn't use to have any problems with them :-( Thanks for the report, we
> will be working on them!
>
> Edgar
>
> > -Original Message-
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Orion
> > Poplawski via users
> > Sent: Friday, October 25, 2019 10:21 PM
> > To: Open MPI Users
> > Cc: Orion Poplawski
> > Subject: Re: [OMPI users] Deadlock in netcdf tests
> >
> > Thanks for the response, the workaround helps.
> >
> > With that out of the way I see:
> >
> > + mpiexec -n 4 ./tst_parallel4
> > Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >= num_aggregators(1)fd_size=461172966257152 off=4156705856
> > Error in ompi_io_ompio_calcl_aggregator():rank_index(-2) >= num_aggregators(1)fd_size=4611731477435006976 off=4157193280
> >
> > Should I file issues for both of these?
> >
> > On 10/25/19 2:29 AM, Gilles Gouaillardet via users wrote:
> > > Orion,
> > >
> > > thanks for the report.
> > >
> > > I can confirm this is indeed an Open MPI bug.
> > >
> > > FWIW, a workaround is to disable the fcoll/vulcan component.
> > >
> > > That can be achieved by
> > >
> > > mpirun --mca fcoll ^vulcan ...
> > >
> > > or
> > >
> > > OMPI_MCA_fcoll=^vulcan mpirun ...
> > >
> > > I also noted the tst_parallel3 program crashes with the ROMIO component.
> > >
> > > Cheers,
> > >
> > > Gilles
> > >
> > > On 10/25/2019 12:55 PM, Orion Poplawski via users wrote:
> > >> On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
> > >>> Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are
> > >>> seeing a test hang with openmpi 4.0.2.
> > >>> Backtrace:
> > >>>
> > >>> (gdb) bt
> > >>> #0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
> > >>> #1  0x7f90c1ac8a05 in ompi_request_default_wait () from
> > >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> > >>> #2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
> > >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> > >>> #3  0x7f90c1b2bb73 in ompi_coll_base_allreduce_intra_recursivedoubling () from
> > >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> > >>> #4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
> > >>>     /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
> > >>> #5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from
> > >>>     /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
> > >>> #6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from
> > >>>     /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
> > >>> #7  0x7f90c1af033f in PMPI_File_write_at_all () from
> > >>>     /usr/lib64/openmpi/lib/libmpi.so.40
> > >>> #8  0x7f90c1627d7b in H5FD_mpio_write () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #9  0x7f90c14636ee in H5FD_write () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #10 0x7f90c1442eb3 in H5F__accum_write () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #11 0x7f90c1543729 in H5PB_write () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #12 0x7f90c144d69c in H5F_block_write () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #13 0x7f90c161cd10 in H5C_apply_candidate_list () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #14 0x7f90c161ad02 in H5AC__run_sync_point () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #15 0x7f90c161bd4f in H5AC__flush_entries () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #16 0x7f90c13b154d in H5AC_flush () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #18 0x7f90c1448e64 in H5F__flush () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #20 0x7f90c144f171 in H5F_flush_mounts () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #21 0x7f90c143e3a5 in H5Fflush () from
> > >>>     /usr/lib64/openmpi/lib/libhdf5.so.103
> > >>> #22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
> > >>>     ../../libhdf5/hdf5file.c:222
> > >>> #23 0x7f90c1c1816e in NC4_enddef (ncid=<optimized out>) at
> > >>>     ../../libhdf5/hdf5file.c:544
> > >>> #24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at
> > >>>     ../../libdispatch/dfile.c:1004
> > >>> #25 0x56527d0def27 in test_pio (flag=0) at
> > >>>     ../../nc_test4/tst_parallel3.c:206
> > >>> #26 0x56527d0de62c in main (argc=<optimized out>, argv=<optimized
> > >>>     out>) at ../../nc_test4/tst_parallel3.c:91
> > >>>
> > >>> The processes are running full out.
> > >>>
> > >>> Suggestions for debugging this would be greatly appreciated.
> > >>
> > >> Some more info - I think now it is more dependent on the openmpi
> > >> version than on netcdf itself:
> > >>
> > >> - last successful build was with netcdf 4.7.0, openmpi 4.0.1,
> > >>   ucx 1.5.2, pmix-3.1.4. Possible start of the failure was with
> > >>   openmpi 4.0.2-rc1 and ucx 1.6.0.
> > >> - netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2,
> > >>   ucx 1.6.1, pmix 3.1.4
> > >> - net
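
For anyone trying to narrow this down further, below is a minimal
standalone MPI-IO program. It is illustrative only, not the netcdf/hdf5
test itself, and the file name and buffer sizes are arbitrary; it just
exercises the same MPI_File_write_at_all collective path that appears in
the backtrace, so a hang or the aggregator error can be checked without
HDF5/netcdf in the picture.

/* Minimal MPI-IO sketch (not the netcdf/hdf5 test): each rank writes one
 * contiguous block with the same collective call, MPI_File_write_at_all,
 * that shows up in the backtrace above.  File name and sizes are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1 << 20;                 /* 1 Mi ints (4 MiB) per rank */
    int *buf = malloc(count * sizeof(int));
    for (int i = 0; i < count; i++)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "ompio_test.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write at a per-rank offset, similar to what HDF5 issues
     * through H5FD_mpio_write in the trace above. */
    MPI_Offset offset = (MPI_Offset) rank * count * (MPI_Offset) sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    if (rank == 0)
        printf("collective write done\n");
    MPI_Finalize();
    return 0;
}

It can be built with mpicc and run as, e.g., mpiexec -n 4 ./a.out, with and
without the workaround mentioned earlier (mpirun --mca fcoll ^vulcan ...)
to compare behavior with and without the vulcan component.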