Never mind, I see it in the backtrace :-)
Will look into it, but am currently traveling. Until then, Gilles' suggestion is
probably the right approach.
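
In the meantime, for anyone who wants to poke at this: below is a minimal,
untested sketch (my own illustration, not necessarily what Gilles proposed)
that exercises the same MPI_File_write_at_all path shown in frame #7 of the
backtrace. Running it with a different fcoll component selected, e.g.
"mpirun --mca fcoll dynamic -np 4 ./a.out", may help narrow down whether the
vulcan component is implicated (the component names here are assumptions
taken from the backtrace, which shows mca_fcoll_vulcan_file_write_all).

/* Minimal sketch: each rank does one collective write at a rank-dependent
 * offset through the MPI-IO path seen in the backtrace. Compile with mpicc. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    char buf[1024];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fill the buffer with a rank-dependent character so the output file
     * shows which rank wrote which block. */
    memset(buf, 'a' + (rank % 26), sizeof(buf));

    MPI_File_open(MPI_COMM_WORLD, "testfile.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* The collective call from frame #7 of the backtrace. */
    offset = (MPI_Offset)rank * (MPI_Offset)sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, (int)sizeof(buf), MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

If this little program also hangs with vulcan but completes with another
fcoll component, that would point at the write_all code path rather than at
netcdf or HDF5.
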
Thanks
Edgar

> -----Original Message-----
> From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Gabriel,
> Edgar via users
> Sent: Friday, October 25, 2019 7:43 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Gabriel, Edgar <egabr...@central.uh.edu>
> Subject: Re: [OMPI users] Deadlock in netcdf tests
> 
> Orion,
> I will look into this problem. Is there a specific code or test case that
> triggers this problem?
> Thanks
> Edgar
> 
> > -----Original Message-----
> > From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of
> > Orion Poplawski via users
> > Sent: Thursday, October 24, 2019 11:56 PM
> > To: Open MPI Users <users@lists.open-mpi.org>
> > Cc: Orion Poplawski <or...@nwra.com>
> > Subject: Re: [OMPI users] Deadlock in netcdf tests
> >
> > On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
> > > Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are
> > > seeing a test hang with openmpi 4.0.2.  Backtrace:
> > >
> > > (gdb) bt
> > > #0  0x00007f90c197529b in sched_yield () from /lib64/libc.so.6
> > > #1  0x00007f90c1ac8a05 in ompi_request_default_wait () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #2  0x00007f90c1b2b35c in ompi_coll_base_sendrecv_actual () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #3  0x00007f90c1b2bb73 in
> > > ompi_coll_base_allreduce_intra_recursivedoubling () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #4  0x00007f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from
> > > /usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
> > > #5  0x00007f90be9fada0 in mca_common_ompio_file_write_at_all () from
> > > /usr/lib64/openmpi/lib/libmca_common_ompio.so.41
> > > #6  0x00007f90beb0610b in mca_io_ompio_file_write_at_all () from
> > > /usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
> > > #7  0x00007f90c1af033f in PMPI_File_write_at_all () from
> > > /usr/lib64/openmpi/lib/libmpi.so.40
> > > #8  0x00007f90c1627d7b in H5FD_mpio_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #9  0x00007f90c14636ee in H5FD_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #10 0x00007f90c1442eb3 in H5F__accum_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #11 0x00007f90c1543729 in H5PB_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #12 0x00007f90c144d69c in H5F_block_write () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #13 0x00007f90c161cd10 in H5C_apply_candidate_list () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #14 0x00007f90c161ad02 in H5AC__run_sync_point () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #15 0x00007f90c161bd4f in H5AC__flush_entries () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #16 0x00007f90c13b154d in H5AC_flush () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #17 0x00007f90c1446761 in H5F__flush_phase2.part.0 () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #18 0x00007f90c1448e64 in H5F__flush () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #19 0x00007f90c144dc08 in H5F_flush_mounts_recurse () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #20 0x00007f90c144f171 in H5F_flush_mounts () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #21 0x00007f90c143e3a5 in H5Fflush () from
> > > /usr/lib64/openmpi/lib/libhdf5.so.103
> > > #22 0x00007f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at
> > > ../../libhdf5/hdf5file.c:222
> > > #23 0x00007f90c1c1816e in NC4_enddef (ncid=<optimized out>) at
> > > ../../libhdf5/hdf5file.c:544
> > > #24 0x00007f90c1bd94f3 in nc_enddef (ncid=65536) at
> > > ../../libdispatch/dfile.c:1004
> > > #25 0x000056527d0def27 in test_pio (flag=0) at
> > > ../../nc_test4/tst_parallel3.c:206
> > > #26 0x000056527d0de62c in main (argc=<optimized out>,
> > > argv=<optimized
> > > out>) at ../../nc_test4/tst_parallel3.c:91
> > >
> > > The processes are running at full CPU.
> > >
> > > Suggestions for debugging this would be greatly appreciated.
> > >
> >
> > Some more info - I now think this is more dependent on the openmpi
> > version than on netcdf itself:
> >
> > - last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx
> > 1.5.2, pmix-3.1.4.  Possible start of the failure was with openmpi
> > 4.0.2-rc1 and ucx 1.6.0.
> >
> > - netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2,
> > ucx 1.6.1, pmix 3.1.4
> >
> > - netcdf 4.7.0 test hangs on Fedora F31 with openmpi 4.0.2rc2 with
> > internal UCX.
> >
> > --
> > Orion Poplawski
> > Manager of NWRA Technical Systems          720-772-5637
> > NWRA, Boulder/CoRA Office             FAX: 303-415-9702
> > 3380 Mitchell Lane                       or...@nwra.com
> > Boulder, CO 80301                 https://www.nwra.com/
