[OMPI users] Deadlock in netcdf tests

2019-10-24 Thread Orion Poplawski via users
Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are seeing a 
test hang with openmpi 4.0.2.  Backtrace:


(gdb) bt
#0  0x7f90c197529b in sched_yield () from /lib64/libc.so.6
#1  0x7f90c1ac8a05 in ompi_request_default_wait () from 
/usr/lib64/openmpi/lib/libmpi.so.40
#2  0x7f90c1b2b35c in ompi_coll_base_sendrecv_actual () from 
/usr/lib64/openmpi/lib/libmpi.so.40
#3  0x7f90c1b2bb73 in 
ompi_coll_base_allreduce_intra_recursivedoubling () from 
/usr/lib64/openmpi/lib/libmpi.so.40
#4  0x7f90be96e9c5 in mca_fcoll_vulcan_file_write_all () from 
/usr/lib64/openmpi/lib/openmpi/mca_fcoll_vulcan.so
#5  0x7f90be9fada0 in mca_common_ompio_file_write_at_all () from 
/usr/lib64/openmpi/lib/libmca_common_ompio.so.41
#6  0x7f90beb0610b in mca_io_ompio_file_write_at_all () from 
/usr/lib64/openmpi/lib/openmpi/mca_io_ompio.so
#7  0x7f90c1af033f in PMPI_File_write_at_all () from 
/usr/lib64/openmpi/lib/libmpi.so.40
#8  0x7f90c1627d7b in H5FD_mpio_write () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#9  0x7f90c14636ee in H5FD_write () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#10 0x7f90c1442eb3 in H5F__accum_write () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#11 0x7f90c1543729 in H5PB_write () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#12 0x7f90c144d69c in H5F_block_write () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#13 0x7f90c161cd10 in H5C_apply_candidate_list () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#14 0x7f90c161ad02 in H5AC__run_sync_point () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#15 0x7f90c161bd4f in H5AC__flush_entries () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#16 0x7f90c13b154d in H5AC_flush () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#17 0x7f90c1446761 in H5F__flush_phase2.part.0 () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#18 0x7f90c1448e64 in H5F__flush () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#19 0x7f90c144dc08 in H5F_flush_mounts_recurse () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#20 0x7f90c144f171 in H5F_flush_mounts () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#21 0x7f90c143e3a5 in H5Fflush () from 
/usr/lib64/openmpi/lib/libhdf5.so.103
#22 0x7f90c1c178c0 in sync_netcdf4_file (h5=0x56527e439b10) at 
../../libhdf5/hdf5file.c:222
#23 0x7f90c1c1816e in NC4_enddef (ncid=<optimized out>) at 
../../libhdf5/hdf5file.c:544
#24 0x7f90c1bd94f3 in nc_enddef (ncid=65536) at 
../../libdispatch/dfile.c:1004
#25 0x56527d0def27 in test_pio (flag=0) at 
../../nc_test4/tst_parallel3.c:206
#26 0x56527d0de62c in main (argc=<optimized out>, argv=<optimized out>) at ../../nc_test4/tst_parallel3.c:91


The processes are running full out.
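For reference, tst_parallel3.c drives collective parallel I/O through netcdf-4/HDF5, and the hang is inside the collective flush that nc_enddef triggers (NC4_enddef -> H5Fflush -> MPI_File_write_at_all in the trace above). A minimal sketch of that call pattern, purely illustrative - the file name, dimension sizes, and data below are made up and are not the actual test values:

#include <mpi.h>
#include <netcdf.h>
#include <netcdf_par.h>

int main(int argc, char **argv)
{
    int rank, nprocs, ncid, dimid, varid;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* every rank opens the same file for parallel access */
    if (nc_create_par("hang_sketch.nc", NC_NETCDF4 | NC_MPIIO,
                      MPI_COMM_WORLD, MPI_INFO_NULL, &ncid))
        MPI_Abort(MPI_COMM_WORLD, 1);

    nc_def_dim(ncid, "x", 16 * (size_t)nprocs, &dimid);
    nc_def_var(ncid, "v", NC_INT, 1, &dimid, &varid);

    /* collective in parallel mode: flushes metadata via H5Fflush,
       which ends up in MPI_File_write_at_all - the frame that hangs */
    nc_enddef(ncid);

    /* collective write of each rank's slab */
    nc_var_par_access(ncid, varid, NC_COLLECTIVE);
    size_t start = 16 * (size_t)rank, count = 16;
    int buf[16] = {0};
    nc_put_vara_int(ncid, varid, &start, &count, buf);

    nc_close(ncid);
    MPI_Finalize();
    return 0;
}

Both nc_enddef (in parallel mode) and the underlying MPI_File_write_at_all are collective, so if any rank takes a different code path and never reaches the call, the remaining ranks will spin in ompi_request_default_wait exactly as in the backtrace.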

Suggestions for debugging this would be greatly appreciated.

--
Orion Poplawski
Manager of NWRA Technical Systems  720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane   or...@nwra.com
Boulder, CO 80301 https://www.nwra.com/





Re: [OMPI users] Deadlock in netcdf tests

2019-10-24 Thread Orion Poplawski via users

On 10/24/19 9:28 PM, Orion Poplawski via users wrote:
Starting with netcdf 4.7.1 (and 4.7.2) in Fedora Rawhide we are seeing a 
test hang with openmpi 4.0.2.


Some more info - I now think this is more dependent on the openmpi version 
than on netcdf itself:


- The last successful build was with netcdf 4.7.0, openmpi 4.0.1, ucx 1.5.2, 
and pmix 3.1.4. The failure possibly started with openmpi 4.0.2-rc1 
and ucx 1.6.0.


- The netcdf 4.7.0 test hangs on Fedora Rawhide (F32) with openmpi 4.0.2, 
ucx 1.6.1, and pmix 3.1.4.


- The netcdf 4.7.0 test hangs on Fedora F31 with openmpi 4.0.2-rc2 and the 
internal UCX.
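
If the problem really is on the openmpi side rather than in netcdf, one way to narrow it down might be a bare MPI-IO collective write that exercises the same ompio collective path without HDF5 or netcdf in the stack. A rough sketch, purely illustrative (file name, counts, and offsets are arbitrary; this is not a known reproducer):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* all ranks join the collective open and write, going through
       ompio's collective write path (vulcan in the backtrace above)
       with no HDF5 involved */
    MPI_File_open(MPI_COMM_WORLD, "ompio_sketch.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    int buf[1024];
    for (int i = 0; i < 1024; i++)
        buf[i] = rank;

    /* each rank writes its own contiguous block at a distinct offset */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(buf);
    MPI_File_write_at_all(fh, offset, buf, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

If something like this also hangs under openmpi 4.0.2 it points at ompio/fcoll itself; if it runs fine, the way HDF5 flushes metadata (ranks contributing different amounts, possibly zero bytes, to the collective write) becomes more suspect.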


--
Orion Poplawski
Manager of NWRA Technical Systems  720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane   or...@nwra.com
Boulder, CO 80301 https://www.nwra.com/


