Hello Howard,

Thanks for all of the help and suggestions; I will look into them. I also realized that my Ansible setup wasn't handling tar files properly, so the nightly build never actually got installed. I will install it by hand and give you an update sometime tomorrow afternoon.
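The by-hand install will look roughly like the following (a rough sketch; the tarball name and install prefix are placeholders for whatever the current v4.0.x nightly snapshot actually is):

    # download the current snapshot from https://www.open-mpi.org/nightly/v4.0.x/
    tar xjf openmpi-v4.0.x-<snapshot>.tar.bz2        # placeholder tarball name
    cd openmpi-v4.0.x-<snapshot>
    ./configure --prefix=/opt/openmpi/4.0.x-nightly  # example prefix only
    make -j 8 && sudo make install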
Thanks,
Adam LeBlanc

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard <hpprit...@gmail.com> wrote:

> Hello Adam,
>
> This helps some. Could you post the first 20 lines of your config.log? This will
> help in trying to reproduce. The content of your host file (you can use generic
> names for the nodes if that's an issue to publicize) would also help, as the
> number of nodes and number of MPI processes/node impacts the way the
> reduce scatter operation works.
>
> One thing to note about the openib BTL - it is on life support. That's why you
> needed to set btl_openib_allow_ib 1 on the mpirun command line.
>
> You may get much better success by installing UCX
> <https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX.
> You may actually already have UCX installed on your system if a recent version
> of MOFED is installed.
>
> You can check this by running /usr/bin/ofed_rpm_info. It will show which UCX
> version has been installed. If UCX is installed, you can add --with-ucx to the
> Open MPI configuration line and it should build in UCX support. If Open MPI is
> built with UCX support, it will by default use UCX for message transport rather
> than the openib BTL.
>
> thanks,
>
> Howard
>
>
> On Wed, Feb 20, 2019 at 12:49, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>
>> On the tcp side it doesn't seg fault anymore, but it will time out on some
>> tests; on the openib side it will still seg fault. Here is the output:
>>
>> [pandora:19256] *** Process received signal ***
>> [pandora:19256] Signal: Segmentation fault (11)
>> [pandora:19256] Signal code: Address not mapped (1)
>> [pandora:19256] Failing at address: 0x7f911c69fff0
>> [pandora:19255] *** Process received signal ***
>> [pandora:19255] Signal: Segmentation fault (11)
>> [pandora:19255] Signal code: Address not mapped (1)
>> [pandora:19255] Failing at address: 0x7ff09cd3fff0
>> [pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]
>> [pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]
>> [pandora:19256] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]
>> [pandora:19256] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]
>> [pandora:19256] [ 4] [pandora:19255] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]
>> [pandora:19255] [ 1] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]
>> [pandora:19256] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19256] [ 6] IMB-MPI1[0x407155]
>> [pandora:19256] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]
>> [pandora:19255] [ 2] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]
>> [pandora:19256] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19256] *** End of error message ***
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]
>> [pandora:19255] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]
>> [pandora:19255] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]
>> [pandora:19255] [ 5] IMB-MPI1[0x40b83b]
>> [pandora:19255] [ 6] IMB-MPI1[0x407155]
>> [pandora:19255] [ 7] IMB-MPI1[0x4022ea]
>> [pandora:19255] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]
>> [pandora:19255] [ 9] IMB-MPI1[0x401d49]
>> [pandora:19255] *** End of error message ***
>> [phoebe:12418] *** Process received signal ***
>> [phoebe:12418] Signal: Segmentation fault (11)
>> [phoebe:12418] Signal code: Address not mapped (1)
>> [phoebe:12418] Failing at address: 0x7f5ce27dfff0
>> [phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]
>> [phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]
>> [phoebe:12418] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]
>> [phoebe:12418] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]
>> [phoebe:12418] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]
>> [phoebe:12418] [ 5] IMB-MPI1[0x40b83b]
>> [phoebe:12418] [ 6] IMB-MPI1[0x407155]
>> [phoebe:12418] [ 7] IMB-MPI1[0x4022ea]
>> [phoebe:12418] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]
>> [phoebe:12418] [ 9] IMB-MPI1[0x401d49]
>> [phoebe:12418] *** End of error message ***
>> --------------------------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
>> signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>>
>> - Adam LeBlanc
>>
>> On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <users@lists.open-mpi.org> wrote:
>>
>>> Can you try the latest 4.0.x nightly snapshot and see if the problem
>>> still occurs?
>>>
>>> https://www.open-mpi.org/nightly/v4.0.x/
>>>
>>>
>>> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>> >
>>> > I do, here is the output:
>>> >
>>> > 2 total processes killed (some possibly by mpirun during cleanup)
>>> > [pandora:12238] *** Process received signal ***
>>> > [pandora:12238] Signal: Segmentation fault (11)
>>> > [pandora:12238] Signal code: Invalid permissions (2)
>>> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
>>> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
>>> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
>>> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
>>> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
>>> > [pandora:12237] Signal code: Invalid permissions (2)
>>> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
>>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
>>> > [pandora:12238] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
>>> > [pandora:12238] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
>>> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
>>> > [pandora:12238] [ 6] IMB-MPI1[0x407155]
>>> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
>>> > [pandora:12238] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
>>> > [pandora:12238] [ 9] IMB-MPI1[0x401d49]
>>> > [pandora:12238] *** End of error message ***
>>> > [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
>>> > [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
>>> > [pandora:12237] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
>>> > [pandora:12237] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
>>> > [pandora:12237] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
>>> > [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
>>> > [pandora:12237] [ 6] IMB-MPI1[0x407155]
>>> > [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
>>> > [pandora:12237] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
>>> > [pandora:12237] [ 9] IMB-MPI1[0x401d49]
>>> > [pandora:12237] *** End of error message ***
>>> > [phoebe:07408] *** Process received signal ***
>>> > [phoebe:07408] Signal: Segmentation fault (11)
>>> > [phoebe:07408] Signal code: Invalid permissions (2)
>>> > [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
>>> > [titan:07169] *** Process received signal ***
>>> > [titan:07169] Signal: Segmentation fault (11)
>>> > [titan:07169] Signal code: Invalid permissions (2)
>>> > [titan:07169] Failing at address: 0x7fc01295fff0
>>> > [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
>>> > [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
>>> > [phoebe:07408] [ 2] [titan:07169] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
>>> > [titan:07169] [ 1] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
>>> > [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
>>> > [titan:07169] [ 2] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
>>> > [phoebe:07408] [ 4] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
>>> > [titan:07169] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
>>> > [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
>>> > [phoebe:07408] [ 6] IMB-MPI1[0x407155]
>>> > /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
>>> > [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
>>> > [phoebe:07408] [ 8] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
>>> > [titan:07169] [ 5] IMB-MPI1[0x40b83b]
>>> > [titan:07169] [ 6] IMB-MPI1[0x407155]
>>> > [titan:07169] [ 7] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
>>> > [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
>>> > [phoebe:07408] *** End of error message ***
>>> > IMB-MPI1[0x4022ea]
>>> > [titan:07169] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
>>> > [titan:07169] [ 9] IMB-MPI1[0x401d49]
>>> > [titan:07169] *** End of error message ***
>>> > --------------------------------------------------------------------------
>>> > Primary job terminated normally, but 1 process returned
>>> > a non-zero exit code. Per user-direction, the job has been aborted.
>>> > --------------------------------------------------------------------------
>>> > --------------------------------------------------------------------------
>>> > mpirun noticed that process rank 0 with PID 0 on node pandora exited on signal 11 (Segmentation fault).
>>> > --------------------------------------------------------------------------
>>> >
>>> >
>>> > - Adam LeBlanc
>>> >
>>> > On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <hpprit...@gmail.com> wrote:
>>> > Hi Adam,
>>> >
>>> > As a sanity check, if you try to use --mca btl self,vader,tcp, do you
>>> > still see the segmentation fault?
>>> >
>>> > Howard
>>> >
>>> >
>>> > On Wed, Feb 20, 2019 at 08:50, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
>>> > Hello,
>>> >
>>> > When I do a run with Open MPI v4.0.0 on InfiniBand with this command:
>>> >
>>> > mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>>> >
>>> > I get this error:
>>> >
>>> > #----------------------------------------------------------------
>>> > # Benchmarking Reduce_scatter
>>> > # #processes = 4
>>> > # ( 2 additional processes waiting in MPI_Barrier)
>>> > #----------------------------------------------------------------
>>> >     #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>> >          0          1000         0.14         0.15         0.14
>>> >          4          1000         5.00         7.58         6.28
>>> >          8          1000         5.13         7.68         6.41
>>> >         16          1000         5.05         7.74         6.39
>>> >         32          1000         5.43         7.96         6.75
>>> >         64          1000         6.78         8.56         7.69
>>> >        128          1000         7.77         9.55         8.59
>>> >        256          1000         8.28        10.96         9.66
>>> >        512          1000         9.19        12.49        10.85
>>> >       1024          1000        11.78        15.01        13.38
>>> >       2048          1000        17.41        19.51        18.52
>>> >       4096          1000        25.73        28.22        26.89
>>> >       8192          1000        47.75        49.44        48.79
>>> >      16384          1000        81.10        90.15        84.75
>>> >      32768          1000       163.01       178.58       173.19
>>> >      65536           640       315.63       340.51       333.18
>>> >     131072           320       475.48       528.82       510.85
>>> >     262144           160       979.70      1063.81      1035.61
>>> >     524288            80      2070.51      2242.58      2150.15
>>> >    1048576            40      4177.36      4527.25      4431.65
>>> >    2097152            20      8738.08      9340.50      9147.89
>>> > [pandora:04500] *** Process received signal ***
>>> > [pandora:04500] Signal: Segmentation fault (11)
>>> > [pandora:04500] Signal code: Address not mapped (1)
>>> > [pandora:04500] Failing at address: 0x7f310ebffff0
>>> > [pandora:04499] *** Process received signal ***
>>> > [pandora:04499] Signal: Segmentation fault (11)
>>> > [pandora:04499] Signal code: Address not mapped (1)
>>> > [pandora:04499] Failing at address: 0x7f28b11ffff0
>>> > [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
>>> > [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
>>> > [pandora:04500] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
>>> > [pandora:04500] [ 3] [pandora:04499] [ 0] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
>>> > [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
>>> > [pandora:04499] [ 1] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
>>> > [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
>>> > [pandora:04500] [ 6] IMB-MPI1[0x407155]
>>> > [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
>>> > [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
>>> > [pandora:04499] [ 2] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
>>> > [pandora:04500] [ 9] IMB-MPI1[0x401d49]
>>> > [pandora:04500] *** End of error message ***
>>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
>>> > [pandora:04499] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
>>> > [pandora:04499] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
>>> > [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
>>> > [pandora:04499] [ 6] IMB-MPI1[0x407155]
>>> > [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
>>> > [pandora:04499] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
>>> > [pandora:04499] [ 9] IMB-MPI1[0x401d49]
>>> > [pandora:04499] *** End of error message ***
>>> > [phoebe:03779] *** Process received signal ***
>>> > [phoebe:03779] Signal: Segmentation fault (11)
>>> > [phoebe:03779] Signal code: Address not mapped (1)
>>> > [phoebe:03779] Failing at address: 0x7f483d6ffff0
>>> > [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
>>> > [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
>>> > [phoebe:03779] [ 2] /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
>>> > [phoebe:03779] [ 3] /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
>>> > [phoebe:03779] [ 4] /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
>>> > [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
>>> > [phoebe:03779] [ 6] IMB-MPI1[0x407155]
>>> > [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
>>> > [phoebe:03779] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
>>> > [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
>>> > [phoebe:03779] *** End of error message ***
>>> > --------------------------------------------------------------------------
>>> > Primary job terminated normally, but 1 process returned
>>> > a non-zero exit code. Per user-direction, the job has been aborted.
>>> > --------------------------------------------------------------------------
>>> > --------------------------------------------------------------------------
>>> > mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited on signal 11 (Segmentation fault).
>>> > --------------------------------------------------------------------------
>>> >
>>> > Also, if I reinstall 3.1.2 I do not have this issue at all.
>>> >
>>> > Any thoughts on what could be the issue?
>>> >
>>> > Thanks,
>>> > Adam LeBlanc
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
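P.S. For reference, the UCX suggestion above condensed into commands (a rough sketch; the grep filter is just a convenience and the configure prefix is only an example):

    # check whether the installed MOFED already ships UCX
    /usr/bin/ofed_rpm_info | grep -i ucx

    # if it does, add --with-ucx when configuring the nightly build, so Open MPI
    # uses UCX for message transport instead of the openib BTL
    ./configure --prefix=/opt/openmpi/4.0.x-nightly --with-ucx
    make -j 8 && sudo make install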