Can you try the latest 4.0.x nightly snapshot and see if the problem still
occurs?
https://www.open-mpi.org/nightly/v4.0.x/
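
If it helps, here is a rough sketch of a rerun limited to the failing
benchmark. It assumes the snapshot is installed under
/opt/openmpi/4.0.x-nightly (a made-up prefix, adjust as needed) and that
IMB-MPI1 has been rebuilt against it; the mpirun options are copied from your
original command. It only prints the command unless RUN=1 is set.

```shell
# Hypothetical install prefix for the nightly build; adjust for your site.
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi/4.0.x-nightly}"

# Same BTL/PML options as the failing run, restricted to Reduce_scatter.
set -- "$OMPI_PREFIX/bin/mpirun" \
    --mca btl openib,vader,self --mca pml ob1 \
    --mca btl_openib_allow_ib 1 \
    --map-by node -np 6 \
    -hostfile /home/aleblanc/ib-mpi-hosts \
    IMB-MPI1 Reduce_scatter

# Show the command; execute it only when RUN=1 is set in the environment.
echo "$@"
if [ "${RUN:-0}" = "1" ]; then
    "$@"
fi
```

Naming the benchmark on the IMB-MPI1 command line skips the other tests, so
the run reaches the large Reduce_scatter message sizes much sooner.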
> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <[email protected]> wrote:
>
> I do. Here is the output:
>
> 2 total processes killed (some possibly by mpirun during cleanup)
> [pandora:12238] *** Process received signal ***
> [pandora:12238] Signal: Segmentation fault (11)
> [pandora:12238] Signal code: Invalid permissions (2)
> [pandora:12238] Failing at address: 0x7f5c8e31fff0
> [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> [pandora:12237] Signal code: Invalid permissions (2)
> [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> [pandora:12238] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> [pandora:12238] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12238] [ 6] IMB-MPI1[0x407155]
> [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12238] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
> [pandora:12238] [ 9] IMB-MPI1[0x401d49]
> [pandora:12238] *** End of error message ***
> [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
> [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
> [pandora:12237] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
> [pandora:12237] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
> [pandora:12237] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
> [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12237] [ 6] IMB-MPI1[0x407155]
> [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12237] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
> [pandora:12237] [ 9] IMB-MPI1[0x401d49]
> [pandora:12237] *** End of error message ***
> [phoebe:07408] *** Process received signal ***
> [phoebe:07408] Signal: Segmentation fault (11)
> [phoebe:07408] Signal code: Invalid permissions (2)
> [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
> [titan:07169] *** Process received signal ***
> [titan:07169] Signal: Segmentation fault (11)
> [titan:07169] Signal code: Invalid permissions (2)
> [titan:07169] Failing at address: 0x7fc01295fff0
> [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
> [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
> [phoebe:07408] [ 2] [titan:07169] [ 0]
> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
> [titan:07169] [ 1]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
> [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
> [titan:07169] [ 2]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
> [phoebe:07408] [ 4]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
> [titan:07169] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
> [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:07408] [ 6] IMB-MPI1[0x407155]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
> [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:07408] [ 8]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
> [titan:07169] [ 5] IMB-MPI1[0x40b83b]
> [titan:07169] [ 6] IMB-MPI1[0x407155]
> [titan:07169] [ 7]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
> [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
> [phoebe:07408] *** End of error message ***
> IMB-MPI1[0x4022ea]
> [titan:07169] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
> [titan:07169] [ 9] IMB-MPI1[0x401d49]
> [titan:07169] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
>
> - Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <[email protected]> wrote:
> Hi Adam,
>
> As a sanity check, if you use --mca btl self,vader,tcp, do you still see
> the segmentation fault?
>
> Howard
>
>
> On Wed, Feb 20, 2019 at 8:50 AM Adam LeBlanc <[email protected]> wrote:
> Hello,
>
> When I run Open MPI v4.0.0 over InfiniBand with this command:
>
>   mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node \
>     --mca orte_base_help_aggregate 0 --mca btl openib,vader,self \
>     --mca pml ob1 --mca btl_openib_allow_ib 1 -np 6 \
>     -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>
> I get this error:
>
> #----------------------------------------------------------------
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #----------------------------------------------------------------
>   #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec]
>        0         1000        0.14        0.15        0.14
>        4         1000        5.00        7.58        6.28
>        8         1000        5.13        7.68        6.41
>       16         1000        5.05        7.74        6.39
>       32         1000        5.43        7.96        6.75
>       64         1000        6.78        8.56        7.69
>      128         1000        7.77        9.55        8.59
>      256         1000        8.28       10.96        9.66
>      512         1000        9.19       12.49       10.85
>     1024         1000       11.78       15.01       13.38
>     2048         1000       17.41       19.51       18.52
>     4096         1000       25.73       28.22       26.89
>     8192         1000       47.75       49.44       48.79
>    16384         1000       81.10       90.15       84.75
>    32768         1000      163.01      178.58      173.19
>    65536          640      315.63      340.51      333.18
>   131072          320      475.48      528.82      510.85
>   262144          160      979.70     1063.81     1035.61
>   524288           80     2070.51     2242.58     2150.15
>  1048576           40     4177.36     4527.25     4431.65
>  2097152           20     8738.08     9340.50     9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310ebffff0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b11ffff0
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d6ffff0
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:03779] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
> [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
> [phoebe:03779] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
>
> Also, if I reinstall 3.1.2, I do not have this issue at all.
>
> Any thoughts on what could be the issue?
>
> Thanks,
> Adam LeBlanc
> _______________________________________________
> users mailing list
> [email protected]
> https://lists.open-mpi.org/mailman/listinfo/users
--
Jeff Squyres
[email protected]
_______________________________________________
users mailing list
[email protected]
https://lists.open-mpi.org/mailman/listinfo/users