Hello Adam,

 

During the InfiniBand Plugfest 34 event last October, we found that mpirun hangs 
on FDR systems when run with the openib BTL.

 

Yossi Itigin (@Mellanox) suggested that we run using the following options:

        --mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096
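
For example, a full command line along those lines might look like the 
following (just a sketch that reuses the hostfile and benchmark from your 
earlier run; adjust paths and process counts to match your setup):

        mpirun --mca btl self,vader --mca pml ucx -x UCX_RC_PATH_MTU=4096 \
               -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1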

 

If you still have trouble, please try the above options (along with Howard's 
suggestion) and see if that resolves the problem.

 

Thanks.

--

Llolsten

 

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Adam LeBlanc
Sent: Wednesday, February 20, 2019 5:18 PM
To: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

 

Hello Howard,

 

Thanks for all of the help and suggestions; I will look into them. I also 
realized that my Ansible setup wasn't handling tar files properly, so the 
nightly build never actually installed. I will build it by hand and give you an 
update tomorrow afternoon.

 

Thanks,

Adam LeBlanc

 

On Wed, Feb 20, 2019 at 4:26 PM Howard Pritchard <hpprit...@gmail.com> wrote:

Hello Adam,

 

This helps some.  Could you post the first 20 lines of your config.log?  This 
will help in trying to reproduce.  The contents of your host file (you can use 
generic names for the nodes if that's an issue to publicize) would also help, 
as the number of nodes and the number of MPI processes/node impact the way the 
reduce scatter operation works.
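
Something like the following would capture what I'm after (the host file lines 
below are only an illustration with generic node names and slot counts; yours 
will differ):

    # first 20 lines of the configure log from your Open MPI build tree
    head -20 config.log

    # host file format I have in mind (generic names are fine), e.g.:
    #   node01 slots=2
    #   node02 slots=2
    #   node03 slots=2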

 

One thing to note about the openib BTL - it is on life support.  That's why 
you needed to set btl_openib_allow_ib 1 on the mpirun command line.

 

You may get much better success by installing UCX 
<https://github.com/openucx/ucx/releases> and rebuilding Open MPI to use UCX.  
You may actually already have UCX installed on your system if a recent version 
of MOFED is installed.

 

You can check this by running /usr/bin/ofed_rpm_info.  It will show which UCX 
version has been installed.  If UCX is installed, you can add --with-ucx to the 
Open MPI configuration line and it should build in UCX support.  If Open MPI is 
built with UCX support, it will by default use UCX for message transport rather 
than the OpenIB BTL.
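
As a rough sketch, the check and rebuild would look something like this (the 
grep is just for convenience, and the install prefix is only an example -- 
point --with-ucx at your UCX install location if it is not in a default path):

    # see whether MOFED already shipped a UCX package
    /usr/bin/ofed_rpm_info | grep -i ucx

    # rebuild Open MPI with UCX support
    ./configure --prefix=/opt/openmpi/4.0.0 --with-ucx
    make -j 8 && make install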

 

thanks,

 

Howard

 

 

On Wed, Feb 20, 2019 at 12:49 PM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:

On the TCP side it doesn't segfault anymore, though it does time out on some 
tests; on the openib side it still segfaults.  Here is the output:

 

[pandora:19256] *** Process received signal ***

[pandora:19256] Signal: Segmentation fault (11)

[pandora:19256] Signal code: Address not mapped (1)

[pandora:19256] Failing at address: 0x7f911c69fff0

[pandora:19255] *** Process received signal ***

[pandora:19255] Signal: Segmentation fault (11)

[pandora:19255] Signal code: Address not mapped (1)

[pandora:19255] Failing at address: 0x7ff09cd3fff0

[pandora:19256] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f913467f680]

[pandora:19256] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f91343ec4a0]

[pandora:19256] [ 2] 
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f9133d1be55]

[pandora:19256] [ 3] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f913493798b]

[pandora:19256] [ 4] [pandora:19255] [ 0] 
/usr/lib64/libpthread.so.0(+0xf680)[0x7ff0b4d27680]

[pandora:19255] [ 1] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f913490eda7]

[pandora:19256] [ 5] IMB-MPI1[0x40b83b]

[pandora:19256] [ 6] IMB-MPI1[0x407155]

[pandora:19256] [ 7] IMB-MPI1[0x4022ea]

[pandora:19256] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7ff0b4a944a0]

[pandora:19255] [ 2] 
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f91342c23d5]

[pandora:19256] [ 9] IMB-MPI1[0x401d49]

[pandora:19256] *** End of error message ***

/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7ff0b43c3e55]

[pandora:19255] [ 3] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7ff0b4fdf98b]

[pandora:19255] [ 4] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7ff0b4fb6da7]

[pandora:19255] [ 5] IMB-MPI1[0x40b83b]

[pandora:19255] [ 6] IMB-MPI1[0x407155]

[pandora:19255] [ 7] IMB-MPI1[0x4022ea]

[pandora:19255] [ 8] 
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff0b496a3d5]

[pandora:19255] [ 9] IMB-MPI1[0x401d49]

[pandora:19255] *** End of error message ***

[phoebe:12418] *** Process received signal ***

[phoebe:12418] Signal: Segmentation fault (11)

[phoebe:12418] Signal code: Address not mapped (1)

[phoebe:12418] Failing at address: 0x7f5ce27dfff0

[phoebe:12418] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5cfa767680]

[phoebe:12418] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5cfa4d44a0]

[phoebe:12418] [ 2] 
/opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5cf9e03e55]

[phoebe:12418] [ 3] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5cfaa1f98b]

[phoebe:12418] [ 4] 
/opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5cfa9f6da7]

[phoebe:12418] [ 5] IMB-MPI1[0x40b83b]

[phoebe:12418] [ 6] IMB-MPI1[0x407155]

[phoebe:12418] [ 7] IMB-MPI1[0x4022ea]

[phoebe:12418] [ 8] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5cfa3aa3d5]

[phoebe:12418] [ 9] IMB-MPI1[0x401d49]

[phoebe:12418] *** End of error message ***

--------------------------------------------------------------------------

Primary job  terminated normally, but 1 process returned

a non-zero exit code. Per user-direction, the job has been aborted.

--------------------------------------------------------------------------

--------------------------------------------------------------------------

mpirun noticed that process rank 0 with PID 0 on node pandora exited on signal 
11 (Segmentation fault).

--------------------------------------------------------------------------

 

- Adam LeBlanc

 

On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org> wrote:

Can you try the latest 4.0.x nightly snapshot and see if the problem still 
occurs?

    https://www.open-mpi.org/nightly/v4.0.x/
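
If it helps, a by-hand build of a snapshot is roughly (the tarball name below 
is just a placeholder -- substitute whatever snapshot is current on that page, 
and pick an install prefix that won't clobber your existing 4.0.0):

    # download a nightly tarball from the page above, then:
    tar xjf openmpi-v4.0.x-<snapshot>.tar.bz2
    cd openmpi-v4.0.x-<snapshot>
    ./configure --prefix=/opt/openmpi/4.0.x-nightly
    make -j 8 && make install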


> On Feb 20, 2019, at 1:40 PM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
> 
> I do; here is the output:
> 
> 2 total processes killed (some possibly by mpirun during cleanup)
> [pandora:12238] *** Process received signal ***
> [pandora:12238] Signal: Segmentation fault (11)
> [pandora:12238] Signal code: Invalid permissions (2)
> [pandora:12238] Failing at address: 0x7f5c8e31fff0
> [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
> [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
> /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
> [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
> [pandora:12237] Signal code: Invalid permissions (2)
> [pandora:12237] Failing at address: 0x7f6c4ab3fff0
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
> [pandora:12238] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
> [pandora:12238] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
> [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12238] [ 6] IMB-MPI1[0x407155]
> [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12238] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
> [pandora:12238] [ 9] IMB-MPI1[0x401d49]
> [pandora:12238] *** End of error message ***
> [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
> [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
> [pandora:12237] [ 2] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
> [pandora:12237] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
> [pandora:12237] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
> [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
> [pandora:12237] [ 6] IMB-MPI1[0x407155]
> [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
> [pandora:12237] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
> [pandora:12237] [ 9] IMB-MPI1[0x401d49]
> [pandora:12237] *** End of error message ***
> [phoebe:07408] *** Process received signal ***
> [phoebe:07408] Signal: Segmentation fault (11)
> [phoebe:07408] Signal code: Invalid permissions (2)
> [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
> [titan:07169] *** Process received signal ***
> [titan:07169] Signal: Segmentation fault (11)
> [titan:07169] Signal code: Invalid permissions (2)
> [titan:07169] Failing at address: 0x7fc01295fff0
> [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
> [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
> [phoebe:07408] [ 2] [titan:07169] [ 0] 
> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
> [titan:07169] [ 1] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
> [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
> [titan:07169] [ 2] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
> [phoebe:07408] [ 4] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
> [titan:07169] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
> [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:07408] [ 6] IMB-MPI1[0x407155]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
> [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:07408] [ 8] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
> [titan:07169] [ 5] IMB-MPI1[0x40b83b]
> [titan:07169] [ 6] IMB-MPI1[0x407155]
> [titan:07169] [ 7] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
> [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
> [phoebe:07408] *** End of error message ***
> IMB-MPI1[0x4022ea]
> [titan:07169] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fc025d5a3d5]
> [titan:07169] [ 9] IMB-MPI1[0x401d49]
> [titan:07169] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node pandora exited on 
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> 
> - Adam LeBlanc
> 
> On Wed, Feb 20, 2019 at 1:20 PM Howard Pritchard <hpprit...@gmail.com> wrote:
> Hi Adam,
> 
> As a sanity check, if you try to use --mca btl self,vader,tcp
> 
> do you still see the segmentation fault?
> 
> Howard
> 
> 
> On Wed, Feb 20, 2019 at 8:50 AM, Adam LeBlanc <alebl...@iol.unh.edu> wrote:
> Hello,
> 
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command: mpirun 
> --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca 
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca 
> btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
> 
> I get this error:
> 
> #----------------------------------------------------------------
> # Benchmarking Reduce_scatter 
> # #processes = 4 
> # ( 2 additional processes waiting in MPI_Barrier)
> #----------------------------------------------------------------
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>             0         1000         0.14         0.15         0.14
>             4         1000         5.00         7.58         6.28
>             8         1000         5.13         7.68         6.41
>            16         1000         5.05         7.74         6.39
>            32         1000         5.43         7.96         6.75
>            64         1000         6.78         8.56         7.69
>           128         1000         7.77         9.55         8.59
>           256         1000         8.28        10.96         9.66
>           512         1000         9.19        12.49        10.85
>          1024         1000        11.78        15.01        13.38
>          2048         1000        17.41        19.51        18.52
>          4096         1000        25.73        28.22        26.89
>          8192         1000        47.75        49.44        48.79
>         16384         1000        81.10        90.15        84.75
>         32768         1000       163.01       178.58       173.19
>         65536          640       315.63       340.51       333.18
>        131072          320       475.48       528.82       510.85
>        262144          160       979.70      1063.81      1035.61
>        524288           80      2070.51      2242.58      2150.15
>       1048576           40      4177.36      4527.25      4431.65
>       2097152           20      8738.08      9340.50      9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310ebffff0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b11ffff0
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d6ffff0
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2] 
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4] 
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] IMB-MPI1[0x4022ea]
> [phoebe:03779] [ 8] 
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f485530a3d5]
> [phoebe:03779] [ 9] IMB-MPI1[0x401d49]
> [phoebe:03779] *** End of error message ***
> --------------------------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3779 on node phoebe-ib exited on 
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> 
> Also, if I reinstall 3.1.2, I do not have this issue at all.
> 
> Any thoughts on what could be the issue?
> 
> Thanks,
> Adam LeBlanc


-- 
Jeff Squyres
jsquy...@cisco.com

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
