Hi,

If some MX devices are found, the logic is to use not only the mx BTL but also
the mx MTL. Since ob1 is a PML rather than an MTL, you can try to disable the
MTL path with --mca pml ob1 (or exclude the component with --mca mtl ^mx).
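
A minimal sketch of that suggestion, assuming ob1 is selected as the PML (ob1 is a PML component, so it goes with "pml" rather than "mtl") and that the test stays on a single node so the sm and self BTLs are enough. The OMPI_MCA_<name> environment variables are the equivalent of passing --mca <name> <value> on the command line. The osu_bw binary is the one from the original report, and whether this avoids the segmentation fault on your build is untested here.

```shell
# Force the ob1 PML so that the MTL layer (including mx) is not selected.
export OMPI_MCA_pml=ob1
# Assumption: single-node run, so shared memory and self are sufficient.
export OMPI_MCA_btl=sm,self

# Run the benchmark only if both mpirun and osu_bw are on the PATH;
# otherwise print the equivalent explicit command line.
if command -v mpirun >/dev/null 2>&1 && command -v osu_bw >/dev/null 2>&1; then
    mpirun -np 2 osu_bw
else
    echo "would run: mpirun -np 2 --mca pml ob1 --mca btl sm,self osu_bw"
fi
```

If the mx_open_endpoint errors still show up with these settings, that would suggest the MX library is being initialized somewhere other than the MTL.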

Aurelien




On June 11, 2012, at 18:24, Yong Qin wrote:

> Hi,
> 
> We are migrating to Open MPI 1.6, but since 1.6 dropped support for the
> Myricom GM driver, we have to switch to the MX driver. We have the
> Myricom MX2G 1.2.16 driver installed. However, upon testing the new
> build of Open MPI on a node without an actual Myrinet device, we are
> getting the following segmentation fault.
> 
> <---->
> [yqin@n0007.scs00 ~]$ mpirun -np 2  -np 2 osu_bw
> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> --------------------------------------------------------------------------
> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
> 
> Module: Myrinet/MX
>  Host: n0007.scs00
> 
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> [n0007:03074] *** Process received signal ***
> [n0007:03074] Signal: Segmentation fault (11)
> [n0007:03074] Signal code: Invalid permissions (2)
> [n0007:03074] Failing at address: 0x2b9112128130
> [n0007:03075] *** Process received signal ***
> [n0007:03075] Signal: Segmentation fault (11)
> [n0007:03075] Signal code: Invalid permissions (2)
> [n0007:03075] Failing at address: 0x2b041c9f1130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [n0007.scs00:03073] 1 more process has sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> <---->
> 
> Excluding the MX BTL does not get us any further.
> 
> <---->
> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007:03453] *** Process received signal ***
> [n0007:03453] Signal: Segmentation fault (11)
> [n0007:03453] Signal code: Address not mapped (1)
> [n0007:03453] Failing at address: 0x2b3c1fe73130
> [n0007:03454] *** Process received signal ***
> [n0007:03454] Signal: Segmentation fault (11)
> [n0007:03454] Signal code: Address not mapped (1)
> [n0007:03454] Failing at address: 0x2b2431bf0130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <---->
> 
> If we use only designated BTLs such as SM and SELF, the binary runs but
> still hits a segmentation fault towards the end.
> 
> <---->
> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> # OSU MPI Bandwidth Test v3.3
> # Size        Bandwidth (MB/s)
> 1                         2.54
> 2                         5.22
> 4                        10.92
> 8                        21.61
> 16                       43.89
> 32                       62.19
> 64                      121.95
> 128                     212.28
> 256                     337.52
> 512                     516.67
> 1024                    701.29
> 2048                    845.69
> 4096                    836.45
> 8192                    934.31
> 16384                  1035.53
> 32768                  1186.90
> 65536                  1390.41
> 131072                 1519.14
> 262144                 1562.96
> 524288                 1596.78
> 1048576                1611.48
> 2097152                1616.09
> 4194304                1620.47
> [n0007:03461] *** Process received signal ***
> [n0007:03460] *** Process received signal ***
> [n0007:03460] Signal: Segmentation fault (11)
> [n0007:03460] Signal code: Address not mapped (1)
> [n0007:03460] Failing at address: 0x2acac044d130
> [n0007:03461] Signal: Segmentation fault (11)
> [n0007:03461] Signal code: Address not mapped (1)
> [n0007:03461] Failing at address: 0x2b8bc4121130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <---->
> 
> 
> Can anybody shed some light here? It looks like ompi is trying to open
> the MX device no matter what. This is on a fresh build of Open MPI 1.6
> with "--with-mx --with-openib" options. We didn't have such an issue
> with the old GM BTL.
> 
> Thanks,
> 
> Yong Qin
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375






