Ah, I guess my original understanding of the PML was wrong. Adding
"-mca pml ob1" does ease the problem. But the question still remains:
why did ompi decide to use the mx BTL in the first place, given that
there is no physical device on board at all? This behavior is
completely different from that of the original gm BTL.
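
For reference, a minimal sketch of the workaround discussed in this
thread, assuming the same osu_bw benchmark and two local processes.
Forcing the ob1 PML keeps the MX MTL out of the picture, and the BTL
list is restricted to sm and self. The RUN switch and the script
layout are hypothetical conveniences for illustration, not part of
Open MPI; by default the script only prints the command.

```shell
#!/bin/sh
# Sketch only: select the ob1 PML so that no MTL (including the MX MTL)
# is used, and limit the BTLs to shared memory and self.
MPIRUN_ARGS="-np 2 -mca pml ob1 -mca btl sm,self osu_bw"

# Alternative to the command-line flag: Open MPI also reads MCA
# parameters from environment variables named OMPI_MCA_<param>.
# Either form alone should be enough.
OMPI_MCA_pml=ob1
export OMPI_MCA_pml

# If ompi_info is on the PATH, list the lines that mention MX so you can
# see which MX components this build contains.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i mx || echo "no MX components listed"
fi

# RUN is a hypothetical switch for this sketch: set RUN=1 to launch,
# otherwise the command is only printed.
if [ "${RUN:-0}" = "1" ]; then
    # Word splitting of $MPIRUN_ARGS is intentional here.
    mpirun $MPIRUN_ARGS
else
    echo "dry run: mpirun $MPIRUN_ARGS"
fi
```

Running it with RUN=1 launches the job; without it, the command line
can be reviewed first.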

Thanks,

Yong Qin

On Mon, Jun 11, 2012 at 3:59 PM, Aurélien Bouteiller
<boute...@eecs.utk.edu> wrote:
>
> On 11 June 2012 at 18:57, Aurélien Bouteiller wrote:
>
>> Hi,
>>
>> If some mx devices are found, the logic is not only to use the mx BTL but 
>> also to use the mx MTL. You can try to disable this with --mca mtl ob1.
>>
> Sorry, I meant --mca pml ob1
>
>> Aurelien
>>
>> On 11 June 2012 at 18:24, Yong Qin wrote:
>>
>>> Hi,
>>>
>>> We are migrating to Open MPI 1.6, but since 1.6 dropped support for
>>> the Myricom GM driver, we have to switch to the MX driver. We have the
>>> Myricom MX2G 1.2.16 driver installed. However, upon testing the new
>>> build of Open MPI on a node without an actual Myrinet device, we are
>>> getting the following segmentation fault.
>>>
>>> <---->
>>> [yqin@n0007.scs00 ~]$ mpirun -np 2  -np 2 osu_bw
>>> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> --------------------------------------------------------------------------
>>> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
>>> was unable to find any relevant network interfaces:
>>>
>>> Module: Myrinet/MX
>>> Host: n0007.scs00
>>>
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> [n0007:03074] *** Process received signal ***
>>> [n0007:03074] Signal: Segmentation fault (11)
>>> [n0007:03074] Signal code: Invalid permissions (2)
>>> [n0007:03074] Failing at address: 0x2b9112128130
>>> [n0007:03075] *** Process received signal ***
>>> [n0007:03075] Signal: Segmentation fault (11)
>>> [n0007:03075] Signal code: Invalid permissions (2)
>>> [n0007:03075] Failing at address: 0x2b041c9f1130
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> [n0007.scs00:03073] 1 more process has sent help message
>>> help-mpi-btl-base.txt / btl:no-nics
>>> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
>>> to see all help / error messages
>>> <---->
>>>
>>> Excluding the MX BTL does not get us any further.
>>>
>>> <---->
>>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
>>> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007:03453] *** Process received signal ***
>>> [n0007:03453] Signal: Segmentation fault (11)
>>> [n0007:03453] Signal code: Address not mapped (1)
>>> [n0007:03453] Failing at address: 0x2b3c1fe73130
>>> [n0007:03454] *** Process received signal ***
>>> [n0007:03454] Signal: Segmentation fault (11)
>>> [n0007:03454] Signal code: Address not mapped (1)
>>> [n0007:03454] Failing at address: 0x2b2431bf0130
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> <---->
>>>
>>> If we use only designated BTLs such as SM and SELF, the binary runs but
>>> still hits a segmentation fault towards the end.
>>>
>>> <---->
>>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
>>> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> # OSU MPI Bandwidth Test v3.3
>>> # Size        Bandwidth (MB/s)
>>> 1                         2.54
>>> 2                         5.22
>>> 4                        10.92
>>> 8                        21.61
>>> 16                       43.89
>>> 32                       62.19
>>> 64                      121.95
>>> 128                     212.28
>>> 256                     337.52
>>> 512                     516.67
>>> 1024                    701.29
>>> 2048                    845.69
>>> 4096                    836.45
>>> 8192                    934.31
>>> 16384                  1035.53
>>> 32768                  1186.90
>>> 65536                  1390.41
>>> 131072                 1519.14
>>> 262144                 1562.96
>>> 524288                 1596.78
>>> 1048576                1611.48
>>> 2097152                1616.09
>>> 4194304                1620.47
>>> [n0007:03461] *** Process received signal ***
>>> [n0007:03460] *** Process received signal ***
>>> [n0007:03460] Signal: Segmentation fault (11)
>>> [n0007:03460] Signal code: Address not mapped (1)
>>> [n0007:03460] Failing at address: 0x2acac044d130
>>> [n0007:03461] Signal: Segmentation fault (11)
>>> [n0007:03461] Signal code: Address not mapped (1)
>>> [n0007:03461] Failing at address: 0x2b8bc4121130
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> <---->
>>>
>>>
>>> Can anybody shed some light here? It looks like ompi is trying to open
>>> the MX device no matter what. This is on a fresh build of Open MPI 1.6
>>> with "--with-mx --with-openib" options. We didn't have such an issue
>>> with the old GM BTL.
>>>
>>> Thanks,
>>>
>>> Yong Qin
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> * Dr. Aurélien Bouteiller
>> * Researcher at Innovative Computing Laboratory
>> * University of Tennessee
>> * 1122 Volunteer Boulevard, suite 309b
>> * Knoxville, TN 37996
>> * 865 974 9375
>
