Ah, I guess my original understanding of the PML was wrong. Adding "-mca pml ob1" does help to ease the problem. But the question still remains: why did Open MPI decide to use the MX components in the first place, given that there is no physical device on board at all? This behavior is completely different from that of the original GM BTL.

Thanks,

Yong Qin
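For what it's worth, ompi_info (shipped with every Open MPI build) shows why MX is even considered on such a node: it lists the MCA components compiled into the build, and each listed component is opened and probed at startup whether or not the matching hardware is present. A minimal sketch — the grep pattern is only illustrative:

<---->
$ ompi_info | grep mx
<---->

If this prints both an "MCA btl: mx" and an "MCA mtl: mx" line, then both components were built in and will try to open the MX library on every node.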
On Mon, Jun 11, 2012 at 3:59 PM, Aurélien Bouteiller <boute...@eecs.utk.edu> wrote:
>
> Le 11 juin 2012 à 18:57, Aurélien Bouteiller a écrit :
>
>> Hi,
>>
>> If some mx devices are found, the logic is not only to use the mx BTL
>> but also to use the mx MTL. You can try to disable this with --mca mtl ob1.
>>
> Sorry, I meant --mca pml ob1
>
>> Aurelien
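Putting the correction together with the component exclusion already tried, a minimal sketch of the full workaround (osu_bw stands in for any MPI program; with ob1 pinned as the PML, the MTL framework is never selected, so the ^mx there is only belt and braces):

<---->
$ mpirun --mca pml ob1 --mca btl ^mx --mca mtl ^mx -np 2 osu_bw
<---->

The same parameters can also be set through the environment, e.g. export OMPI_MCA_pml=ob1.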
>> Le 11 juin 2012 à 18:24, Yong Qin a écrit :
>>
>>> Hi,
>>>
>>> We are migrating to Open MPI 1.6, but since 1.6 dropped support for
>>> the Myricom GM driver, we have to switch to the MX driver. We have the
>>> Myricom MX2G 1.2.16 driver installed. However, upon testing the new
>>> build of Open MPI on a node without the actual Myrinet device, we are
>>> getting the following segmentation fault.
>>>
>>> <---->
>>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -np 2 osu_bw
>>> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> --------------------------------------------------------------------------
>>> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
>>> was unable to find any relevant network interfaces:
>>>
>>> Module: Myrinet/MX
>>> Host: n0007.scs00
>>>
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> [n0007:03074] *** Process received signal ***
>>> [n0007:03074] Signal: Segmentation fault (11)
>>> [n0007:03074] Signal code: Invalid permissions (2)
>>> [n0007:03074] Failing at address: 0x2b9112128130
>>> [n0007:03075] *** Process received signal ***
>>> [n0007:03075] Signal: Segmentation fault (11)
>>> [n0007:03075] Signal code: Invalid permissions (2)
>>> [n0007:03075] Failing at address: 0x2b041c9f1130
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> [n0007.scs00:03073] 1 more process has sent help message
>>> help-mpi-btl-base.txt / btl:no-nics
>>> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
>>> to see all help / error messages
>>> <---->
>>>
>>> Excluding the MX BTL does not get us anywhere further.
>>>
>>> <---->
>>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
>>> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007:03453] *** Process received signal ***
>>> [n0007:03453] Signal: Segmentation fault (11)
>>> [n0007:03453] Signal code: Address not mapped (1)
>>> [n0007:03453] Failing at address: 0x2b3c1fe73130
>>> [n0007:03454] *** Process received signal ***
>>> [n0007:03454] Signal: Segmentation fault (11)
>>> [n0007:03454] Signal code: Address not mapped (1)
>>> [n0007:03454] Failing at address: 0x2b2431bf0130
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> <---->
>>>
>>> If we use only designated BTLs such as sm and self, the binary runs,
>>> but we still get a segmentation fault towards the end.
>>>
>>> <---->
>>> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
>>> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
>>> entry in /dev.)
>>> # OSU MPI Bandwidth Test v3.3
>>> # Size        Bandwidth (MB/s)
>>> 1             2.54
>>> 2             5.22
>>> 4             10.92
>>> 8             21.61
>>> 16            43.89
>>> 32            62.19
>>> 64            121.95
>>> 128           212.28
>>> 256           337.52
>>> 512           516.67
>>> 1024          701.29
>>> 2048          845.69
>>> 4096          836.45
>>> 8192          934.31
>>> 16384         1035.53
>>> 32768         1186.90
>>> 65536         1390.41
>>> 131072        1519.14
>>> 262144        1562.96
>>> 524288        1596.78
>>> 1048576       1611.48
>>> 2097152       1616.09
>>> 4194304       1620.47
>>> [n0007:03461] *** Process received signal ***
>>> [n0007:03460] *** Process received signal ***
>>> [n0007:03460] Signal: Segmentation fault (11)
>>> [n0007:03460] Signal code: Address not mapped (1)
>>> [n0007:03460] Failing at address: 0x2acac044d130
>>> [n0007:03461] Signal: Segmentation fault (11)
>>> [n0007:03461] Signal code: Address not mapped (1)
>>> [n0007:03461] Failing at address: 0x2b8bc4121130
>>> --------------------------------------------------------------------------
>>> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
>>> exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>> <---->
>>>
>>> Can anybody shed some light here? It looks like OMPI is trying to open
>>> the MX device no matter what. This is on a fresh build of Open MPI 1.6
>>> configured with "--with-mx --with-openib". We didn't have such an issue
>>> with the old GM BTL.
>>>
>>> Thanks,
>>>
>>> Yong Qin

> --
> * Dr. Aurélien Bouteiller
> * Researcher at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 309b
> * Knoxville, TN 37996
> * 865 974 9375
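If pinning the PML pans out, the settings can be made permanent on the hosts that have no Myrinet hardware through an MCA parameter file instead of per-job flags. A sketch, assuming the standard per-user file (the system-wide etc/openmpi-mca-params.conf under the installation prefix works the same way):

<---->
# $HOME/.openmpi/mca-params.conf on nodes without an MX device
pml = ob1
btl = ^mx
mtl = ^mx
<---->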