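Hi,

If some mx devices are found, the logic is to use not only the mx BTL but also the mx MTL. You can try to disable the latter with --mca pml ob1 (ob1 is a PML component, so it is selected through pml rather than mtl), or exclude just that component with --mca mtl ^mx.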
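A minimal sketch of that suggestion, assuming the `osu_bw` binary and the `sm,self` BTL list from the report quoted in this thread. It uses Open MPI's `OMPI_MCA_<name>` environment-variable form of MCA parameters so the setting applies to every later `mpirun` in the shell; whether it avoids the segmentation fault on this particular build is untested.

```shell
# Select the BTL-based ob1 PML so that the cm PML, and with it the
# MTLs (including mx), should not be chosen at startup.
export OMPI_MCA_pml=ob1

# Assumption: shared memory and self are the only BTLs wanted on the
# node that has no Myrinet device, as in the original report.
export OMPI_MCA_btl=sm,self

# Equivalent one-off command line (not executed here):
#   mpirun --mca pml ob1 --mca btl sm,self -np 2 osu_bw
#
# Alternative that leaves PML selection automatic and only excludes
# the mx MTL component:
#   mpirun --mca mtl ^mx --mca btl sm,self -np 2 osu_bw
```

Adding `--mca pml_base_verbose 10` to the `mpirun` line is one way to check which PML was actually selected.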
Aurelien

On June 11, 2012, at 18:24, Yong Qin wrote:

> Hi,
>
> We are migrating to Open MPI 1.6, but since 1.6 dropped support for
> the Myricom GM driver, we have to switch to the MX driver. We have the
> Myricom MX2G 1.2.16 driver installed. However, upon testing the new
> build of Open MPI on a node without the actual Myrinet device, we are
> getting the following segmentation fault.
>
> <---->
> [yqin@n0007.scs00 ~]$ mpirun -np 2 -np 2 osu_bw
> [n0007.scs00:03075] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03074] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> --------------------------------------------------------------------------
> [[32626,1],0]: A high-performance Open MPI point-to-point messaging module
> was unable to find any relevant network interfaces:
>
> Module: Myrinet/MX
>   Host: n0007.scs00
>
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> [n0007:03074] *** Process received signal ***
> [n0007:03074] Signal: Segmentation fault (11)
> [n0007:03074] Signal code: Invalid permissions (2)
> [n0007:03074] Failing at address: 0x2b9112128130
> [n0007:03075] *** Process received signal ***
> [n0007:03075] Signal: Segmentation fault (11)
> [n0007:03075] Signal code: Invalid permissions (2)
> [n0007:03075] Failing at address: 0x2b041c9f1130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3075 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [n0007.scs00:03073] 1 more process has sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [n0007.scs00:03073] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> <---->
>
> Excluding the MX BTL does not get any further.
>
> <---->
> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl ^mx -np 2 osu_bw
> [n0007.scs00:03453] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03454] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007:03453] *** Process received signal ***
> [n0007:03453] Signal: Segmentation fault (11)
> [n0007:03453] Signal code: Address not mapped (1)
> [n0007:03453] Failing at address: 0x2b3c1fe73130
> [n0007:03454] *** Process received signal ***
> [n0007:03454] Signal: Segmentation fault (11)
> [n0007:03454] Signal code: Address not mapped (1)
> [n0007:03454] Failing at address: 0x2b2431bf0130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 3454 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <---->
>
> If we use only designated BTLs such as SM and SELF, the binary runs but
> still gets a segmentation fault towards the end.
>
> <---->
> [yqin@n0007.scs00 ~]$ mpirun -np 2 -mca btl sm,self -np 2 osu_bw
> [n0007.scs00:03460] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> [n0007.scs00:03461] Error in mx_open_endpoint (error No MX device
> entry in /dev.)
> # OSU MPI Bandwidth Test v3.3
> # Size        Bandwidth (MB/s)
> 1                         2.54
> 2                         5.22
> 4                        10.92
> 8                        21.61
> 16                       43.89
> 32                       62.19
> 64                      121.95
> 128                     212.28
> 256                     337.52
> 512                     516.67
> 1024                    701.29
> 2048                    845.69
> 4096                    836.45
> 8192                    934.31
> 16384                  1035.53
> 32768                  1186.90
> 65536                  1390.41
> 131072                 1519.14
> 262144                 1562.96
> 524288                 1596.78
> 1048576                1611.48
> 2097152                1616.09
> 4194304                1620.47
> [n0007:03461] *** Process received signal ***
> [n0007:03460] *** Process received signal ***
> [n0007:03460] Signal: Segmentation fault (11)
> [n0007:03460] Signal code: Address not mapped (1)
> [n0007:03460] Failing at address: 0x2acac044d130
> [n0007:03461] Signal: Segmentation fault (11)
> [n0007:03461] Signal code: Address not mapped (1)
> [n0007:03461] Failing at address: 0x2b8bc4121130
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3460 on node n0007.scs00
> exited on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> <---->
>
>
> Can anybody shed some light here? It looks like ompi is trying to open
> the MX device no matter what. This is on a fresh build of Open MPI 1.6
> with "--with-mx --with-openib" options. We didn't have such an issue
> with the old GM BTL.
>
> Thanks,
>
> Yong Qin
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
* Dr. Aurélien Bouteiller
* Researcher at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 309b
* Knoxville, TN 37996
* 865 974 9375