Hi all,

We have used Open MPI releases up to version 1.8.7 without any problem on a small Linux cluster built with diskless nodes (x86_64, Fedora 17, Linux version 4.1.1 (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC))).
However, starting with version 1.8.8, we have a problem with the mpirun command. For instance, with Open MPI 1.10.0, MPI jobs spanning several nodes and the server launch successfully only when they are started from one of the nodes; they fail when started from the server.

To diagnose this, following the hints given in http://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems we tried a simple test:

[jdelia@coyote ~]$ which mpirun
/usr/beta/openmpi/bin/mpirun
[jdelia@coyote ~]$ mpirun --version
mpirun (Open MPI) 1.10.0
[jdelia@coyote ~]$ hostname
coyote
[jdelia@coyote ~]$ ssh node1
[jdelia@node1 ~]$ mpirun --host coyote hostname
coyote
[jdelia@node1 ~]$ exit
logout
Connection to node1 closed.
[jdelia@coyote ~]$ mpirun --host node1 hostname
[node1:17861] [[8026,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
... snip ...
--------------------------------------------------------------------------

The PATH and LD_LIBRARY_PATH on coyote (the server) and node1 were reduced to:

[jdelia@coyote ]$ ssh coyote env | grep -i PATH
LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles
QT_PLUGIN_PATH=/usr/lib64/kde4/plugins:/usr/lib/kde4/plugins

[jdelia@coyote ]$ ssh node1 env | grep -i PATH
LD_LIBRARY_PATH=/usr/beta/openmpi/lib:/usr/beta/gcc-trunk/lib:/usr/beta/gcc-trunk/lib64:/usr/lib:/usr/lib64:/usr/local/lib:/usr/local/lib64
PATH=.:/usr/beta/openmpi/bin:/usr/beta/gcc-trunk/bin:/usr/lib64/ccache:/usr/bin:/usr/sbin/usr/local/bin:/usr/local/sbin
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles

Up to version 1.8.7 these tests all passed. We then rebuilt several Open MPI distributions with the gcc compilers, both with the system version

gcc (GCC) 4.7.2 20120921 (Red Hat 4.7.2-2)

and with the experimental one

$ gcc --version
gcc (GCC) 6.0.0 20150919 (experimental)

but without luck. Again, if we go back to version 1.8.7, using the same environment variables, all tests pass.

Does anyone have a clue how to fix this problem? We have tried to attach the configure log files of the 1.8.7 and 1.8.10 versions built with the beta gcc compiler.

Thanks in advance.
Regards,
Jorge.
--
CIMEC (UNL-CONICET), http://www.cimec.org.ar/
Predio CONICET-Santa Fe, Colec. Ruta Nac. 168,
Paraje El Pozo, S3000GLN, Santa Fe, ARGENTINA
Univ Nac Litoral (UNL). Cons Nac Inv Científ y Técn (CONICET)
make-logs.tgz
Description: application/compressed-tar