Hi, I have been testing Open MPI 1.2, and now 1.2.1, on several BProc-based clusters, and I have found some problems/issues. All my clusters have standard Ethernet interconnects, either 100Base-T or Gigabit, on standard switches.
The clusters are all running Clustermatic 5 (BProc 4.x), and range from 32-bit Athlon, to 32-bit Xeon, to 64-bit Opteron. In all cases the same problems occur identically. I attach the results from "ompi_info --all" and the config.log for my latest build, on an Opteron cluster using the Pathscale compilers. I had exactly the same problems when using the vanilla GNU compilers.

Now for a description of the problem. When running an MPI code (cpi.c, from the standard MPI examples, also attached) with the mpirun defaults (i.e. -byslot) and a single process, it works:

sonoma:dgruner{134}> mpirun -n 1 ./cpip
[n17:30019] odls_bproc: openpty failed, using pipes instead
Process 0 on n17
pi is approximately 3.1415926544231341, Error is 0.0000000008333410
wall clock time = 0.000199

However, if one tries to run more than one process, this bombs:

sonoma:dgruner{134}> mpirun -n 2 ./cpip
.
.
.
[n21:30029] OOB: Connection to HNP lost
[n21:30029] OOB: Connection to HNP lost
[n21:30029] OOB: Connection to HNP lost
[n21:30029] OOB: Connection to HNP lost
[n21:30029] OOB: Connection to HNP lost
[n21:30029] OOB: Connection to HNP lost
.
.
... ad infinitum

If one uses the option "-bynode", things work:

sonoma:dgruner{145}> mpirun -bynode -n 2 ./cpip
[n17:30055] odls_bproc: openpty failed, using pipes instead
Process 0 on n17
Process 1 on n21
pi is approximately 3.1415926544231318, Error is 0.0000000008333387
wall clock time = 0.010375

Note that there is always the message about "openpty failed, using pipes instead". If I run more processes (on my 3-node cluster, with 2 CPUs per node), the openpty message appears repeatedly for the first node:

sonoma:dgruner{146}> mpirun -bynode -n 6 ./cpip
[n17:30061] odls_bproc: openpty failed, using pipes instead
[n17:30061] odls_bproc: openpty failed, using pipes instead
Process 0 on n17
Process 2 on n49
Process 1 on n21
Process 5 on n49
Process 3 on n17
Process 4 on n21
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.050332

Should I worry about the openpty failure? I suspect that communications may be slower this way. Using the -byslot option always fails, so this is a bug. The same occurs for all the codes that I have tried, both simple and complex.

Thanks for your attention to this.

Regards,
Daniel

--
Dr. Daniel Gruner                               dgru...@chem.utoronto.ca
Dept. of Chemistry                              daniel.gru...@utoronto.ca
University of Toronto                           phone: (416)-978-8689
80 St. George Street                            fax: (416)-978-5325
Toronto, ON M5S 3H6, Canada                     finger for PGP public key
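P.S. For anyone who does not want to unpack the attachment: cpi.c is essentially the standard MPI pi-computation example. A minimal sketch of that kind of program is below (my approximation of the standard example, not necessarily byte-for-byte the attached file); rank 0 broadcasts the number of intervals, each rank integrates 4/(1+x^2) over its share of the points, and the partial sums are reduced back to rank 0.

#include <mpi.h>
#include <stdio.h>
#include <math.h>

/* integrand: integral of 4/(1+x^2) over [0,1] equals pi */
static double f(double a) { return 4.0 / (1.0 + a * a); }

int main(int argc, char *argv[])
{
    const double PI25DT = 3.141592653589793238462643;
    int n = 10000;                      /* number of intervals */
    int myid, numprocs, namelen, i;
    double mypi, pi, h, sum, x, startwtime = 0.0;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Get_processor_name(processor_name, &namelen);

    printf("Process %d on %s\n", myid, processor_name);

    if (myid == 0)
        startwtime = MPI_Wtime();

    /* everyone needs to agree on the number of intervals */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* each rank sums every numprocs-th midpoint */
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs) {
        x = h * ((double) i - 0.5);
        sum += f(x);
    }
    mypi = h * sum;

    /* collect the partial sums on rank 0 */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0) {
        printf("pi is approximately %.16f, Error is %.16f\n",
               pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", MPI_Wtime() - startwtime);
    }

    MPI_Finalize();
    return 0;
}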
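P.P.S. My understanding of the "openpty failed, using pipes instead" warning is that the launcher tries to allocate a pseudo-terminal for forwarding each process's stdout/stderr and falls back to a plain pipe when no pty is available (e.g. on a slimmed-down BProc node). The sketch below only illustrates that generic pattern; it is not the actual odls_bproc source, and the function name setup_io_forwarding is made up for illustration.

/* generic openpty-with-pipe-fallback pattern; link with -lutil */
#include <pty.h>        /* openpty() */
#include <stdio.h>
#include <unistd.h>     /* pipe() */

/* On success returns 0 (pty) or 1 (pipe fallback), -1 on error.
 * parent_fd is the end the daemon reads output from; child_fd is the
 * end the child's stdout/stderr would be dup2()'d onto. */
static int setup_io_forwarding(int *parent_fd, int *child_fd)
{
    int amaster, aslave;

    if (openpty(&amaster, &aslave, NULL, NULL, NULL) == 0) {
        *parent_fd = amaster;
        *child_fd  = aslave;
        return 0;
    }

    /* no pty available: fall back to an ordinary pipe, as the warning says */
    fprintf(stderr, "openpty failed, using pipes instead\n");

    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    *parent_fd = fds[0];
    *child_fd  = fds[1];
    return 1;   /* output will not be tty-line-buffered in this case */
}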
cpi.c.gz
Description: GNU Zip compressed data
config.log.gz
Description: GNU Zip compressed data
ompiinfo.gz
Description: GNU Zip compressed data