I am using Open MPI 1.4.3 compiled with gcc 4.5.3.

The library:

/usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped



Jeff Squyres wrote:
Yes, PSM is the native transport for InfiniPath.  It is faster than the 
InfiniBand verbs support on the same hardware.

What version of Open MPI are you using?


On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:

Hello,

I am getting random crashes (segmentation faults) on a supercomputer
(guillimin) using 3 nodes with 12 cores per node. The same program (Ray)
runs without any problem on the other supercomputers I use.

The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA",
and messages transit using "Performance Scaled Messaging" (PSM), which I
think is some sort of replacement for InfiniBand verbs, although I am not
sure.

Adding '--mca mtl ^psm' to the Open MPI mpiexec options works around the
problem, but it increases the latency from 20 microseconds to 55 microseconds.
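
For reference, a minimal ping-pong along these lines is enough to reproduce
the latency comparison; the iteration count below is an arbitrary choice of
mine, not a value from the original runs:

#include <mpi.h>
#include <stdio.h>

/* Bounce one byte between ranks 0 and 1 and report the one-way latency.
 * Build with mpicc, then run twice:
 *   mpiexec -n 2 ./pingpong
 *   mpiexec -n 2 --mca mtl ^psm ./pingpong
 */
int main(int argc, char **argv)
{
    int rank, i;
    char byte = 0;
    const int iterations = 10000;
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();

    for (i = 0; i < iterations; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    elapsed = MPI_Wtime() - start;

    /* Each iteration is one round trip, i.e. two one-way messages. */
    if (rank == 0)
        printf("one-way latency: %.2f microseconds\n",
               elapsed / (2.0 * iterations) * 1.0e6);

    MPI_Finalize();
    return 0;
}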

There seems to be some sort of message corruption in transit, but I cannot
rule out other explanations.
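
One way to test the corruption hypothesis is to checksum every payload on
the sending side and verify it on arrival. The sketch below is hypothetical
(checked_send/checked_recv are my own names, not part of Ray); it aborts as
soon as a message arrives damaged:

#include <mpi.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Simple multiplicative checksum; a CRC32 would be stronger. */
static uint32_t checksum(const uint8_t *data, int length)
{
    uint32_t sum = 0;
    int i;
    for (i = 0; i < length; i++)
        sum = sum * 31u + data[i];
    return sum;
}

/* Send 'length' payload bytes followed by a 4-byte checksum trailer. */
static void checked_send(const uint8_t *payload, int length,
                         int destination, int tag, MPI_Comm comm)
{
    uint8_t *buffer = malloc(length + 4);
    uint32_t sum = checksum(payload, length);
    memcpy(buffer, payload, length);
    memcpy(buffer + length, &sum, 4);
    MPI_Send(buffer, length + 4, MPI_BYTE, destination, tag, comm);
    free(buffer);
}

/* Receive a payload plus trailer and abort loudly on a mismatch. */
static void checked_recv(uint8_t *payload, int length,
                         int source, int tag, MPI_Comm comm)
{
    uint8_t *buffer = malloc(length + 4);
    uint32_t expected;
    MPI_Recv(buffer, length + 4, MPI_BYTE, source, tag, comm,
             MPI_STATUS_IGNORE);
    memcpy(payload, buffer, length);
    memcpy(&expected, buffer + length, 4);
    if (checksum(payload, length) != expected) {
        fprintf(stderr, "checksum mismatch: message corrupted in transit\n");
        MPI_Abort(comm, 1);
    }
    free(buffer);
}

int main(int argc, char **argv)
{
    int rank;
    uint8_t message[64];
    memset(message, 0xAB, sizeof(message)); /* arbitrary test pattern */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        checked_send(message, sizeof(message), 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        checked_recv(message, sizeof(message), 0, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

If the corruption only happens with PSM, messages verified this way should
fail there and pass with '--mca mtl ^psm'.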


I have no idea what is going on or why disabling PSM solves the problem.


Versions

module load gcc/4.5.3
module load openmpi/1.4.3-gcc


Command that randomly crashes

mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq


Command that completes successfully

mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq



Sébastien Boisvert