Hi,
Thank you for the guidance.
I installed Open-MPI 1.6 and the program is also crashing with 1.6.
Could there be a bug in my code? I don't see how disabling PSM would make the
bug go away if the bug were in my own code.
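To separate the transport from Ray, one idea would be a small standalone test,
independent of Ray: each rank sends a buffer filled with a known pattern to its
neighbour and verifies what it receives, run with and without '--mca mtl ^psm'.
A rough sketch of such a test (hypothetical, not code from Ray):

/* Hypothetical standalone integrity test (not code from Ray): each rank sends a
 * buffer filled with a known pattern to the next rank and verifies the data it
 * receives from the previous rank. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define COUNT 4096
#define ITERATIONS 10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dest = (rank + 1) % size;
    int src = (rank - 1 + size) % size;
    int *sendbuf = malloc(COUNT * sizeof(int));
    int *recvbuf = malloc(COUNT * sizeof(int));

    for (int iter = 0; iter < ITERATIONS; iter++) {
        for (int i = 0; i < COUNT; i++)
            sendbuf[i] = rank * COUNT + i + iter; /* known pattern */

        MPI_Sendrecv(sendbuf, COUNT, MPI_INT, dest, 0,
                     recvbuf, COUNT, MPI_INT, src, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int i = 0; i < COUNT; i++)
            if (recvbuf[i] != src * COUNT + i + iter) {
                fprintf(stderr, "rank %d: corrupted data at iteration %d, index %d\n",
                        rank, iter, i);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
    }

    if (rank == 0)
        printf("no corruption detected after %d iterations\n", ITERATIONS);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and run under both transports, a failure that appears only
with the PSM MTL enabled would point at the transport rather than at Ray.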
Open-MPI configure command
module load gcc/4.5.3
./configure \
--prefix=/sb/project/nne-790-ab/software/Open-MPI/1.6/Build \
--with-openib \
--with-psm \
--with-tm=/software/tools/torque/ \
| tee configure.log
Versions
module load gcc/4.5.3
module load /sb/project/nne-790-ab/software/modulefiles/mpi/Open-MPI/1.6
module load /sb/project/nne-790-ab/software/modulefiles/apps/ray/2.0.0
PSM parameters
guillimin> ompi_info -a|grep psm
    MCA mtl: psm (MCA v2.0, API v2.0, Component v1.6)
    MCA mtl: parameter "mtl_psm_connect_timeout" (current value: <180>, data source: default value)
    MCA mtl: parameter "mtl_psm_debug" (current value: <1>, data source: default value)
    MCA mtl: parameter "mtl_psm_ib_unit" (current value: <-1>, data source: default value)
    MCA mtl: parameter "mtl_psm_ib_port" (current value: <0>, data source: default value)
    MCA mtl: parameter "mtl_psm_ib_service_level" (current value: <0>, data source: default value)
    MCA mtl: parameter "mtl_psm_ib_pkey" (current value: <32767>, data source: default value)
    MCA mtl: parameter "mtl_psm_ib_service_id" (current value: <0x1000117500000000>, data source: default value)
    MCA mtl: parameter "mtl_psm_path_query" (current value: <none>, data source: default value)
    MCA mtl: parameter "mtl_psm_priority" (current value: <0>, data source: default value)
Thank you.
Sébastien Boisvert
Jeff Squyres wrote:
The Open MPI 1.4 series is now deprecated. Can you upgrade to Open MPI 1.6?
On Jun 29, 2012, at 9:02 AM, Sébastien Boisvert wrote:
I am using Open-MPI 1.4.3 compiled with gcc 4.5.3.
The library:
/usr/lib64/libpsm_infinipath.so.1.14: ELF 64-bit LSB shared object, AMD x86-64, version 1 (SYSV), not stripped
Jeff Squyres wrote:
Yes, PSM is the native transport for InfiniPath. It is faster than the
InfiniBand verbs support on the same hardware.
What version of Open MPI are you using?
On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
Hello,
I am getting random crashes (segmentation faults) on a supercomputer (guillimin)
using 3 nodes with 12 cores per node. The same program (Ray) runs without any
problem on the other supercomputers I use.
The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA", and
the messages transit over Performance Scaled Messaging (PSM), which I believe is
a replacement for InfiniBand verbs, although I am not sure.
Adding '--mca mtl ^psm' to the Open-MPI mpiexec options solves the problem, but
it increases the latency from 20 microseconds to 55 microseconds.
There seems to be some sort of message corruption in transit, but I cannot rule
out other explanations.
I have no idea what is going on or why disabling PSM solves the problem.
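For the latency comparison, a simple two-rank ping-pong run with and without
'--mca mtl ^psm' is one way to see the difference. A rough, illustrative sketch
(not the exact measurement; assumes at least 2 ranks):

/* Illustrative ping-pong latency sketch: rank 0 and rank 1 exchange a 1-byte
 * message many times; half the average round-trip time approximates the
 * one-way latency. Ranks other than 0 and 1 simply idle. */
#include <mpi.h>
#include <stdio.h>

#define REPS 100000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        printf("one-way latency: %.2f microseconds\n",
               (MPI_Wtime() - start) / REPS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}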
Versions
module load gcc/4.5.3
module load openmpi/1.4.3-gcc
Command that randomly crashes
mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
Command that completes successfully
mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
Sébastien Boisvert