Hello,

Just a quick update for the list:

Today I implemented message integrity verification in my code using
the CRC32 algorithm.

Without PSM, everything runs fine.

With PSM, I get these errors:

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
 Source: 3
 Destination: 3
 sizeof(MessageUnit): 8
 Count (excluding checksum): 1
 Expected checksum (CRC32): ea
 Actual checksum (CRC32): 4f3b6143

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_GET_READ_MATE
 Source: 4
 Destination: 4
 sizeof(MessageUnit): 8
 Count (excluding checksum): 1
 Expected checksum (CRC32): f4240
 Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
 Source: 5
 Destination: 5
 sizeof(MessageUnit): 8
 Count (excluding checksum): 7
 Expected checksum (CRC32): dd94edd5

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT
 Source: 5
 Destination: 5
 sizeof(MessageUnit): 8
 Count (excluding checksum): 2
 Expected checksum (CRC32): e80f2c45
 Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT_REPLY
 Source: 5
 Destination: 5
 sizeof(MessageUnit): 8
 Count (excluding checksum): 2
 Expected checksum (CRC32): 42
 Actual checksum (CRC32): a906f61

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE
 Source: 12
 Destination: 12
 sizeof(MessageUnit): 8
 Count (excluding checksum): 3
 Expected checksum (CRC32): 5b6f1504
 Actual checksum (CRC32): d5b3049a

Error: RayPlatform detected a message corruption !
 Tag: RAY_MPI_TAG_REQUEST_VERTEX_READS
 Source: 27
 Destination: 27
 sizeof(MessageUnit): 8
 Count (excluding checksum): 5
 Expected checksum (CRC32): fc01eda4
 Actual checksum (CRC32): 0


I guess this is where the Open MPI 'dr' (data reliability) PML
(point-to-point messaging layer) component would be helpful.
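If I understand correctly, that would mean selecting the dr component at run time, something like this (untested on my end):

```shell
# Select the 'dr' (data reliability) PML component instead of the default,
# so point-to-point messages are checksummed and retransmitted on failure.
mpiexec --mca pml dr -n 36 Ray -k 31 ...
```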


I now have an open case with QLogic support.


Thank you for your help.


Jeff Squyres wrote:
Yes, PSM is the native transport for InfiniPath.  It is faster than the 
InfiniBand verbs support on the same hardware.

What version of Open MPI are you using?


On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:

Hello,

I am getting random crashes (segmentation faults) on a supercomputer
(guillimin) using 3 nodes with 12 cores per node. The same program (Ray)
runs without any problem on the other supercomputers I use.

The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA",
and the messages transit using "performance scaled messaging" (PSM), which
I think is some sort of replacement for InfiniBand verbs, although I am not sure.

Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves
the problem, but increases the latency from 20 microseconds to 55 microseconds.

There seems to be some sort of message corruption in transit, but I
cannot rule out other explanations.


I have no idea what is going on and why disabling PSM solves the problem.


Versions

module load gcc/4.5.3
module load openmpi/1.4.3-gcc


Command that randomly crashes

mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq


Command that completes successfully

mpiexec -n 36 -output-filename  psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq



Sébastien Boisvert
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

