Hello,
Just to give an update to the list:

Today, I implemented message data integrity verification in my code using the CRC32 algorithm.
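For reference, here is a minimal sketch of the kind of check I added; the helper names and the layout (the CRC32 carried in one extra MessageUnit at the end of the payload) are illustrative assumptions, not the exact RayPlatform code. It uses crc32() from zlib:

#include <zlib.h>    // crc32()
#include <stdint.h>
#include <cstdio>

typedef uint64_t MessageUnit; // sizeof(MessageUnit) == 8, as in the report

// Sender side: append a CRC32 of the payload as one extra MessageUnit,
// so the count actually transmitted becomes count + 1.
void appendChecksum(MessageUnit* buffer, int count) {
    uint32_t sum = crc32(0L, (const Bytef*)buffer, count * sizeof(MessageUnit));
    buffer[count] = sum;
}

// Receiver side: recompute the CRC32 on the payload and compare it with
// the checksum carried in the last MessageUnit.
bool verifyChecksum(MessageUnit* buffer, int countWithChecksum) {
    int count = countWithChecksum - 1;
    uint32_t expected = (uint32_t)buffer[count];
    uint32_t actual = crc32(0L, (const Bytef*)buffer, count * sizeof(MessageUnit));
    if (expected != actual)
        fprintf(stderr, "Expected checksum (CRC32): %x, actual checksum (CRC32): %x\n",
                expected, actual);
    return expected == actual;
}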
Without PSM, everything runs fine.
With PSM, I get these errors:
Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
Source: 3
Destination: 3
sizeof(MessageUnit): 8
Count (excluding checksum): 1
Expected checksum (CRC32): ea
Actual checksum (CRC32): 4f3b6143

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_GET_READ_MATE
Source: 4
Destination: 4
sizeof(MessageUnit): 8
Count (excluding checksum): 1
Expected checksum (CRC32): f4240
Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE_REPLY
Source: 5
Destination: 5
sizeof(MessageUnit): 8
Count (excluding checksum): 7
Expected checksum (CRC32): dd94edd5

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT
Source: 5
Destination: 5
sizeof(MessageUnit): 8
Count (excluding checksum): 2
Expected checksum (CRC32): e80f2c45
Actual checksum (CRC32): 0

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_GET_VERTEX_EDGES_COMPACT_REPLY
Source: 5
Destination: 5
sizeof(MessageUnit): 8
Count (excluding checksum): 2
Expected checksum (CRC32): 42
Actual checksum (CRC32): a906f61

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_COVERAGE
Source: 12
Destination: 12
sizeof(MessageUnit): 8
Count (excluding checksum): 3
Expected checksum (CRC32): 5b6f1504
Actual checksum (CRC32): d5b3049a

Error: RayPlatform detected a message corruption !
Tag: RAY_MPI_TAG_REQUEST_VERTEX_READS
Source: 27
Destination: 27
sizeof(MessageUnit): 8
Count (excluding checksum): 5
Expected checksum (CRC32): fc01eda4
Actual checksum (CRC32): 0
I guess this is where the Open MPI 'dr' (data reliability) PML (point-to-point messaging layer) would be helpful.
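If I understand correctly, selecting that PML should just be a matter of an MCA parameter, something like (untested on my side):

mpiexec --mca pml dr -n 36 Ray ...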
I now have an open case with QLogic support.
Thank you for your help.
Jeff Squyres wrote:
Yes, PSM is the native transport for InfiniPath. It is faster than the
InfiniBand verbs support on the same hardware.
What version of Open MPI are you using?
On Jun 28, 2012, at 10:03 PM, Sébastien Boisvert wrote:
Hello,
I am getting random crashes (segmentation faults) on a supercomputer (guillimin) using 3 nodes with 12 cores per node. The same program (Ray) runs without any problem on the other supercomputers I use.

The interconnect is "InfiniBand: QLogic Corp. InfiniPath QME7342 QDR HCA" and the messages transit using "Performance Scaled Messaging" (PSM), which I think is some sort of replacement for InfiniBand verbs, although I am not sure.
Adding '--mca mtl ^psm' to the Open MPI mpiexec options solves the problem, but increases the latency from 20 microseconds to 55 microseconds.

There seems to be some sort of message corruption in transit, but I cannot rule out other explanations. I have no idea what is going on or why disabling PSM solves the problem.
Versions
module load gcc/4.5.3
module load openmpi/1.4.3-gcc
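One way to check which MTL components this Open MPI build ships with is ompi_info, for example:

ompi_info | grep -i "MCA mtl"

On a build with InfiniPath support this should list a psm component among the results.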
Command that randomly crashes
mpiexec -n 36 -output-filename MiSeq-bug-2012-06-28.1 \
Ray -k 31 \
-o MiSeq-bug-2012-06-28.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
Command that completes successfully
mpiexec -n 36 -output-filename psm-bug-2012-06-26-hotfix.1 \
--mca mtl ^psm \
Ray -k 31 \
-o psm-bug-2012-06-26-hotfix.1 \
-p \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R1.fastq \
data-for-system-tests/ecoli-MiSeq/MiSeq_Ecoli_MG1655_110527_R2.fastq
Sébastien Boisvert