Hi all,
I have successfully installed and configured openmpi to perform
checkpointing using the BLCR mechanism. However, I now want to try
checkpointing using self.
Has anyone done that? If so, I would very much appreciate it if one of you could
send me the steps necessary to enable self
Dear all,
I developed an application in C++ using openmpi. This application should
internally start (via a system call) another application, which is also
developed in C++ with openmpi. When this external application is called with
the C system function, the following messages are shown:
[localhost.localdom
Yeah, I've started seeing this on clusters where the TCP stack is a
little congested. We default to trying 60 times to send a message, but
it is done in rapid succession and doesn't really provide a lot of time.
Try setting -mca oob_tcp_peer_retries 1000 (or some number much bigger
than 60)
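If you would rather not change the mpirun command line, a minimal sketch of the same setting is below. It assumes the usual Open MPI convention that an MCA parameter can also be supplied through an environment variable named OMPI_MCA_<parameter>. The launch line is left as a comment because the process count and ./my_app are hypothetical placeholders, not anything from your job.

```shell
# Same effect as passing "-mca oob_tcp_peer_retries 1000" to mpirun,
# assuming the standard OMPI_MCA_<parameter> environment-variable convention.
export OMPI_MCA_oob_tcp_peer_retries=1000

# Hypothetical launch line; "-np 4" and "./my_app" are placeholders.
# mpirun -np 4 ./my_app
```

The variable has to be exported in the shell (or job script) that invokes mpirun for it to be picked up.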
Just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR" but instead
of killing the job it showed this and never died.
I have never seen this happen before.
openmpi 1.3.2, built with Intel 10.1
This binary is used a lot (+50% of th