On 1 Sep 2010, at 23:32, Jaison Mulerikkal wrote:
> Hi,
>
> I am getting interested in this thread.
>
> I'm looking for a way to redirect a task/message (MPI_Send) addressed to a
> particular process (say rank 1), and waiting in the queue at rank 1, to
> another process (say rank 2), if t
On 2 Sep 2010, at 15:56, Brock Palen wrote:
> Ashly still having trouble using padb with openmpi/1.4.2
>
> [dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
> [nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp:
> Communication retries exceeded. Can not communicate
Ashly still having trouble using padb with openmpi/1.4.2
[dianawon@nyx0862 ~]$ /home/software/rhel5/padb/3.0/padb -a -Q
[nyx0862.engin.umich.edu:30717] [[16608,0],0]-[[25542,0],0] oob-tcp:
Communication retries exceeded. Can not communicate with peer
[nyx0862.engin.umich.edu:30717] [[16608,0],0]
Ah ok, I put it there just because the user couldn't read that from my home
space, and never even thought of that. gahhh.
Thanks,
BTW I tried joining the padb mailing list.
Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985
On Sep 1, 2010, at 6:11
Hi,
I am getting interested in this thread.
I'm looking for a way to redirect a task/message (MPI_Send) addressed to a
particular process (say rank 1), and waiting in the queue at rank 1, to
another process (say rank 2) if the queue at rank 1 is longer.
How can I do it?
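One way to picture the redirect rule asked about here is a small simulation in plain Python, with no MPI calls at all. As far as I know, standard MPI gives a sender no call to inspect how many messages are waiting at another rank, so this sketch assumes the application tracks queue lengths itself. The names `choose_rank` and `dispatch` and the dictionary of queue lengths are hypothetical, made up for illustration.

```python
def choose_rank(queue_lengths, preferred, fallback):
    """Return `fallback` if the preferred rank's queue is longer, else `preferred`."""
    if queue_lengths[preferred] > queue_lengths[fallback]:
        return fallback
    return preferred


def dispatch(tasks, queue_lengths, preferred=1, fallback=2):
    """Assign each task to a rank and update the tracked queue lengths.

    `queue_lengths` maps rank -> number of tasks currently queued there.
    It is application-level bookkeeping (an assumption of this sketch),
    not something read back from the MPI library.
    """
    placement = []
    for task in tasks:
        rank = choose_rank(queue_lengths, preferred, fallback)
        queue_lengths[rank] += 1
        placement.append((task, rank))
    return placement


if __name__ == "__main__":
    # Rank 1 already has two queued tasks, rank 2 has none.
    for task, rank in dispatch(["a", "b", "c"], {1: 2, 2: 0}):
        print(f"task {task} -> rank {rank}")
```

In a real MPI program the chosen rank would become the destination argument of the send, and the workers would need some way to report their queue lengths back to whoever is dispatching.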
First of all, I
The padb executable (it's a Perl script) needs to exist on all nodes because it
calls orterun on itself. Try installing it to a shared directory or copying
padb to /tmp on every node.
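The "copy padb to /tmp on every node" option could be scripted along these lines. This is only a sketch: the helper name `padb_copy_commands` and the host names are hypothetical, and it assumes `scp` works between the nodes without a password prompt. It only builds and prints the commands, so nothing is copied unless you run them yourself.

```python
import shlex


def padb_copy_commands(hosts, src, dest="/tmp/padb"):
    """Return one scp command (as an argument list) per unique host.

    Duplicate host entries, as found in node files that list one line
    per core, are collapsed while keeping first-seen order.
    """
    seen = []
    for host in hosts:
        host = host.strip()
        if host and host not in seen:
            seen.append(host)
    # -p asks scp to preserve the file mode, so the script stays executable.
    return [["scp", "-p", src, f"{host}:{dest}"] for host in seen]


if __name__ == "__main__":
    # Hypothetical host names and source path; a real list would come
    # from the job's node file.
    for cmd in padb_copy_commands(["node01", "node01", "node02"], "./padb"):
        print(shlex.join(cmd))
```

To actually perform the copy, each command list could be passed to `subprocess.run(cmd, check=True)`.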
To access the message queues, padb needs a compiled helper program, which is
installed in $PREFIX/lib, so I would recomme
We have DDT, but we do not have enough licenses to attach to the number of
cores these jobs run on.
I tried padb, but it fails.
Example:
ssh to root node for running MPI job:
/tmp/padb -Q -a
[nyx0862.engin.umich.edu:25054] [[22211,0],0]-[[25542,0],0] oob-tcp:
Communication retries exceeded. Can n
On 1 Sep 2010, at 21:13, Brock Palen wrote:
> I have a code for a user (NAMD, if anyone cares) that will lock up in one
> specific case. A quick ltrace shows the processes calling Iprobe over and
> over, which makes me think that a process somewhere is blocked on communication.
>
> What is