Ugh! If you don't get to this by Friday and I'm able to get the XGrid bug
knocked out quickly, I'll take a look. I remember being worried about
that case when I fixed up the OOB connection code, but thought I had convinced
myself it was right. Apparently not - I wonder if I got a loop wrong and
No problem - glad we could help!
However, I am going to file this as a bug. The oob is supposed to
cycle through -all- the available interfaces when attempting to form a
connection to a remote process, and select the one that allows it to
connect. It shouldn't have "fixated" on the first one.
Yes, -mca oob_tcp_if_exclude eth0 worked O.K., even though some
machines have no eth0.
Thanks,
DM
On Tue, 10 Mar 2009, Ralph Castain wrote:
Ick. We don't have a way currently to allow you to ignore an interface on a
node-by-node basis. If you do:
-mca oob_tcp_if_exclude eth0
we will exclude that private Ethernet.
I queued up a job to try this - will let you know.
I do have the authority to ifdown those rogue eth0 interfaces, as they are
only an artifact of our install (no cables), and will do that afterwards.
Thanks.
On Tue, 10 Mar 2009, Ralph Castain wrote:
Ick. We don't have a way currently to allow you to ignore an interface on a
node-by-node basis.
You *could* have a per-machine mca param config file that could be
locally staged on each machine and setup with the exclude for whatever
you need on *that* node. Ugly, but it could work...?
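For example (a sketch only; this uses the standard per-user MCA param file
location, and the interface name is just whatever that particular node needs
dropped), the locally staged file could look like:

  # $HOME/.openmpi/mca-params.conf on the node that needs it
  oob_tcp_if_exclude = eth0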
On Mar 10, 2009, at 4:26 PM, Ralph Castain wrote:
Ick. We don't have a way currently to allow you to ignore an interface on a
node-by-node basis.
Ick. We don't have a way currently to allow you to ignore an interface
on a node-by-node basis. If you do:
-mca oob_tcp_if_exclude eth0
we will exclude that private Ethernet. The catch is that we will
exclude "eth0" on -every- node. On the two machines you note here,
that will still let us
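For reference, the flag just rides along on the normal mpirun command line;
the process count and application below are placeholders, not taken from
this thread:

  mpirun -np 256 -mca oob_tcp_if_exclude eth0 ./my_app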
Maybe I know why now, but it's not pleasant. E.g., two machines in the same
cluster have their Ethernet interfaces set up like this:
Machine s0157
eth2      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
Not really. I've run much bigger jobs than this without problem, so I
don't think there is a fundamental issue here.
It looks like the TCP fabric between the various nodes is breaking
down. I note in the enclosed messages that the problems are all with
comm between daemons 4 and 21. We keep
Latest status - 1.4a1r20757 (yesterday):
the job now starts with a little output but quickly runs into trouble with
a lot of
'oob-tcp: Communication retries exceeded. Can not communicate with peer '
errors, e.g.:
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.
I think I have this figured out - will fix on Monday. I'm not sure why
Jeff's conditions are all required, especially the second one.
However, the fundamental problem is that we pull information from two
sources regarding the number of procs in the job when unpacking a
buffer, and the two sources disagree.
Unfortunately, I think I have reproduced the problem as well -- with
SVN trunk HEAD (r20655):
[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack
failed in file base/odls_base_default_fns.c at line 566
---
With further investigation, I have reproduced this problem. I think I
was originally testing against a version that was not recent enough. I
do not see it with r20594, which is from February 19. So, something must
have happened over the last 8 days. I will try and narrow down the issue.
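One way to bracket it (a sketch; the repository URL and exact build steps
here are assumptions, not from this thread):

  svn co -r 20594 http://svn.open-mpi.org/svn/ompi/trunk ompi-r20594
  svn co -r 20655 http://svn.open-mpi.org/svn/ompi/trunk ompi-r20655
  # configure and build both the same way, then re-run the failing
  # mpirun command against each to pin down the revision range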
Rolf
I just tried trunk-1.4a1r20458 and I did not see this error, although my
configuration was rather different. I ran across 100 2-CPU sparc nodes,
np=256, connected with TCP.
Hopefully George's comment helps out with this issue.
One other thought to see whether SGE has anything to do with this
The last time I got such an error was when the shared libraries on my head
node didn't match the ones loaded by the compute nodes. It was a simple
LD_LIBRARY_PATH mistake on my part. And it was the last time I didn't
build my tree with --enable-mpirun-prefix-by-default.
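In other words, something along these lines (the prefix path and node name
are placeholders), plus a quick sanity check that the compute nodes resolve
the same libraries:

  ./configure --prefix=/opt/openmpi --enable-mpirun-prefix-by-default
  make all install
  # spot-check a compute node:
  ssh node01 'ldd `which orted` | grep open-rte'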
george.
On Feb 26, 2
FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
and everything checks out fine. However, this is running under SGE and
thus using qrsh, so it is possible the SGE support is having a problem.
Perhaps one of the Sun OMPI developers can help here?
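For context, forcing a particular launcher for that kind of test looks
roughly like this (the component names come from the ORTE plm framework;
the command itself is just a placeholder):

  mpirun -mca plm slurm -np 2 hostname
  mpirun -mca plm rsh -np 2 hostname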
Ralph
On Feb 26, 2009, at
It looks like the system doesn't know what nodes the procs are to be
placed upon. Can you run this with --display-devel-map? That will tell
us where the system thinks it is placing things.
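I.e., just add it to the mpirun line you're already using; everything else
below is a placeholder:

  mpirun --display-devel-map -np 256 ./my_app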
Thanks
Ralph
On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
Maybe it's my pine mailer.
This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.
Maybe it's my pine mailer.
This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.
The basic error message, which occurs 31 times, is like:
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file
../../../.././orte/mca/o
I'm sorry, but I can't make any sense of this message. Could you
provide a little explanation of what you are doing, what the system
looks like, what is supposed to happen, etc? I can barely parse your
cmd line...
Thanks
Ralph
On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
Today's and