Re: [OMPI users] Latest SVN failures

2009-03-11 Thread Brian W. Barrett
Ugh! If you don't get to this by Friday and I'm able to get the XGrid bug knocked out quickly, I'll take a look. I remember being worried about that case when I fixed up the OOB connection code, but thought I convinced myself it was right. Apparently not - I wonder if I got a loop wrong and

Re: [OMPI users] Latest SVN failures

2009-03-11 Thread Ralph Castain
No problem - glad we could help! However, I am going to file this as a bug. The oob is supposed to cycle through -all- the available interfaces when attempting to form a connection to a remote process, and select the one that allows it to connect. It shouldn't have "fixated" on the first on
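
The behaviour described above, trying every candidate interface in turn rather than fixating on the first one, can be sketched roughly as follows. This is only an illustration in Python and not Open MPI's OOB code; the function name `connect_first_reachable` and the shape of the candidate list are assumptions introduced here.

```python
import socket


def connect_first_reachable(candidates, timeout=1.0):
    """Try each (host, port) in order and return (socket, address) for the
    first one that accepts a connection.

    Hypothetical helper: raises OSError describing every failure if none of
    the candidates is reachable.
    """
    errors = []
    for address in candidates:
        try:
            sock = socket.create_connection(address, timeout=timeout)
            return sock, address
        except OSError as exc:
            # Do not fixate on this address: record the failure and move on.
            errors.append((address, exc))
    raise OSError(f"no candidate reachable: {errors}")
```

A caller would pass one address per interface known for the remote peer; an unreachable first entry (such as an uncabled eth0) then only costs one timeout instead of failing the connection outright.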

Re: [OMPI users] Latest SVN failures

2009-03-11 Thread Mostyn Lewis
Yes, -mca oob_tcp_if_exclude eth0, worked O.K., even though some machines have no eth0. Thanks, DM On Tue, 10 Mar 2009, Ralph Castain wrote: Ick. We don't have a way currently to allow you to ignore an interface on a node-by-node basis. If you do: -mca oob_tcp_if_exclude eth0 we will exclud

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Mostyn Lewis
I queued up a job to try this - will let you know. I do have the authority to ifdown those rogue eth0 as they are only an artifact of our install (no cables) and will do that afterwards. Thanks. On Tue, 10 Mar 2009, Ralph Castain wrote: Ick. We don't have a way currently to allow you to ignore

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Jeff Squyres
You *could* have a per-machine mca param config file that could be locally staged on each machine and setup with the exclude for whatever you need on *that* node. Ugly, but it could work...? On Mar 10, 2009, at 4:26 PM, Ralph Castain wrote: Ick. We don't have a way currently to allow you
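
As a rough sketch of the per-machine file idea: Open MPI reads MCA parameters as `name = value` lines from files such as `openmpi-mca-params.conf` under the installation's `etc` directory, so a small script could render a different file for each node. The helper `render_mca_params` and the host names below are hypothetical, and staging the resulting files onto each machine is left out.

```python
def render_mca_params(params):
    """Render MCA parameters as 'name = value' lines, one per parameter.

    List or tuple values are joined with commas, which is the form that
    interface lists take. An empty mapping renders as an empty string.
    """
    lines = []
    for name, value in params.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        lines.append(f"{name} = {value}")
    return "".join(line + "\n" for line in lines)


# Hypothetical per-node settings: only nodes with a rogue eth0 get an exclude.
per_node = {
    "node-with-rogue-eth0": {"oob_tcp_if_exclude": ["eth0"]},
    "node-without-eth0": {},
}

for host, params in per_node.items():
    print(f"# {host}")
    print(render_mca_params(params), end="")
```

Keeping the exclude out of the `mpirun` command line is what makes the setting node-specific; the command-line form applies the same exclude on every node.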

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Ralph Castain
Ick. We don't have a way currently to allow you to ignore an interface on a node-by-node basis. If you do: -mca oob_tcp_if_exclude eth0 we will exclude that private Ethernet. The catch is that we will exclude "eth0" on -every- node. On the two machines you note here, that will still let us

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Mostyn Lewis
Maybe I know why now but it's not pleasant, e.g. 2 machines in the same cluster have their ethernets such as: Machine s0157 eth2 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A8 BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Ralph Castain
Not really. I've run much bigger jobs than this without problem, so I don't think there is a fundamental issue here. It looks like the TCP fabric between the various nodes is breaking down. I note in the enclosed messages that the problems are all with comm between daemons 4 and 21. We keep

Re: [OMPI users] Latest SVN failures

2009-03-10 Thread Mostyn Lewis
Latest status - 1.4a1r20757 (yesterday); the job now starts with a little output but quickly runs into trouble with a lot of 'oob-tcp: Communication retries exceeded. Can not communicate with peer ' errors? e.g. [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.

Re: [OMPI users] Latest SVN failures

2009-02-28 Thread Ralph Castain
I think I have this figured out - will fix on Monday. I'm not sure why Jeff's conditions are all required, especially the second one. However, the fundamental problem is that we pull information from two sources regarding the number of procs in the job when unpacking a buffer, and the two s

Re: [OMPI users] Latest SVN failures

2009-02-27 Thread Jeff Squyres
Unfortunately, I think I have reproduced the problem as well -- with SVN trunk HEAD (r20655): [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566 ---

Re: [OMPI users] Latest SVN failures

2009-02-27 Thread Rolf Vandevaart
With further investigation, I have reproduced this problem. I think I was originally testing against a version that was not recent enough. I do not see it with r20594 which is from February 19. So, something must have happened over the last 8 days. I will try and narrow down the issue. Rol

Re: [OMPI users] Latest SVN failures

2009-02-27 Thread Rolf Vandevaart
I just tried trunk-1.4a1r20458 and I did not see this error, although my configuration was rather different. I ran across 100 2-CPU sparc nodes, np=256, connected with TCP. Hopefully George's comment helps out with this issue. One other thought to see whether SGE has anything to do with thi

Re: [OMPI users] Latest SVN failures

2009-02-26 Thread George Bosilca
Last time I got such an error was when the shared libraries on my head node didn't match the ones loaded by the compute nodes. It was a simple LD_LIBRARY_PATH mistake on my part. And it was the last time I didn't build my tree with --enable-mpirun-prefix-by-default. george. On Feb 26, 2

Re: [OMPI users] Latest SVN failures

2009-02-26 Thread Ralph Castain
FWIW: I tested the trunk tonight using both SLURM and rsh launchers, and everything checks out fine. However, this is running under SGE and thus using qrsh, so it is possible the SGE support is having a problem. Perhaps one of the Sun OMPI developers can help here? Ralph On Feb 26, 2009, at

Re: [OMPI users] Latest SVN failures

2009-02-26 Thread Ralph Castain
It looks like the system doesn't know what nodes the procs are to be placed upon. Can you run this with --display-devel-map? That will tell us where the system thinks it is placing things. Thanks Ralph On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote: Maybe it's my pine mailer. This is a N

Re: [OMPI users] Latest SVN failures

2009-02-26 Thread Mostyn Lewis
Maybe it's my pine mailer. This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD Shanghai nodes running a standard benchmark called stmv. The basic error message, which occurs 31 times, is like: [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/o

Re: [OMPI users] Latest SVN failures

2009-02-26 Thread Ralph Castain
I'm sorry, but I can't make any sense of this message. Could you provide a little explanation of what you are doing, what the system looks like, what is supposed to happen, etc? I can barely parse your cmd line... Thanks Ralph On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote: Today's and