Maybe I know why now, but it's not pleasant: two machines in the same
cluster have their Ethernet interfaces configured like this:
Machine s0157
eth2 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A8
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:233 Base address:0x6000
eth3 Link encap:Ethernet HWaddr 00:1E:68:DA:74:A9
inet addr:10.173.128.13 Bcast:10.173.255.255 Mask:255.255.0.0
inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5780065692 (5512.3 Mb) TX bytes:59140357016 (56400.6 Mb)
Interrupt:50 Base address:0x8000
Machine s0158
eth0 Link encap:Ethernet HWaddr 00:23:8B:42:10:A9
inet addr:7.8.82.158 Bcast:7.8.255.255 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:233 Base address:0x6000
eth1 Link encap:Ethernet HWaddr 00:23:8B:42:10:AA
inet addr:10.173.128.14 Bcast:10.173.255.255 Mask:255.255.0.0
inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5879861483 (5607.4 Mb) TX bytes:2406041840 (2294.5 Mb)
Interrupt:50 Base address:0x8000
Apart from the interfaces having different names (this happens when
installing SuSE SLES 10 SP2 on apparently similar machines), I notice
there's a private Ethernet interface on s0158 at IP 7.8.82.158 - I guess
this was used. How do you exclude it when the interface names vary?
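One approach, assuming Open MPI's standard TCP interface-selection MCA
parameters (a sketch only; how the exclude list behaves when a named
interface is absent on a given host is worth verifying), is to exclude
the unwanted interfaces by name on both the out-of-band and BTL layers:

  mpirun --mca oob_tcp_if_exclude lo,eth0,eth2 \
         --mca btl_tcp_if_exclude lo,eth0,eth2 \
         ...rest of the command as before...

Since the names vary per host (eth0 on s0158, eth2 on s0157), an include
list naming only the interfaces on the 10.173.0.0/16 cluster network,
e.g. --mca oob_tcp_if_include eth1,eth3, may be more robust.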
DM
On Tue, 10 Mar 2009, Ralph Castain wrote:
Not really. I've run much bigger jobs than this without problem, so I don't
think there is a fundamental issue here.
It looks like the TCP fabric between the various nodes is breaking down. I
note in the enclosed messages that the problems are all with comm between
daemons 4 and 21. We keep trying to get through, but failing.
I can fix things so we don't endlessly loop when that happens (IIRC, I think
we are already supposed to abort, but it appears that isn't working). But the
real question is why the comm fails in the first place.
On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
Latest status - 1.4a1r20757 (yesterday): the job now starts with a
little output but quickly runs into trouble with a lot of
'oob-tcp: Communication retries exceeded. Can not communicate with peer'
errors.
e.g.
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
[s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded. Can not communicate with peer
The nodes are O.K. ...
Any ideas folks?
DM
On Sat, 28 Feb 2009, Ralph Castain wrote:
I think I have this figured out - will fix on Monday. I'm not sure why
Jeff's conditions are all required, especially the second one. However,
the fundamental problem is that we pull information from two sources
regarding the number of procs in the job when unpacking a buffer, and the
two sources appear to be out-of-sync with each other in certain scenarios.
The details are beyond the user list. I'll respond here again once I get
it fixed.
Ralph
On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
Unfortunately, I think I have reproduced the problem as well -- with SVN
trunk HEAD (r20655):
[15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
[svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack
failed in file base/odls_base_default_fns.c at line 566
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Notice that I'm not trying to run an MPI app -- it's just "uptime".
The following things seem to be necessary to make this error occur for
me:
1. --bynode
2. set some mca parameter (any mca parameter)
3. -np value less than the size of my slurm allocation
If I remove any of those, it seems to run fine.
On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
With further investigation, I have reproduced this problem. I think I
was originally testing against a version that was not recent enough. I
do not see it with r20594, which is from February 19. So something must
have happened over the last 8 days. I will try to narrow down the
issue.
Rolf
On 02/27/09 09:34, Rolf Vandevaart wrote:
I just tried trunk-1.4a1r20458 and I did not see this error, although
my configuration was rather different. I ran across 100 2-CPU SPARC
nodes, np=256, connected with TCP.
Hopefully George's comment helps out with this issue.
One other thought: to see whether SGE has anything to do with this,
create a hostfile and run the job outside of SGE.
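A minimal sketch (the hostnames and slot counts here are placeholders,
not taken from the thread):

  cat > myhosts <<EOF
  s0157 slots=8
  s0158 slots=8
  EOF
  mpirun --hostfile myhosts -np 16 uptime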
Rolf
On 02/26/09 22:10, Ralph Castain wrote:
FWIW: I tested the trunk tonight using both SLURM and rsh launchers,
and everything checks out fine. However, this is running under SGE and
thus using qrsh, so it is possible the SGE support is having a
problem.
Perhaps one of the Sun OMPI developers can help here?
Ralph
On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
It looks like the system doesn't know what nodes the procs are to be
placed upon. Can you run this with --display-devel-map? That will
tell us where the system thinks it is placing things.
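For example (only the flag is new; everything else on the command line
stays as before):

  mpirun --display-devel-map -np 256 --mca btl sm,openib,self ... namd2 stmv.namd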
Thanks
Ralph
On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
Maybe it's my pine mailer.
This is a NAMD run on 256 procs across 32 dual-socket quad-core AMD
Shanghai nodes running a standard benchmark called stmv.
The basic error message, which occurs 31 times, looks like:
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
The mpirun command has long paths in it, sorry. It invokes a special
binding script which in turn launches the NAMD run. This works with an
older SVN at level 1.4a1r20123 (for 16, 32, 64, 128 and 512 procs) but
not for this 256-proc run, where the older SVN hangs indefinitely
polling some completion (sm or openib). So I was trying later SVNs with
this 256-proc run, hoping the error would go away.
Here's some of the invocation again. Hope you can read it:
EAGER_SIZE=32767
export OMPI_MCA_btl_openib_use_eager_rdma=0
export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
and, unexpanded
mpirun --prefix $PREFIX -np %PE% $MCA -x
OMPI_MCA_btl_openib_use_eager_rdma -x
OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER $NAMD2
stmv.namd
and, expanded
mpirun --prefix
/tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
-np 256 --mca btl sm,openib,self -x
OMPI_MCA_btl_openib_use_eager_rdma -x
OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit -x
OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48292.1.all.q/newhosts
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
stmv.namd
This is all via Sun Grid Engine.
The OS as indicated above is SuSE SLES 10 SP2.
DM
On Thu, 26 Feb 2009, Ralph Castain wrote:
I'm sorry, but I can't make any sense of this message. Could you
provide a
little explanation of what you are doing, what the system looks
like, what is
supposed to happen, etc? I can barely parse your cmd line...
Thanks
Ralph
On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
Today's and yesterday's.
1.4a1r20643_svn
+ mpirun --prefix
/tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
-np 256 --mca btl sm,openib,self -x OMPI_MCA_btl_openib_use_eager_rdma
-x OMPI_MCA_btl_openib_eager_limit -x OMPI_MCA_btl_self_eager_limit
-x OMPI_MCA_btl_sm_eager_limit -machinefile /tmp/48269.1.all.q/newhosts
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
/ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
stmv.namd
[s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
[s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
Built with Intel compilers 10.1.015.
Regards,
Mostyn
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================
--
Jeff Squyres
Cisco Systems