You *could* have a per-machine mca param config file that is locally staged on each machine and set up with the exclude for whatever interface you need to drop on *that* node. Ugly, but it could work...?
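
Something like this, staged locally on each node (the path and interface name are illustrative - use the params file of whatever install that node runs, and whatever its private Ethernet happens to be called):

   # <prefix>/etc/openmpi-mca-params.conf -- local copy on s0158
   # keep the OOB off the unused private Ethernet on this node only
   oob_tcp_if_exclude = eth0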

On Mar 10, 2009, at 4:26 PM, Ralph Castain wrote:

Ick. We don't have a way currently to allow you to ignore an interface
on a node-by-node basis. If you do:

-mca oob_tcp_if_exclude eth0

we will exclude that private Ethernet. The catch is that we will
exclude "eth0" on -every- node. On the two machines you note here,
that will still let us work - but I don't know if we will catch an
"eth0" on another node where we need it.

Can you give it a try and see if it works?
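
On the cmd line that would just be something like (illustrative - keep the rest of your options as they are):

   mpirun -np 256 -mca oob_tcp_if_exclude eth0 -machinefile $HOSTS $MPI_BINDER $NAMD2 stmv.namd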
Ralph

On Mar 10, 2009, at 2:13 PM, Mostyn Lewis wrote:

> Maybe I know why now, but it's not pleasant. E.g., 2 machines in the
> same cluster have their Ethernet interfaces set up like this:
>
> Machine s0157
>
> eth2      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A8
>          BROADCAST MULTICAST  MTU:1500  Metric:1
>          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>          Interrupt:233 Base address:0x6000
>
> eth3      Link encap:Ethernet  HWaddr 00:1E:68:DA:74:A9
>          inet addr:10.173.128.13  Bcast:10.173.255.255  Mask:255.255.0.0
>          inet6 addr: fe80::21e:68ff:feda:74a9/64 Scope:Link
>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>          RX packets:43777910 errors:16 dropped:0 overruns:0 frame:16
>          TX packets:21148848 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:5780065692 (5512.3 Mb)  TX bytes:59140357016 (56400.6 Mb)
>          Interrupt:50 Base address:0x8000
>
> Machine s0158
>
> eth0      Link encap:Ethernet  HWaddr 00:23:8B:42:10:A9
>          inet addr:7.8.82.158  Bcast:7.8.255.255  Mask:255.255.0.0
>          UP BROADCAST MULTICAST  MTU:1500  Metric:1
>          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>          Interrupt:233 Base address:0x6000
>
> eth1      Link encap:Ethernet  HWaddr 00:23:8B:42:10:AA
>          inet addr:10.173.128.14  Bcast:10.173.255.255  Mask:255.255.0.0
>          inet6 addr: fe80::223:8bff:fe42:10aa/64 Scope:Link
>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>          RX packets:39281716 errors:2 dropped:0 overruns:0 frame:2
>          TX packets:2674296 errors:0 dropped:0 overruns:0 carrier:0
>          collisions:0 txqueuelen:1000
>          RX bytes:5879861483 (5607.4 Mb)  TX bytes:2406041840 (2294.5 Mb)
>          Interrupt:50 Base address:0x8000
>
> Apart from the interfaces having different names on apparently similar
> machines (which happens when installing SuSE SLES 10 SP2), I notice
> there's a private Ethernet on s0158 at IP 7.8.82.158 - I guess this is
> the one that got used. How do you exclude it when the interface names
> vary from node to node?
>
> DM
>
>
> On Tue, 10 Mar 2009, Ralph Castain wrote:
>
>> Not really. I've run much bigger jobs than this without problem, so
>> I don't think there is a fundamental issue here.
>>
>> It looks like the TCP fabric between the various nodes is breaking
>> down. I note in the enclosed messages that the problems are all
>> with comm between daemons 4 and 21. We keep trying to get through,
>> but failing.
>>
>> I can fix things so we don't endlessly loop when that happens
>> (IIRC, I think we are already supposed to abort, but it appears
>> that isn't working). But the real question is why the comm fails in
>> the first place.
>>
>>
>> On Mar 10, 2009, at 10:50 AM, Mostyn Lewis wrote:
>>
>>> Latest status - 1.4a1r20757 (yesterday);
>>> the job now starts and produces a little output, but quickly runs into
>>> trouble with a lot of
>>> 'oob-tcp: Communication retries exceeded.  Can not communicate with peer'
>>> errors, e.g.:
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> [s0158:22513] [[41245,0],4]-[[41245,0],21] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>> The nodes are O.K. ...
>>> Any ideas folks?
>>> DM
>>> On Sat, 28 Feb 2009, Ralph Castain wrote:
>>>> I think I have this figured out - will fix on Monday. I'm not
>>>> sure why Jeff's conditions are all required, especially the
>>>> second one. However, the fundamental problem is that we pull
>>>> information from two sources regarding the number of procs in the
>>>> job when unpacking a buffer, and the two sources appear to be
>>>> out-of-sync with each other in certain scenarios.
>>>> The details are beyond the user list. I'll respond here again
>>>> once I get it fixed.
>>>> Ralph
>>>> On Feb 27, 2009, at 4:14 PM, Jeff Squyres wrote:
>>>>> Unfortunately, I think I have reproduced the problem as well --
>>>>> with SVN trunk HEAD (r20655):
>>>>> [15:12] svbu-mpi:~/mpi % mpirun --mca bogus foo --bynode -np 2 uptime
>>>>> [svbu-mpi.cisco.com:24112] [[62779,0],0] ORTE_ERROR_LOG: Data unpack failed in file base/odls_base_default_fns.c at line 566
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> Notice that I'm not trying to run an MPI app -- it's just
>>>>> "uptime".
>>>>> The following things seem to be necessary to make this error
>>>>> occur for me:
>>>>> 1. --bynode
>>>>> 2. set some mca parameter (any mca parameter)
>>>>> 3. -np value less than the size of my slurm allocation
>>>>> If I remove any of those, it seems to run fine.
>>>>> On Feb 27, 2009, at 5:05 PM, Rolf Vandevaart wrote:
>>>>>> With further investigation, I have reproduced this problem.  I
>>>>>> think I was originally testing against a version that was not
>>>>>> recent enough.  I do not see it with r20594 which is from
>>>>>> February 19.  So, something must have happened over the last 8
>>>>>> days.  I will try and narrow down the issue.
>>>>>> Rolf
>>>>>> On 02/27/09 09:34, Rolf Vandevaart wrote:
>>>>>>> I just tried trunk-1.4a1r20458 and I did not see this error,
>>>>>>> although my configuration was rather different.  I ran across
>>>>>>> 100 2-CPU sparc nodes, np=256, connected with TCP.
>>>>>>> Hopefully George's comment helps out with this issue.
>>>>>>> One other thought: to see whether SGE has anything to do with
>>>>>>> this, create a hostfile and run the job outside of SGE.
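>>>>>>> For example, with a hand-made hostfile (node names and slot
>>>>>>> counts illustrative):
>>>>>>>
>>>>>>>   # myhosts
>>>>>>>   s0157 slots=8
>>>>>>>   s0158 slots=8
>>>>>>>
>>>>>>>   mpirun --hostfile myhosts -np 16 ./namd2 stmv.namd
>>>>>>>
>>>>>>> That launches over rsh/ssh instead of qrsh and takes SGE out of
>>>>>>> the picture.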
>>>>>>> Rolf
>>>>>>> On 02/26/09 22:10, Ralph Castain wrote:
>>>>>>>> FWIW: I tested the trunk tonight using both SLURM and rsh
>>>>>>>> launchers, and everything checks out fine. However, your job is
>>>>>>>> running under SGE and thus using qrsh, so it is possible the
>>>>>>>> SGE support is having a problem.
>>>>>>>> Perhaps one of the Sun OMPI developers can help here?
>>>>>>>> Ralph
>>>>>>>> On Feb 26, 2009, at 7:21 PM, Ralph Castain wrote:
>>>>>>>>> It looks like the system doesn't know what nodes the procs
>>>>>>>>> are to be placed upon. Can you run this with --display-devel-map?
>>>>>>>>> That will tell us where the system thinks it is placing things.
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>> On Feb 26, 2009, at 3:41 PM, Mostyn Lewis wrote:
>>>>>>>>>> Maybe it's my pine mailer.
>>>>>>>>>> This is a NAMD run on 256 procs across 32 dual-socket
>>>>>>>>>> quad-core AMD Shanghai nodes running a standard benchmark
>>>>>>>>>> called stmv.
>>>>>>>>>> The basic error message, which occurs 31 times, is like:
>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>> The mpirun command has long paths in it, sorry. It's invoking
>>>>>>>>>> a special binding script which in turn launches the NAMD run.
>>>>>>>>>> This works on an older SVN at level 1.4a1r20123 (for 16, 32,
>>>>>>>>>> 64, 128 and 512 procs) but not for this 256 proc run, where
>>>>>>>>>> the older SVN hangs indefinitely polling some completion
>>>>>>>>>> (sm or openib). So, I was trying later SVNs with this 256 proc
>>>>>>>>>> run, hoping the error would go away.
>>>>>>>>>> Here's some of the invocation again. Hope you can read it:
>>>>>>>>>> EAGER_SIZE=32767
>>>>>>>>>> export OMPI_MCA_btl_openib_use_eager_rdma=0
>>>>>>>>>> export OMPI_MCA_btl_openib_eager_limit=$EAGER_SIZE
>>>>>>>>>> export OMPI_MCA_btl_self_eager_limit=$EAGER_SIZE
>>>>>>>>>> export OMPI_MCA_btl_sm_eager_limit=$EAGER_SIZE
>>>>>>>>>> and, unexpanded
>>>>>>>>>> mpirun --prefix $PREFIX -np %PE% $MCA -x
>>>>>>>>>> OMPI_MCA_btl_openib_use_eager_rdma -x
>>>>>>>>>> OMPI_MCA_btl_openib_eager_limit -x
>>>>>>>>>> OMPI_MCA_btl_self_eager_limit -x
>>>>>>>>>> OMPI_MCA_btl_sm_eager_limit -machinefile $HOSTS $MPI_BINDER
>>>>>>>>>> $NAMD2 stmv.namd
>>>>>>>>>> and, expanded
>>>>>>>>>> mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>>>   -np 256 --mca btl sm,openib,self
>>>>>>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>>>>>>   -machinefile /tmp/48292.1.all.q/newhosts
>>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>>>   stmv.namd
>>>>>>>>>> This is all via Sun Grid Engine.
>>>>>>>>>> The OS as indicated above is SuSE SLES 10 SP2.
>>>>>>>>>> DM
>>>>>>>>>> On Thu, 26 Feb 2009, Ralph Castain wrote:
>>>>>>>>>>> I'm sorry, but I can't make any sense of this message.
>>>>>>>>>>> Could you provide a
>>>>>>>>>>> little explanation of what you are doing, what the system
>>>>>>>>>>> looks like, what is
>>>>>>>>>>> supposed to happen, etc? I can barely parse your cmd line...
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ralph
>>>>>>>>>>> On Feb 26, 2009, at 1:03 PM, Mostyn Lewis wrote:
>>>>>>>>>>>> Today's and yesterday's.
>>>>>>>>>>>> 1.4a1r20643_svn
>>>>>>>>>>>> + mpirun --prefix /tools/openmpi/1.4a1r20643_svn/connectx/intel64/10.1.015/openib/suse_sles_10/x86_64/opteron
>>>>>>>>>>>>   -np 256 --mca btl sm,openib,self
>>>>>>>>>>>>   -x OMPI_MCA_btl_openib_use_eager_rdma -x OMPI_MCA_btl_openib_eager_limit
>>>>>>>>>>>>   -x OMPI_MCA_btl_self_eager_limit -x OMPI_MCA_btl_sm_eager_limit
>>>>>>>>>>>>   -machinefile /tmp/48269.1.all.q/newhosts
>>>>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/mpi_binder.MRL
>>>>>>>>>>>>   /ctmp8/mostyn/IMSC/bench_intel_openmpi_I_shang2/intel-10.1.015_ofed_1.3.1_openmpi_1.4a1r20643_svn/NAMD_2.6_Source/Linux-amd64-MPI/namd2
>>>>>>>>>>>>   stmv.namd
>>>>>>>>>>>> [s0164:24296] [[64102,0],16] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0128:24439] [[64102,0],4] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0156:29300] [[64102,0],12] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0168:20585] [[64102,0],20] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> [s0181:19554] [[64102,0],28] ORTE_ERROR_LOG: Not found in file ../../../.././orte/mca/odls/base/odls_base_default_fns.c at line 595
>>>>>>>>>>>> Made with INTEL compilers 10.1.015.
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Mostyn
>>>>>> --
>>>>>> =========================
>>>>>> rolf.vandeva...@sun.com
>>>>>> 781-442-3043
>>>>>> =========================
>>>>> --
>>>>> Jeff Squyres
>>>>> Cisco Systems

--
Jeff Squyres
Cisco Systems
