Some more background information:
1) The environment all runs inside an initrd with a static pbs_mom.
2) The file we change in the Torque distribution is:
        torque-2.1.2/src/include/dis.h
---
/* NOTE:  increase THE_BUF_SIZE to 131072 for systems > 5k nodes */

/* OLD: #define THE_BUF_SIZE 262144  (max size of tcp send buffer;
   must be big enough to contain all job attributes) */
#define THE_BUF_SIZE 1048576 /* max size of tcp send buffer (must be
                                big enough to contain all job attributes) */
---
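
(Rough sizing, assuming the encoded TM start message grows roughly
linearly with node count: the failed 1920-node launch below reports
buflen=261801, about 136 bytes per node, so the old 262144-byte buffer
was over 99.8% full at that scale; by the same linear estimate,
1048576 bytes should reach roughly 7500 nodes.)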

Originally it was set to 262144, but we've increased it.  We believe
one of the following could be the issue:

1) Somewhere in:
        torque-2.1.2/src/lib/Libifl/tcp_dis.c
in the tcp_puts function, some buffer may not be getting flushed
correctly, and the pbs_mom restart is fixing it (see the first sketch
after this list).
We've run 8k MPI processes using a dynamically linked pbs_mom:
# ldd /apps/torque/sbin/pbs_mom
        libutil.so.1 => /lib64/libutil.so.1 (0x0000002a9566c000)
        libtorque.so.0 => /apps/torque-2.1.2/lib/libtorque.so.0
(0x0000002a95770000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a958c0000)
        /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
2) Or the ash shell being unable to set the following limits properly
(less likely; see the second sketch after this list):
$ ulimit -u 65536
-sh: ulimit: Illegal option -u
$ ulimit -i 4096
-sh: ulimit: Illegal option -I
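
To make hypothesis 1) concrete, here is a minimal sketch of the
failure mode we suspect in tcp_puts (our own illustration; the struct
and function names are hypothetical, not the actual torque-2.1.2
code): if the write offset into the send buffer is not reset once data
has been flushed to the socket, the apparent free space shrinks with
every message until the space check fails, and a pbs_mom restart
"fixes" it by re-initializing the buffer.

---
#include <stdio.h>
#include <string.h>

#define THE_BUF_SIZE 262144  /* the old dis.h value */

/* Hypothetical stand-in for torque's per-connection DIS send buffer;
   not the real struct from tcp_dis.c. */
struct tcpbuf {
    char   buf[THE_BUF_SIZE];
    size_t used;  /* bytes queued but not yet flushed */
};

/* Append data, mimicking the space check that produces
   "out of space in buffer and cannot commit message". */
int tcp_puts_sketch(struct tcpbuf *tp, const char *data, size_t ct)
{
    if (tp->used + ct > sizeof(tp->buf)) {
        fprintf(stderr, "tcp_puts: error!  out of space in buffer and "
            "cannot commit message (bufsize=%zu, buflen=%zu, ct=%zu)\n",
            sizeof(tp->buf), tp->used, ct);
        return -1;
    }
    memcpy(tp->buf + tp->used, data, ct);
    tp->used += ct;
    return 0;
}

/* The suspected bug: flushing the bytes to the socket but forgetting
   to reset 'used' means free space never recovers, so a later,
   otherwise small message still fails the check above.  Restarting
   pbs_mom re-initializes the buffer, which would explain why a
   daemon restart clears the problem. */
void flush_without_reset(struct tcpbuf *tp)
{
    /* ... write(sock, tp->buf, tp->used) ... */
    /* missing: tp->used = 0; */
    (void)tp;
}
---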
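
For hypothesis 2), since the initrd's ash rejects ulimit -u and -i,
the limits could instead be raised with setrlimit(2) from a small C
wrapper run before pbs_mom starts.  This is a sketch under the
assumption that ulimit -u maps to RLIMIT_NPROC and ulimit -i to
RLIMIT_SIGPENDING (Linux 2.6.8+); the values mirror the ulimit
invocations above.

---
#include <stdio.h>
#include <sys/resource.h>

/* Raise one limit to the same soft/hard value. */
static int raise_limit(int resource, rlim_t value, const char *name)
{
    struct rlimit rl = { value, value };

    if (setrlimit(resource, &rl) != 0) {
        perror(name);
        return -1;
    }
    return 0;
}

int main(void)
{
    int rc = 0;

    rc |= raise_limit(RLIMIT_NPROC, 65536, "RLIMIT_NPROC");   /* ulimit -u */
#ifdef RLIMIT_SIGPENDING
    rc |= raise_limit(RLIMIT_SIGPENDING, 4096, "RLIMIT_SIGPENDING");  /* ulimit -i */
#endif
    /* ... then exec pbs_mom so it inherits these limits ... */
    return rc ? 1 : 0;
}
---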

We've tried varying sysctl settings and have not seen any
improvement:
---
$ sysctl -a | grep 262144
net.ipv4.tcp_mem = 196608       262144  393216
net.ipv4.ipfrag_high_thresh = 262144
net.core.rmem_default = 262144
net.core.wmem_default = 262144
---
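
One caveat on the sysctl approach: net.core.rmem_default and
net.core.wmem_default only set the size a socket starts with, and the
tcp_puts buffer is torque's own application-level buffer, so kernel
socket tuning may simply be orthogonal to this failure.  As a sanity
check (our own sketch, not torque code), the kernel-side send buffer
a socket actually gets can be inspected as below; setsockopt requests
are capped by net.core.wmem_max, and the kernel reports double the
requested value to account for bookkeeping overhead.

---
#include <stdio.h>
#include <sys/socket.h>

/* Request a larger kernel send buffer on an open socket and report
   what the kernel actually granted. */
int report_sndbuf(int sock)
{
    int requested = 1048576;  /* match the new THE_BUF_SIZE */
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) != 0) {
        perror("setsockopt(SO_SNDBUF)");
        return -1;
    }
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &actual, &len) != 0) {
        perror("getsockopt(SO_SNDBUF)");
        return -1;
    }
    printf("SO_SNDBUF: requested %d, kernel granted %d\n",
           requested, actual);
    return 0;
}
---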

-----Original Message-----
From: owner-tbird-ad...@sandia.gov [mailto:owner-tbird-ad...@sandia.gov]
On Behalf Of Jeff Squyres
Sent: Saturday, October 21, 2006 7:53 AM
To: Ogden, Jeffry Brandon
Cc: Open MPI Users; tbird-admin
Subject: Re: [OMPI users] OMPI launching problem using TM and openib on
1920 nodes

For those following this thread: there was off-list discussion about
this topic -- re-starting the Torque daemons *seemed* to fix the
problem.


On Oct 20, 2006, at 6:00 PM, Ogden, Jeffry Brandon wrote:

> We don't actually have the capability to test the mpiexec + MVAPICH 
> launch at the moment. I was able to get a job to launch at 1920 and 
> I'm waiting for it to finish. When it is done, I can at least try an 
> mpiexec -comm=none launch to see how TM responds to it.
>
>> -----Original Message-----
>> From: owner-tbird-ad...@sandia.gov
>> [mailto:owner-tbird-ad...@sandia.gov] On Behalf Of Jeff Squyres
>> Sent: Friday, October 20, 2006 1:17 PM
>> To: Open MPI Users
>> Cc: tbird-admin
>> Subject: Re: [OMPI users] OMPI launching problem using TM and openib 
>> on 1920 nodes
>>
>> This message is coming from torque:
>>
>> [15:15] 69-94-204-35:~/Desktop/torque-2.1.2 % grep -r "out of space in buffer and cannot commit message" *
>> src/lib/Libifl/tcp_dis.c:      DBPRT(("%s: error!  out of space in buffer and cannot commit message (bufsize=%d, buflen=%d, ct=%d)\n",
>>
>> Are you able to use OSC mpiexec to launch over the same number of 
>> nodes, perchance?
>>
>>
>> On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
>>
>>> We are having quite a bit of trouble reliably launching larger jobs
>>> (1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment.  The
>>> launches usually either just hang or fail with output like:
>>>
>>> Cbench numprocs: 1920
>>> Cbench numnodes: 1921
>>> Cbench ppn: 1
>>> Cbench jobname: xhpl-1ppn-1920
>>> Cbench joblaunchmethod: openmpi
>>>
>>> tcp_puts: error!  out of space in buffer and cannot commit message 
>>> (bufsize=262144, buflen=261801, ct=450)
>>>
>>> [cn1023:02832] pls:tm: start_procs returned error -1
>>> [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 186
>>> [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 490
>>> [cn1023:02832] orterun: spawn failed with errno=-1
>>> [dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>> [dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
>>>
>>> The OMPI environment parameters we are using are:
>>>  %env | grep OMPI
>>>  OMPI_MCA_oob_tcp_include=eth0
>>>  OMPI_MCA_oob_tcp_listen_mode=listen_thread
>>>  OMPI_MCA_btl_openib_ib_timeout=18
>>>  OMPI_MCA_oob_tcp_listen_thread_max_time=100
>>>  OMPI_MCA_oob_tcp_listen_thread_max_queue=100
>>>  OMPI_MCA_btl_tcp_if_include=eth0
>>>  OMPI_MCA_btl_openib_ib_retry_count=15
>>>  OMPI_MCA_btl_openib_ib_cq_size=65536
>>>  OMPI_MCA_rmaps_base_schedule_policy=node
>>>
>>> I have full output generated with the following OMPI params
>>> attached:
>>>  export OMPI_MCA_pls_tm_debug=1
>>>  export OMPI_MCA_pls_tm_verbose=1
>>>
>>> We are running Torque 2.1.2.  I'm mostly suspicious of the tcp_puts 
>>> error and the 262144 bufsize limit... Any ideas?  Thanks.
>>> <xhpl-1ppn-1920..o127407>
>>> <xhpl-1ppn-1920..e127407>
>>> <mime-attachment.txt>
>>
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>>


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

