Some more background information:

1) The environment is all run inside an initrd with a static pbs_mom.

2) The file we change in the Torque distribution is
torque-2.1.2/src/include/dis.h:
---
/* NOTE: increase THE_BUF_SIZE to 131072 for systems > 5k nodes */

/* OLD: #define THE_BUF_SIZE 262144   max size of tcp send buffer (must be big enough to contain all job attributes) */
#define THE_BUF_SIZE 1048576 /* max size of tcp send buffer (must be big enough to contain all job attributes) */
---
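To make the arithmetic behind the failure concrete: the failed 1920-node
launch below reports bufsize=262144, buflen=261801, ct=450, i.e. tcp_puts
was asked to queue another 450 bytes into a send buffer that already held
261801 of its 262144 bytes, and 261801 + 450 = 262251 > 262144. The snippet
below is only a minimal illustration of that kind of fixed-size append
check -- the names (dis_buf, dis_append) are made up, not Torque's actual
tcp_dis.c code -- but it shows why a 1920-node job-attribute stream
overruns the old THE_BUF_SIZE and why the bigger compile-time constant
makes it fit:
---
/* Minimal sketch (hypothetical names, not Torque's actual structures) of a
 * fixed-size DIS-style send buffer: appends fail once the compile-time
 * limit is reached, which is what the tcp_puts error message reports. */
#include <stdio.h>
#include <string.h>

#define THE_BUF_SIZE 262144            /* old limit; we now build with 1048576 */

struct dis_buf {
    char   data[THE_BUF_SIZE];
    size_t used;                       /* bytes already queued ("buflen") */
};

/* Try to queue ct more bytes; refuse if the fixed buffer cannot hold them. */
static int dis_append(struct dis_buf *b, const char *src, size_t ct)
{
    if (b->used + ct > sizeof(b->data)) {
        fprintf(stderr,
                "error! out of space in buffer (bufsize=%zu, buflen=%zu, ct=%zu)\n",
                sizeof(b->data), b->used, ct);
        return -1;                     /* caller sees the commit failure */
    }
    memcpy(b->data + b->used, src, ct);
    b->used += ct;
    return 0;
}

int main(void)
{
    static struct dis_buf b;
    b.used = 261801;                   /* state reported in the failed launch */
    char attrs[450] = {0};             /* next chunk of job attributes */

    /* 261801 + 450 = 262251 > 262144, so this append fails with the old size. */
    return dis_append(&b, attrs, sizeof(attrs)) == 0 ? 0 : 1;
}
---
The other way out, of course, would be for the buffer to be flushed before
it fills rather than grown, which is the behaviour we suspect is going
wrong in possibility 1) below.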
Originally it was set to 262144, but we've increased it. We believe that either:

1) somewhere in torque-2.1.2/src/lib/Libifl/tcp_dis.c, in the tcp_puts
function, a buffer may not be getting flushed correctly, and the pbs_mom
restart is fixing it. We've run 8k MPI processes using a dynamically
linked pbs_mom:

# ldd /apps/torque/sbin/pbs_mom
        libutil.so.1 => /lib64/libutil.so.1 (0x0000002a9566c000)
        libtorque.so.0 => /apps/torque-2.1.2/lib/libtorque.so.0 (0x0000002a95770000)
        libc.so.6 => /lib64/tls/libc.so.6 (0x0000002a958c0000)
        /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)

2) or (less likely) the ash shell not being able to set the following
limits properly could be the issue:

$ ulimit -u 65536
-sh: ulimit: Illegal option -u
$ ulimit -i 4096
-sh: ulimit: Illegal option -i

(A setrlimit-based workaround for this is sketched at the very end of this
message, after the quoted thread.)

We've tried varying sysctl settings and have not seen improvements:
---
$ sysctl -a | grep 262144
net.ipv4.tcp_mem = 196608       262144  393216
net.ipv4.ipfrag_high_thresh = 262144
net.core.rmem_default = 262144
net.core.wmem_default = 262144
---

-----Original Message-----
From: owner-tbird-ad...@sandia.gov [mailto:owner-tbird-ad...@sandia.gov]
On Behalf Of Jeff Squyres
Sent: Saturday, October 21, 2006 7:53 AM
To: Ogden, Jeffry Brandon
Cc: Open MPI Users; tbird-admin
Subject: Re: [OMPI users] OMPI launching problem using TM and openib on
1920 nodes

For those following this thread: there was off-list discussion about this
topic -- re-starting the Torque daemons *seemed* to fix the problem.

On Oct 20, 2006, at 6:00 PM, Ogden, Jeffry Brandon wrote:

> We don't actually have the capability to test the mpiexec + MVAPICH
> launch at the moment. I was able to get a job to launch at 1920 and
> I'm waiting for it to finish. When it is done, I can at least try an
> mpiexec -comm=none launch to see how TM responds to it.
>
>> -----Original Message-----
>> From: owner-tbird-ad...@sandia.gov
>> [mailto:owner-tbird-ad...@sandia.gov] On Behalf Of Jeff Squyres
>> Sent: Friday, October 20, 2006 1:17 PM
>> To: Open MPI Users
>> Cc: tbird-admin
>> Subject: Re: [OMPI users] OMPI launching problem using TM and openib
>> on 1920 nodes
>>
>> This message is coming from torque:
>>
>> [15:15] 69-94-204-35:~/Desktop/torque-2.1.2 % grep -r "out of space
>> in buffer and cannot commit message" *
>> src/lib/Libifl/tcp_dis.c:    DBPRT(("%s: error! out of space in
>> buffer and cannot commit message (bufsize=%d, buflen=%d, ct=%d)\n",
>>
>> Are you able to use OSC mpiexec to launch over the same number of
>> nodes, perchance?
>>
>>
>> On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
>>
>>> We are having quite a bit of trouble reliably launching larger jobs
>>> (1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment.
>>> The launches usually either just hang or fail with output like:
>>>
>>> Cbench numprocs: 1920
>>> Cbench numnodes: 1921
>>> Cbench ppn: 1
>>> Cbench jobname: xhpl-1ppn-1920
>>> Cbench joblaunchmethod: openmpi
>>>
>>> tcp_puts: error! out of space in buffer and cannot commit message
>>> (bufsize=262144, buflen=261801, ct=450)
>>>
>>> [cn1023:02832] pls:tm: start_procs returned error -1
>>> [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at
>>> line 186
>>> [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at
>>> line 490
>>> [cn1023:02832] orterun: spawn failed with errno=-1
>>> [dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>> [dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv failed
>>> with errno=104
>>>
>>> The OMPI environment parameters we are using are:
>>> % env | grep OMPI
>>> OMPI_MCA_oob_tcp_include=eth0
>>> OMPI_MCA_oob_tcp_listen_mode=listen_thread
>>> OMPI_MCA_btl_openib_ib_timeout=18
>>> OMPI_MCA_oob_tcp_listen_thread_max_time=100
>>> OMPI_MCA_oob_tcp_listen_thread_max_queue=100
>>> OMPI_MCA_btl_tcp_if_include=eth0
>>> OMPI_MCA_btl_openib_ib_retry_count=15
>>> OMPI_MCA_btl_openib_ib_cq_size=65536
>>> OMPI_MCA_rmaps_base_schedule_policy=node
>>>
>>> I have full output generated with the following OMPI params
>>> attached:
>>> export OMPI_MCA_pls_tm_debug=1
>>> export OMPI_MCA_pls_tm_verbose=1
>>>
>>> We are running Torque 2.1.2. I'm mostly suspicious of the tcp_puts
>>> error and the 262144 bufsize limit... Any ideas? Thanks.
>>> <xhpl-1ppn-1920..o127407>
>>> <xhpl-1ppn-1920..e127407>
>>> <mime-attachment.txt>
>>
>>
>> --
>> Jeff Squyres
>> Server Virtualization Business Unit
>> Cisco Systems
>>
>>
>> -------------------
>>

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

-------------------
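One follow-up on possibility 2) above (the ash ulimit limitation): since
the initrd shell's ulimit builtin rejects -u and -i, the limits could in
principle be raised without the shell by a small static wrapper that calls
setrlimit(2) and then execs pbs_mom. This is only a sketch of the idea,
not something we have built or tested; the wrapper and the pbs_mom path
are illustrative, and on Linux RLIMIT_NPROC corresponds to ulimit -u while
RLIMIT_SIGPENDING corresponds to ulimit -i:
---
/* Hypothetical static wrapper: raise the limits the ash ulimit builtin
 * cannot set (-u / -i), then exec the real pbs_mom. The values match what
 * we tried to set from the shell. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

static int raise_limit(int resource, rlim_t value, const char *name)
{
    struct rlimit rl = { .rlim_cur = value, .rlim_max = value };

    if (setrlimit(resource, &rl) != 0) {
        perror(name);
        return -1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    (void)argc;

    if (raise_limit(RLIMIT_NPROC, 65536, "RLIMIT_NPROC") != 0 ||        /* ulimit -u */
        raise_limit(RLIMIT_SIGPENDING, 4096, "RLIMIT_SIGPENDING") != 0) /* ulimit -i */
        return 1;

    /* Path is illustrative; adjust for the initrd layout. */
    execv("/apps/torque/sbin/pbs_mom", argv);
    perror("execv");
    return 1;
}
---
The limits set this way survive the exec and are inherited by pbs_mom's
children, so nothing would depend on what the ash builtin supports.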