We don't actually have the capability to test the mpiexec + MVAPICH
launch at the moment. I was able to get a 1920-process job to launch and I'm
waiting for it to finish. When it is done, I can at least try an mpiexec
-comm=none launch to see how TM responds to it.
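
For reference on the buffer angle: the tcp_puts failure below comes out of a
fixed-size send buffer in Torque's DIS layer. Here is a minimal C sketch of
that failure mode (illustrative only, not Torque's actual code; DIS_BUF_SIZE,
dis_put, and the per-host entry size are made-up names/values mirroring the
bufsize=262144 and ct=450 from the log):

  /* Illustrative sketch only -- not Torque source. Shows how a fixed-size
     commit buffer rejects a message once the pending bytes would exceed
     capacity, as in the tcp_puts error in the log below. */
  #include <stdio.h>
  #include <string.h>

  #define DIS_BUF_SIZE 262144           /* fixed cap, like bufsize=262144 */

  struct dis_buf {
      char   data[DIS_BUF_SIZE];
      size_t len;                       /* bytes already buffered */
  };

  /* Append ct bytes; fail like "out of space in buffer and cannot
     commit message" when the buffer cannot hold them. */
  static int dis_put(struct dis_buf *b, const char *src, size_t ct)
  {
      if (b->len + ct > DIS_BUF_SIZE) {
          fprintf(stderr, "error! out of space in buffer "
                  "(bufsize=%d, buflen=%zu, ct=%zu)\n",
                  DIS_BUF_SIZE, b->len, ct);
          return -1;
      }
      memcpy(b->data + b->len, src, ct);
      b->len += ct;
      return 0;
  }

  int main(void)
  {
      static struct dis_buf b;          /* static: 256 KB is too large for some stacks */
      char entry[450];                  /* ~ct=450 per host, as in the log */
      memset(entry, 'x', sizeof(entry));

      /* Marshal one entry per host of a 1920-node launch. */
      for (int host = 0; host < 1920; host++) {
          if (dis_put(&b, entry, sizeof(entry)) != 0) {
              fprintf(stderr, "failed at host %d of 1920\n", host);
              return 1;
          }
      }
      return 0;
  }

With roughly one 450-byte entry per host, a 262144-byte buffer holds about 582
entries, which lines up with a 1920-node launch dying during start_procs.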

> -----Original Message-----
> From: owner-tbird-ad...@sandia.gov 
> [mailto:owner-tbird-ad...@sandia.gov] On Behalf Of Jeff Squyres
> Sent: Friday, October 20, 2006 1:17 PM
> To: Open MPI Users
> Cc: tbird-admin
> Subject: Re: [OMPI users] OMPI launching problem using TM and 
> openib on 1920 nodes
> 
> This message is coming from torque:
> 
> [15:15] 69-94-204-35:~/Desktop/torque-2.1.2 % grep -r "out of space in buffer and cannot commit message" *
> src/lib/Libifl/tcp_dis.c:      DBPRT(("%s: error!  out of space in buffer and cannot commit message (bufsize=%d, buflen=%d, ct=%d)\n",
> 
> Are you able to use OSC mpiexec to launch over the same number of  
> nodes, perchance?
> 
> 
> On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
> 
> > We are having quite a bit of trouble reliably launching larger jobs
> > (1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment.  The
> > launches usually either just hang or fail with output like:
> >
> > Cbench numprocs: 1920
> > Cbench numnodes: 1921
> > Cbench ppn: 1
> > Cbench jobname: xhpl-1ppn-1920
> > Cbench joblaunchmethod: openmpi
> >
> > tcp_puts: error!  out of space in buffer and cannot commit message (bufsize=262144, buflen=261801, ct=450)
> >
> > [cn1023:02832] pls:tm: start_procs returned error -1
> > [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 186
> > [cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 490
> > [cn1023:02832] orterun: spawn failed with errno=-1
> > [dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> > [dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
> >
> > The OMPI environment parameters we are using are:
> >  %env | grep OMPI
> >  OMPI_MCA_oob_tcp_include=eth0
> >  OMPI_MCA_oob_tcp_listen_mode=listen_thread
> >  OMPI_MCA_btl_openib_ib_timeout=18
> >  OMPI_MCA_oob_tcp_listen_thread_max_time=100
> >  OMPI_MCA_oob_tcp_listen_thread_max_queue=100
> >  OMPI_MCA_btl_tcp_if_include=eth0
> >  OMPI_MCA_btl_openib_ib_retry_count=15
> >  OMPI_MCA_btl_openib_ib_cq_size=65536
> >  OMPI_MCA_rmaps_base_schedule_policy=node
> >
> > I have the full output generated with the following OMPI params
> > attached:
> >  export OMPI_MCA_pls_tm_debug=1
> >  export OMPI_MCA_pls_tm_verbose=1
> >
> > We are running Torque 2.1.2.  I'm mostly suspicious of the tcp_puts
> > error and the 262144 bufsize limit... Any ideas?  Thanks.
> > <xhpl-1ppn-1920..o127407>
> > <xhpl-1ppn-1920..e127407>
> > <mime-attachment.txt>
> 
> 
> -- 
> Jeff Squyres
> Server Virtualization Business Unit
> Cisco Systems
> 
> 