-----Original Message-----
From: owner-tbird-ad...@sandia.gov
[mailto:owner-tbird-ad...@sandia.gov] On Behalf Of Jeff Squyres
Sent: Friday, October 20, 2006 1:17 PM
To: Open MPI Users
Cc: tbird-admin
Subject: Re: [OMPI users] OMPI launching problem using TM and
openib on 1920 nodes
This message is coming from Torque:
[15:15] 69-94-204-35:~/Desktop/torque-2.1.2 % grep -r "out of space in buffer and cannot commit message" *
src/lib/Libifl/tcp_dis.c: DBPRT(("%s: error! out of space in buffer and cannot commit message (bufsize=%d, buflen=%d, ct=%d)\n",
Are you able to use OSC mpiexec to launch over the same number of
nodes, perchance?
On Oct 20, 2006, at 12:23 PM, Ogden, Jeffry Brandon wrote:
We are having quite a bit of trouble reliably launching larger jobs
(1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment. The
launches usually either just hang or fail with output like:
Cbench numprocs: 1920
Cbench numnodes: 1921
Cbench ppn: 1
Cbench jobname: xhpl-1ppn-1920
Cbench joblaunchmethod: openmpi
tcp_puts: error! out of space in buffer and cannot commit message (bufsize=262144, buflen=261801, ct=450)
[cn1023:02832] pls:tm: start_procs returned error -1
[cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 186
[cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 490
[cn1023:02832] orterun: spawn failed with errno=-1
[dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
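For triaging output like this across many daemons, a small script can tally the readv failures per errno value. This is only a minimal sketch and not part of Open MPI: the function names `tally_readv_errors` and `describe` and the regular expression are illustrative assumptions based solely on the log lines shown here. On Linux/x86, errno 104 is ECONNRESET (connection reset by peer); other platforms may number it differently, so the lookup uses the local table.

```python
import errno
import os
import re
from collections import Counter

# Matches lines such as:
# [dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
READV_RE = re.compile(
    r"\[(?P<host>[^:\]]+):\d+\]\s+\[[\d,]+\]-\[[\d,]+\]\s+"
    r"mca_oob_tcp_msg_recv:\s+readv\s+failed\s+with\s+errno=(?P<errno>\d+)"
)


def tally_readv_errors(log_text):
    """Return a Counter mapping errno value -> number of readv failures."""
    # Collapse all whitespace so lines wrapped by a mail client still match.
    flat = " ".join(log_text.split())
    return Counter(int(m.group("errno")) for m in READV_RE.finditer(flat))


def describe(code):
    """Name an errno value using the local platform's table."""
    name = errno.errorcode.get(code, "unknown")
    return f"{code} ({name}: {os.strerror(code)})"


if __name__ == "__main__":
    sample = """
    [dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv
    failed with
    errno=104
    [dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
    """
    for code, count in tally_readv_errors(sample).items():
        print(f"{count} failure(s) with errno={describe(code)}")
```

A connection reset on every daemon at once would be consistent with orterun exiting after the failed spawn, though the log alone does not prove that.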
The OMPI environment parameters we are using are:
%env | grep OMPI
OMPI_MCA_oob_tcp_include=eth0
OMPI_MCA_oob_tcp_listen_mode=listen_thread
OMPI_MCA_btl_openib_ib_timeout=18
OMPI_MCA_oob_tcp_listen_thread_max_time=100
OMPI_MCA_oob_tcp_listen_thread_max_queue=100
OMPI_MCA_btl_tcp_if_include=eth0
OMPI_MCA_btl_openib_ib_retry_count=15
OMPI_MCA_btl_openib_ib_cq_size=65536
OMPI_MCA_rmaps_base_schedule_policy=node
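Open MPI treats an environment variable named `OMPI_MCA_<param>` as the MCA parameter `<param>`, so each setting above corresponds to `--mca <param> <value>` on the mpirun command line. The sketch below shows that mapping, which can help when reproducing a run without the environment; the helper name `mca_args_from_env` is made up for illustration.

```python
import os

PREFIX = "OMPI_MCA_"


def mca_args_from_env(env=None):
    """Translate OMPI_MCA_* variables into a flat list of --mca arguments."""
    env = os.environ if env is None else env
    args = []
    for key in sorted(env):
        if key.startswith(PREFIX):
            # Strip the prefix to recover the MCA parameter name.
            args += ["--mca", key[len(PREFIX):], env[key]]
    return args


if __name__ == "__main__":
    example = {
        "OMPI_MCA_oob_tcp_include": "eth0",
        "OMPI_MCA_btl_openib_ib_timeout": "18",
        "HOME": "/home/user",  # ignored: not an MCA variable
    }
    print(" ".join(mca_args_from_env(example)))
```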
I have attached the full output generated with the following OMPI
params:
export OMPI_MCA_pls_tm_debug=1
export OMPI_MCA_pls_tm_verbose=1
We are running Torque 2.1.2. I'm mostly suspicious of the tcp_puts
error and the 262144 bufsize limit... Any ideas? Thanks.
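The numbers in the tcp_puts error can be checked with a little arithmetic. The sketch below parses the message and compares the space left in the buffer with `ct`. It assumes, and nothing in this thread confirms it, that `buflen` is the number of bytes already queued and `ct` is the number of bytes Torque was trying to append. The function name `parse_tcp_puts_error` is hypothetical.

```python
import re

# Tolerates whitespace (including a wrapped line) after each comma.
TCP_PUTS_RE = re.compile(r"bufsize=(\d+),\s*buflen=(\d+),\s*ct=(\d+)")


def parse_tcp_puts_error(message):
    """Extract the buffer figures and compute free space and shortfall."""
    m = TCP_PUTS_RE.search(message)
    if m is None:
        raise ValueError("not a tcp_puts out-of-space message")
    bufsize, buflen, ct = map(int, m.groups())
    free = bufsize - buflen  # bytes still available, under the assumption above
    return {
        "bufsize": bufsize,
        "buflen": buflen,
        "ct": ct,
        "free": free,
        "shortfall": max(ct - free, 0),
    }


if __name__ == "__main__":
    msg = ("tcp_puts: error! out of space in buffer and cannot commit message "
           "(bufsize=262144, buflen=261801, ct=450)")
    print(parse_tcp_puts_error(msg))
```

With the values reported above, 262144 - 261801 leaves 343 bytes free, which is 107 bytes short of the 450 requested under that reading. 262144 bytes is exactly 256 KiB.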
<xhpl-1ppn-1920..o127407>
<xhpl-1ppn-1920..e127407>
<mime-attachment.txt>
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
-------------------