Hi,

what OFED vendor and version do you use?

Regards,
M
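P.S. In case command output is easier than digging out package names: with the usual OFED userspace tools installed, something like the following should print the stack version and the adapter/firmware (just a suggestion - the exact commands may differ on a vendor-specific stack):

$ ofed_info -s
$ ibv_devinfo | grep -E "hca_id|vendor_id|board_id|fw_ver"

A few (untested) thoughts on your log_num_cq and queue-pair questions are below your quoted mail.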
On Tue, Jul 30, 2013 at 8:42 PM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:
> Dear Open MPI experts,
>
> A user on our cluster has a problem running a fairly big job:
> - the job using 3024 processes (12 per node, 252 nodes) runs fine
> - the job using 4032 processes (12 per node, 336 nodes) produces the error attached below.
>
> The issue at http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages is the
> well-known one; both recommended tweakables (user limits and registered memory
> size) are at their maximum now, nevertheless some queue pair could not be created.
>
> Our blind guess is that the number of completion queues is exhausted.
>
> What happens when raising the value from the default to the maximum?
> What is the largest Open MPI job that has been seen at all?
> What is the largest Open MPI job *using MPI_Alltoallv* that has been seen at all?
> Is there a way to manage the size/the number of queue pairs? (XRC is not available.)
> Is there a way to tell MPI_Alltoallv to use fewer queue pairs, even if this could lead to a slow-down?
>
> There is a suspicious parameter in the mlx4_core module:
> $ modinfo mlx4_core | grep log_num_cq
> parm:           log_num_cq:log maximum number of CQs per HCA (int)
>
> Is this the tweakable parameter?
> What are the default and the maximum value?
>
> Any help would be welcome...
>
> Best,
>
> Paul Kapinos
>
> P.S. There should be no connection problem anywhere between the nodes; a test
> job with one process on each node ran successfully just before starting the
> actual job, which also ran fine for a while - until it called MPI_Alltoallv.
>
> --------------------------------------------------------------------------
> A process failed to create a queue pair. This usually means either
> the device has run out of queue pairs (too many connections) or
> there are insufficient resources available to allocate a queue pair
> (out of memory). The latter can happen if either 1) insufficient
> memory is available, or 2) no more physical memory can be registered
> with the device.
> For more information on memory registration see the Open MPI FAQs at:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>   Local host:       linuxbmc1156.rz.RWTH-Aachen.DE
>   Local device:     mlx4_0
>   Queue pair type:  Reliable connected (RC)
> --------------------------------------------------------------------------
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4021][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** An error occurred in MPI_Alltoallv
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** on communicator MPI_COMM_WORLD
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERR_OTHER: known error not in list
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4024][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4027][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],10][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],1][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],10] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],8] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],9] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 9 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
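Back to the log_num_cq and queue-pair questions - two sketches, untested at your scale, so please treat them as guesses rather than recommendations:

1. log_num_cq: from the modinfo description, the value is presumably the log2 of the maximum number of CQs the driver will allow. If the parameter is exported, the currently loaded value can be read from sysfs, and a different value can be set via a modprobe option (this needs a reload of mlx4_core, in practice usually a reboot; I do not know the default or maximum offhand, that would have to be checked in the driver source):

$ cat /sys/module/mlx4_core/parameters/log_num_cq

# /etc/modprobe.d/mlx4_core.conf -- example value only
options mlx4_core log_num_cq=17

2. Fewer queue pairs without XRC: as far as I know, the openib BTL creates one QP per entry of btl_openib_receive_queues for every connected peer, so a shorter list (e.g. shared receive queues only) should reduce the per-connection QP count, at the price of some performance. The example values below are guesses; the largest buffer size should stay at or above btl_openib_max_send_size:

$ mpiexec --mca btl_openib_receive_queues S,12288,1024,1008,64:S,65536,1024,1008,64 ...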