Hi,
What OFED vendor and version do you use?
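(For example, assuming a Mellanox OFED stack, something like

$ ofed_info -s

should print it; the exact command may differ for other vendors' distributions.)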
Regards
M

On Tue, Jul 30, 2013 at 8:42 PM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

> Dear Open MPI experts,
>
> A user at our cluster has a problem running a rather big job:
> - a job using 3024 processes (12 per node, 252 nodes) runs fine;
> - a job using 4032 processes (12 per node, 336 nodes) produces the error
> attached below.
>
> Well, the http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> issue is a well-known one; both recommended tweakables (user limits and
> registered memory size) are already at their maximum, yet some queue pair
> still could not be created.
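>
> For reference, this is roughly how we checked both tweakables; the sysfs
> paths assume mlx4 hardware and that the mlx4_core module exposes these
> parameters there:
>
> $ ulimit -l                                              # locked-memory limit ("unlimited" here)
> $ cat /sys/module/mlx4_core/parameters/log_num_mtt       # registered memory: number of MTT entries
> $ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg  # registered memory: MTT entries per segment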
>
> Our blind guess is that the number of completion queues is exhausted.
>
> What happens when raising this value from the default to the maximum?
> What is the largest Open MPI job size that has been seen at all?
> What is the largest Open MPI job size *using MPI_Alltoallv* that has been
> seen at all?
> Is there a way to manage the size and the number of queue pairs? (XRC is not
> available here.)
> Is there a way to tell MPI_Alltoallv to use fewer queue pairs, even if this
> could lead to a slow-down? (A rough sketch of what we mean follows below.)
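>
> (By "manage" we mean, roughly, a sketch like the following, assuming the
> openib BTL's btl_openib_receive_queues parameter is the right knob; the
> queue specification below is only a placeholder, not a recommendation:
>
> $ ompi_info --param btl openib | grep receive_queues     # show the current specification
> $ mpiexec --mca btl_openib_receive_queues S,65536,1024,1008,64 ...
>
> i.e. fewer receive-queue specifications and hence, as far as we understand,
> fewer queue pairs per connection, at the cost of performance.)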
>
> There is a suspicious parameter in the mlx4_core module:
> $ modinfo mlx4_core | grep log_num_cq
> parm:           log_num_cq:log maximum number of CQs per HCA  (int)
>
> Is this the parameter to tweak?
> What are its default and maximum values?
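>
> (If it is, our understanding is that such mlx4_core parameters are set via a
> modprobe configuration file (the file name below is just the usual
> convention) and only take effect after reloading the driver or rebooting;
> the value 17 is a pure placeholder:
>
> $ echo "options mlx4_core log_num_cq=17" >> /etc/modprobe.d/mlx4_core.conf
>
> We have not tried this yet.)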
>
> Any help would be welcome...
>
> Best,
>
> Paul Kapinos
>
> P.S. There should be no connectivity problem between the nodes; a test job
> with one process on each node ran successfully just before starting the
> actual job, and the actual job also ran OK for a while, until it called
> MPI_Alltoallv.
>
>
> --------------------------------------------------------------------------
> A process failed to create a queue pair. This usually means either
> the device has run out of queue pairs (too many connections) or
> there are insufficient resources available to allocate a queue pair
> (out of memory). The latter can happen if either 1) insufficient
> memory is available, or 2) no more physical memory can be registered
> with the device.
>
> For more information on memory registration see the Open MPI FAQs at:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host:             linuxbmc1156.rz.RWTH-Aachen.DE
> Local device:           mlx4_0
> Queue pair type:        Reliable connected (RC)
> --------------------------------------------------------------------------
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4021][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** An error occurred in MPI_Alltoallv
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** on communicator MPI_COMM_WORLD
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERR_OTHER: known error not in list
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4024][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4027][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],10][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],1][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],10] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],8] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],9] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 9 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
>
> --
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
