Vanilla Linux OFED from RPMs for Scientific Linux release 6.4 (Carbon) (= RHEL 6.4).
No ofed_info available :-(

On 07/31/13 16:59, Mike Dubman wrote:
Hi,
Which OFED vendor and version do you use?
Regards
M


On Tue, Jul 30, 2013 at 8:42 PM, Paul Kapinos <kapi...@rz.rwth-aachen.de> wrote:

    Dear Open MPI experts,

    A user at our cluster has a problem running a rather big job:
    - a job using 3024 processes (12 per node, 252 nodes) runs fine;
    - a job using 4032 processes (12 per node, 336 nodes) produces the error
    attached below.

    Well, the FAQ entry
    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
    is a well-known one; both recommended tweakables (the user limits and the
    registered memory size) are already at their maximum, yet some queue pairs
    still could not be created.
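
    For reference, here is how we checked both tweakables (a quick sketch; the
    sysfs paths assume the mlx4_core driver, and log_num_mtt may not exist in
    every driver version):

    $ ulimit -l     # locked-memory limit; raised to "unlimited" on our nodes
    $ cat /sys/module/mlx4_core/parameters/log_num_mtt 2>/dev/null
    $ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg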

    Our blind guess is that the number of completion queues is exhausted.

    What happens when raising these values from the standard to the maximum?
    What is the largest Open MPI job that has been seen at all?
    What is the largest Open MPI job *using MPI_Alltoallv* that has been seen?
    Is there a way to manage the size/the number of queue pairs? (XRC is not
    available here.)
    Is there a way to tell MPI_Alltoallv to use fewer queue pairs, even when
    this could lead to a slow-down? (See the sketch right below.)
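
    (One workaround we could imagine, per the receive-queues entry in the Open
    MPI FAQ: switch the openib BTL from per-peer (P) to shared receive queues
    (S), which should need far fewer resources per connection. A sketch only -
    the queue sizes below are placeholders, not tested values:

    $ mpiexec --mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 \
          -np 4032 ./our_app      # "./our_app" stands for the actual binary

    Would this be expected to reduce the queue pair consumption?)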

    There is a suspicious parameter in the mlx4_core module:
    $ modinfo mlx4_core | grep log_num_cq
    parm:           log_num_cq:log maximum number of CQs per HCA  (int)

    Is this the tweakable parameter we are looking for?
    What are its default and maximum values?
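
    If it is, we would try to raise it along these lines (a sketch; the file
    name and the value 17 are guesses, and the mlx4_core module has to be
    reloaded - i.e. the IB stack restarted - before the change takes effect):

    $ echo 'options mlx4_core log_num_cq=17' >> /etc/modprobe.d/mlx4_core.conf
    $ cat /sys/module/mlx4_core/parameters/log_num_cq   # verify after reload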

    Any help would be welcome...

    Best,

    Paul Kapinos

    P.S. There should be no connectivity problem between the nodes; a test job
    with one process on each node ran successfully just before the actual job
    was started, and the actual job also ran OK for a while - until calling
    MPI_Alltoallv.






    
    --------------------------------------------------------------------------
    A process failed to create a queue pair. This usually means either
    the device has run out of queue pairs (too many connections) or
    there are insufficient resources available to allocate a queue pair
    (out of memory). The latter can happen if either 1) insufficient
    memory is available, or 2) no more physical memory can be registered
    with the device.

    For more information on memory registration see the Open MPI FAQs at:
    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

    Local host:             linuxbmc1156.rz.RWTH-Aachen.DE
    Local device:           mlx4_0
    Queue pair type:        Reliable connected (RC)
    
    --------------------------------------------------------------------------
    [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4021][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
    [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** An error occurred in MPI_Alltoallv
    [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** on communicator MPI_COMM_WORLD
    [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERR_OTHER: known error not in list
    [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
    [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4024][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
    [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4027][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
    [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],10][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
    [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],1][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],10] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],8] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],9] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 9 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
    [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal

    --
    Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
    RWTH Aachen University, Center for Computing and Communication
    Seffenter Weg 23,  D 52074  Aachen (Germany)
    Tel: +49 241/80-24915






_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915
