If you are running with master, I recommend you run
mpirun --mca mpi_add_procs_cutoff 1024 ...

in order to avoid the crash I just reported at https://github.com/open-mpi/ompi/issues/1501
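
Note that MCA parameters can also be set through the environment instead of the command line, e.g.
export OMPI_MCA_mpi_add_procs_cutoff=1024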

Cheers,

Gilles

On 3/28/2016 4:44 PM, Gilles Gouaillardet wrote:
First: does it hang when running on only one node?

When the hang occurs, you can collect stack traces
(run pstack on mpitest)
to see where it hangs.
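
For example, on each host while the job is hung (pgrep is just one way to find the pid of your test program):
pstack $(pgrep mpitest)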

Since you configured with --disable-dlopen, your BTL has been slurped into the Open MPI library. That means some parts of it are executed, and it could be responsible for the hang.

Note that if you run
mpirun --mca btl self,sm -np 2 ...
on two hosts, it can never work, since no BTL is available that lets MPI tasks on different hosts communicate.
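
For example, adding the tcp BTL gives the two hosts a way to reach each other:
mpirun --mca btl self,sm,tcp -np 2 -H smallMPI,bigMPI ./mpitest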

It seems something is wrong on master; I will look into it now.

Do smallMPI and bigMPI imply one host is little endian and the other is big endian?
If yes, then you need to configure with --enable-heterogeneous.
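
For example, re-using the configure flags from your earlier build:
./configure --enable-debug --enable-debug-symbols --disable-dlopen --enable-heterogeneous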

Cheers,

Gilles

On 3/28/2016 4:26 PM, dpchoudh . wrote:
Hello Gilles

Per your suggestion, installing libnl3-devel does fix the mpicc issue, but there still seems to be another issue down the road: the generated executable seems to hang. I have tried the sm, tcp and openib BTLs, all with the same result:

[durga@smallMPI ~]$ mpirun -np 2 -H smallMPI,bigMPI -mca btl self,sm ./mpitest <--- Hangs

The source code for the simple test is as follows:

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int world_size, world_rank, name_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(hostname, &name_len);
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           hostname, world_rank, world_size);
    MPI_Finalize();
    return 0;
}


What do I do now?

Thanks
Durga

We learn from history that we never learn from history.

On Mon, Mar 28, 2016 at 2:37 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Does this happen only with master?

    What does
    ldd mpicc
    say? Does it require both libnl and libnl3?
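
    For example, an illustrative one-liner (the exact sonames may
    differ on your distro):
    ldd $(which mpicc) | grep -i libnl
    If both libnl.so.1 and libnl-3.so show up, both flavors are being
    pulled in.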

    libnl3 is used by Open MPI if the libnl3-devel package is installed,
    and that is not the case on your system.

    A possible root cause is that third-party libs use libnl3 while the
    reachable/netlink component tries to use libnl; in this case,
    installing libnl3-devel should fix your issue.
    /* you will need to re-run configure after that */
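
    For example (the configure flags are re-used from your original
    command):
    sudo yum install libnl3-devel
    ./configure --enable-debug --enable-debug-symbols --disable-dlopen
    make && make install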

    Another possible root cause is that some third-party libs use libnl
    and others use libnl3; in this case, I am afraid there is no simple
    workaround. If installing libnl3-devel does not solve your issue,
    you can give a try to
    https://github.com/open-mpi/ompi/pull/1014
    At the very least, it will abort with an error message that states
    which lib is using libnl and which is using libnl3.

    In that case, I am afraid the only option is to manually disable
    some components, so that only one flavor of libnl is used.
    That can be achieved by adding an empty .opal_ignore file in the
    dir of each component you want to disable.
    /* you will need to re-run autogen.pl after that */
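
    For example, to disable the reachable/netlink component mentioned
    above (the path is illustrative of the usual opal/mca layout;
    adjust it to the component you actually want to disable):
    touch opal/mca/reachable/netlink/.opal_ignore
    ./autogen.pl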

    Cheers,

    Gilles

    On 3/28/2016 3:16 PM, dpchoudh . wrote:
    Hello all

    The system in question is a CentOS 7 box that has been running
    Open MPI, both the master branch and the 1.10.2 release, happily
    until now.

    Just now, in order to debug something, I recompiled with the
    following options:

    $ ./configure --enable-debug --enable-debug-symbols --disable-dlopen

    The compilation and install were successful; however, mpicc now
    crashes like this:

    [durga@smallMPI ~]$ mpicc -Wall -Wextra -o mpitest mpitest.c
    mpicc: route/tc.c:973: rtnl_tc_register: Assertion `0' failed.
    Aborted (core dumped)


    Searching the mailing list archive, I found two posts that describe
    similar situations:

    https://www.open-mpi.org/community/lists/devel/2015/08/17812.php
    http://www.open-mpi.org/community/lists/users/2015/11/28016.php

    However, the solution proposed in these, disabling verbs, is not
    acceptable to me for the following reason: I am trying to implement
    a new BTL by reverse engineering the openib BTL, and I am using a
    QLogic HCA for this purpose. (Please note that I cannot use PSM, as
    I am writing code for a BTL.)

    Are there any more acceptable solutions for this? Here is the
    list of nl libraries on my box:

    [durga@smallMPI ~]$ sudo yum list installed | grep libnl
    libnl.x86_64 1.1.4-3.el7                     @anaconda
    libnl-devel.x86_64 1.1.4-3.el7                     @anaconda
    libnl3.x86_64 3.2.21-10.el7                   @base
    libnl3-cli.x86_64 3.2.21-10.el7                   @base

    Uninstalling libnl3 is not an option either: it seems yum wants to
    uninstall around 100-odd other packages because of dependencies,
    which would essentially render the machine unusable.

    Please help!

    Thanks in advance
    Durga

    We learn from history that we never learn from history.

