Your program works fine in my environment.

This is typical of a firewall running on your host(s); can you double check that?

A simple way to check is to run, on one node:
10.10.10.11# nc -l 1024

and on the other node:
echo ahah | nc 10.10.10.11 1024

The first command should print "ahah" unless the host is unreachable and/or the TCP connection is denied by the firewall.
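
If the firewall does turn out to be the problem, here is a rough sketch of what to try (this assumes a firewalld-based Linux distro; adjust for your setup):

sudo firewall-cmd --state          # check whether firewalld is running on this node
sudo systemctl stop firewalld      # temporarily stop it, then re-run your mpirun test

Another option is to pin the tcp btl to a fixed port range and open only that range, for example:

mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 ...
sudo firewall-cmd --add-port=10000-10099/tcp     # run on both nodes

(the port values above are just placeholders.)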

Cheers,

Gilles


On 4/4/2016 9:44 AM, dpchoudh . wrote:
Hello Gilles

Thanks for your help.

My question was more of a sanity check on myself. That little program I sent looked correct to me; do you see anything wrong with it?

What I am running on my setup is an instrumented OMPI stack, taken from git HEAD, in an attempt to understand how some of the internals work. If you think the code is correct, it is quite possible that one of those 'instrumentations' is causing this.

And BTW, adding -mca pml ob1 makes the code hang at MPI_Send (as opposed to MPI_Recv())

[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] btl: tcp: attempting to connect() to [[51894,1],1] address 10.10.10.11 on port 1024   <--- Hangs here

But 10.10.10.11 is pingable:
[durga@smallMPI ~]$ ping bigMPI
PING bigMPI (10.10.10.11) 56(84) bytes of data.
64 bytes from bigMPI (10.10.10.11): icmp_seq=1 ttl=64 time=0.247 ms


We learn from history that we never learn from history.

On Sun, Apr 3, 2016 at 8:04 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Hi,

    per a previous message, can you try
    mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp --mca pml ob1 ./mpitest

    if it still hangs, the issue could be that Open MPI thinks some subnets
    are reachable when they are not.

    for diagnostics:
    mpirun --mca btl_base_verbose 100 ...

    you can explicitly include/exclude subnets with
    --mca btl_tcp_if_include xxx
    or
    --mca btl_tcp_if_exclude yyy

    for example,
    mpirun --mca btl_tcp_if_include 192.168.0.0/24 -np 2 -hostfile ~/hostfile --mca btl self,tcp --mca pml ob1 ./mpitest
    should do the trick
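
    (to pick the right value for btl_tcp_if_include, it can help to first list
    the interfaces/subnets that actually exist on each node, e.g.
    ip addr show
    on both hosts, and then pass a subnet the two nodes share; the
    192.168.0.0/24 in the example above is just a placeholder.)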

    Cheers,

    Gilles




    On 4/4/2016 8:32 AM, dpchoudh . wrote:
    Hello all

    I don't mean to be competing for the 'silliest question of the
    year award', but I can't figure this out on my own:

    My 'cluster' has 2 machines, bigMPI and smallMPI. They are
    connected via several (types of) networks and the connectivity is OK.

    In this setup, the following program hangs after printing

    Hello world from processor smallMPI, rank 0 out of 2 processors
    Hello world from processor bigMPI, rank 1 out of 2 processors
    smallMPI sent haha!


    Obviously it is hanging at MPI_Recv(). But why? My command line is as
    follows, but the same thing happens if I use the openib BTL (instead of
    TCP) as well.

    mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp ./mpitest

    It must be something *really* trivial, but I am drawing a blank
    right now.

    Please help!

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char** argv)
    {
        int world_size, world_rank, name_len;
        char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Get_processor_name(hostname, &name_len);
        printf("Hello world from processor %s, rank %d out of %d processors\n",
               hostname, world_rank, world_size);
        if (world_rank == 1)
        {
            MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s received %s\n", hostname, buf);
        }
        else
        {
            strcpy(buf, "haha!");
            MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
            printf("%s sent %s\n", hostname, buf);
        }
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }
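
    (for reference, I build the above with something like
    mpicc -o mpitest mpitest.c
    and launch it with the mpirun command shown earlier.)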



    We learn from history that we never learn from history.





