Hello Gilles,

Thanks for your help.
My question was more of a sanity check on myself. That little program I
sent looked correct to me; do you see anything wrong with it?

What I am running on my setup is an instrumented OMPI stack, taken from
git HEAD, in an attempt to understand how some of the internals work. If
you think the code is correct, it is quite possible that one of those
'instrumentations' is causing this.

And BTW, adding -mca pml ob1 makes the code hang at MPI_Send() (as opposed
to MPI_Recv()):

[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] mca: bml: Using tcp btl for send to [[51894,1],1] on node 10.10.10.11
[smallMPI:51673] btl: tcp: attempting to connect() to [[51894,1],1] address 10.10.10.11 on port 1024   <--- Hangs here

But 10.10.10.11 is pingable (although ping only exercises ICMP, not TCP;
see the small standalone TCP connect test appended after the quoted thread
below):

[durga@smallMPI ~]$ ping bigMPI
PING bigMPI (10.10.10.11) 56(84) bytes of data.
64 bytes from bigMPI (10.10.10.11): icmp_seq=1 ttl=64 time=0.247 ms


We learn from history that we never learn from history.


On Sun, Apr 3, 2016 at 8:04 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

> Hi,
>
> per a previous message, can you give a try to
> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp --mca pml ob1 ./mpitest
>
> if it still hangs, the issue could be that Open MPI thinks some subnets
> are reachable when they are not.
>
> for diagnostics:
> mpirun --mca btl_base_verbose 100 ...
>
> you can explicitly include/exclude subnets with
> --mca btl_tcp_if_include xxx
> or
> --mca btl_tcp_if_exclude yyy
>
> for example,
> mpirun --mca btl_tcp_if_include 192.168.0.0/24 -np 2 -hostfile ~/hostfile
> --mca btl self,tcp --mca pml ob1 ./mpitest
> should do the trick
>
> Cheers,
>
> Gilles
>
>
> On 4/4/2016 8:32 AM, dpchoudh . wrote:
>
> Hello all
>
> I don't mean to be competing for the 'silliest question of the year
> award', but I can't figure this out on my own:
>
> My 'cluster' has 2 machines, bigMPI and smallMPI. They are connected via
> several (types of) networks and the connectivity is OK.
>
> In this setup, the following program hangs after printing
>
> Hello world from processor smallMPI, rank 0 out of 2 processors
> Hello world from processor bigMPI, rank 1 out of 2 processors
> smallMPI sent haha!
>
> Obviously it is hanging at MPI_Recv(). But why? My command line is as
> follows, but this happens if I try the openib BTL (instead of TCP) as well.
>
> mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp ./mpitest
>
> It must be something *really* trivial, but I am drawing a blank right now.
>
> Please help!
>
> #include <mpi.h>
> #include <stdio.h>
> #include <string.h>
>
> int main(int argc, char** argv)
> {
>     int world_size, world_rank, name_len;
>     char hostname[MPI_MAX_PROCESSOR_NAME], buf[8];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>     MPI_Get_processor_name(hostname, &name_len);
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            hostname, world_rank, world_size);
>     if (world_rank == 1)
>     {
>         MPI_Recv(buf, 6, MPI_CHAR, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("%s received %s\n", hostname, buf);
>     }
>     else
>     {
>         strcpy(buf, "haha!");
>         MPI_Send(buf, 6, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
>         printf("%s sent %s\n", hostname, buf);
>     }
>     MPI_Barrier(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
>
>
> We learn from history that we never learn from history.
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/04/28876.php
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/04/28877.php
>
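
P.S. Since ping only exercises ICMP, a quick way to separate a firewall or
routing problem from anything in the MPI code is to try a plain TCP
connect() to the address and port the BTL reported in the verbose output
(10.10.10.11, port 1024, in my case). Below is a minimal sketch I use for
that; the file name tcp_probe.c and the command-line arguments are just for
illustration, and the port number should be adjusted to whatever the BTL
actually reports on a given run.

/* tcp_probe.c - check that a plain TCP connection to the BTL's address and
 * port completes, independently of Open MPI.  ping only tests ICMP, so a
 * firewall dropping TCP to that port would still look "pingable".
 * Build: cc -o tcp_probe tcp_probe.c
 * Usage: ./tcp_probe 10.10.10.11 1024
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <ipv4-address> <port>\n", argv[0]);
        return 1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short) atoi(argv[2]));
    if (inet_pton(AF_INET, argv[1], &addr.sin_addr) != 1) {
        fprintf(stderr, "bad IPv4 address: %s\n", argv[1]);
        return 1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    /* Blocking connect: if this hangs or fails, the problem is in the
     * network/firewall path, not in the MPI test program. */
    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) != 0) {
        perror("connect");
        close(fd);
        return 1;
    }

    printf("TCP connect to %s:%s succeeded\n", argv[1], argv[2]);
    close(fd);
    return 0;
}

If this connects immediately while the BTL still hangs, the issue is more
likely interface/subnet selection (the btl_tcp_if_include/exclude settings
you suggested); if it hangs too, it points at a firewall or routing problem
on that port.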