[OMPI users] orted seg fault when using MPI_Comm_spawn on more than one host

2015-01-26 Thread Evan

Hi,

I am using OpenMPI 1.8.4 on an Ubuntu 14.04 machine and 5 Ubuntu 12.04 
machines.  I am using ssh to launch MPI jobs, and I'm able to run simple 
commands like 'mpirun -np 8 --host localhost,pachy1 hostname' and get 
the expected output (pachy1 being an entry in my /etc/hosts file).


I started using MPI_Comm_spawn in my app with the intent of NOT calling 
mpirun to launch the program that calls MPI_Comm_spawn (my attempt at 
using the singleton MPI_INIT pattern described in section 10.5.2 of the 
MPI 3.0 standard).  The app needs to launch an MPI job of a given size 
from a given hostfile, and that job needs to report some info back to 
the app, so MPI_Comm_spawn seemed like my best bet.  The app will only 
rarely be used this way, which is why mpirun is not used to launch the 
app that acts as the parent in the MPI_Comm_spawn operation.  This 
pattern works fine if the only entries in the hostfile are 'localhost'.  
However, if I add a host that isn't local, I get a segmentation fault 
from the orted process.
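
(The report-back part isn't shown in the distilled example further down, 
which just spawns 'hostname'.  Roughly, I'd expect the spawned side to look 
something like the following sketch: a hypothetical worker that sends its 
hostname back to rank 0 of the parent over the intercommunicator obtained 
from MPI_Comm_get_parent.  The parent would post matching receives on the 
intercommunicator returned by MPI_Comm_spawn.)

/* worker.c -- hypothetical sketch of the spawned side; not part of the
 * attached example, which only spawns the system 'hostname' command. */
#include <mpi.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);   /* intercommunicator back to the parent job */

  if (parent != MPI_COMM_NULL) {
    char host[256];
    gethostname(host, sizeof(host));
    /* Report something back to rank 0 of the parent job. */
    MPI_Send(host, (int)strlen(host) + 1, MPI_CHAR, 0, 0, parent);
    MPI_Comm_disconnect(&parent);
  }

  MPI_Finalize();
  return 0;
}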


In any case, I distilled my example down as small as I could.  I've 
attached the C code of the master and the hostfile I'm using. Here's the 
output:


evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master ~/mpi/test_distributed.hostfile
[lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 1377173504

[lasarti:32022] *** Process received signal ***
[lasarti:32022] Signal: Segmentation fault (11)
[lasarti:32022] Signal code: Address not mapped (1)
[lasarti:32022] Failing at address: (nil)
[lasarti:32022] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
[lasarti:32022] [ 1] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
[lasarti:32022] [ 2] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
[lasarti:32022] [ 3] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
[lasarti:32022] [ 4] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
[lasarti:32022] [ 5] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
[lasarti:32022] [ 6] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
[lasarti:32022] [ 7] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
[lasarti:32022] [ 8] orted(main+0x47)[0x400877]
[lasarti:32022] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
[lasarti:32022] [10] orted[0x4008cb]
[lasarti:32022] *** End of error message ***

If I launch the compiled 'master' using mpirun, I don't get a segmentation 
fault, but the spawned processes don't seem to land on anything other than 
localhost, no matter what hostfile I give it.


For what it's worth, I fully expected to have to debug some path issues 
with the binary I wanted to launch via MPI_Comm_spawn once I ran this in a 
distributed setting, but at first glance this error doesn't appear to have 
anything to do with that.  I'm sure it's something silly I'm doing wrong, 
but given this error I don't really know how to debug it further.


Evan

P.S. I'm only including the zipped config.log, since the "ompi_info -v ompi 
full --parsable" command I got from http://www.open-mpi.org/community/help/ 
doesn't seem to work anymore.



#include "mpi.h"
#include <assert.h>

int main(int argc, char **argv) {
  int rc;
  MPI_Init(&argc, &argv);

  MPI_Info the_info;
  rc = MPI_Info_create(&the_info);
  assert(rc == MPI_SUCCESS);

  // I tried both (with appropriately different argv[1])...same result.
#if 1
  rc = MPI_Info_set(the_info, "hostfile", argv[1]);
  assert(rc == MPI_SUCCESS);
#else
  rc = MPI_Info_set(the_info, "host", argv[1]);
  assert(rc == MPI_SUCCESS);
#endif

  MPI_Comm the_group;
  rc = MPI_Comm_spawn("hostname",           /* command to spawn             */
                      MPI_ARGV_NULL,        /* no arguments                 */
                      8,                    /* maxprocs                     */
                      the_info,             /* info with hostfile/host key  */
                      0,                    /* root rank                    */
                      MPI_COMM_WORLD,       /* spawning communicator        */
                      &the_group,           /* resulting intercommunicator  */
                      MPI_ERRCODES_IGNORE); /* ignore per-process errors    */
  assert(rc == MPI_SUCCESS);

  MPI_Finalize();
  return 0;
}
And the hostfile (test_distributed.hostfile):

localhost
pachy1
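
(One small variant that might make failures easier to see, sketched here 
rather than tested: replacing the spawn call in the program above with one 
that collects an explicit error-code array instead of MPI_ERRCODES_IGNORE, 
sized to match maxprocs.)

  int errcodes[8];                        /* one entry per spawned process */
  rc = MPI_Comm_spawn("hostname", MPI_ARGV_NULL, 8, the_info,
                      0, MPI_COMM_WORLD, &the_group, errcodes);
  assert(rc == MPI_SUCCESS);
  for (int i = 0; i < 8; ++i)
    assert(errcodes[i] == MPI_SUCCESS);   /* each child's spawn status */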


config.log.tar.bz2
Description: application/bzip


[OMPI users] TCP BTL and virtual network interfaces, bug #3339

2015-01-26 Thread Kris Kersten
I'm working on an Ethernet cluster that uses virtual eth0:* interfaces on the 
compute nodes for IPMI and system management.  As described in Trac ticket 
#3339 (https://svn.open-mpi.org/trac/ompi/ticket/3339), this setup confuses 
the TCP BTL, which can't differentiate between the physical and virtual 
interfaces.  Verbose BTL output confirms this, showing attempted communication 
on both the physical and virtual IP addresses, followed by a hang.

Has there been any progress on this bug?  Or has anyone managed to figure out a 
workaround?

Thanks,
Kris



Re: [OMPI users] TCP BTL and virtual network interfaces, bug #3339

2015-01-26 Thread George Bosilca
Using mpirun --mca btl_tcp_if_exclude eth0 should fix your problem. Otherwise,
you can add it to your configuration file. Everything is extensively
described in the FAQ.
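
(To spell out the configuration-file option: MCA parameters can also go in a
per-user file, typically $HOME/.openmpi/mca-params.conf.  A minimal sketch,
assuming you really do want to exclude eth0 as above and keep the loopback
excluded as well:)

# $HOME/.openmpi/mca-params.conf -- per-user MCA parameter defaults (sketch)
btl_tcp_if_exclude = lo,eth0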

George.


Re: [OMPI users] TCP BTL and virtual network interfaces, bug #3339

2015-01-26 Thread Kris Kersten
There is no high-speed network, only eth0.  So MPI communication must be TCP 
over eth0.  I have tried forcing eth0 with --mca btl_tcp_if_include eth0, and 
also by specifying the eth0 subnet.  (Looking at the btl_tcp_component.c 
source, I see that the subnet is just translated back into the interface name, 
so these are equivalent.)

The problem is that including eth0 does not exclude the virtual interfaces 
(eth0:1 and eth0:5 in my case).  According to the bug, the Linux kernel assigns 
the same interface index to both the physical and virtual interfaces.  Because 
the TCP BTL uses this kernel index to choose the interface, it can't 
distinguish between the physical and virtual interfaces.  I can see this play 
out in the verbose TCP BTL output, with OOB and TCP communication happening 
over all three subnets rather than just the eth0 subnet.  This results in a hang.
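
(In case it helps anyone reproduce this: a small standalone sketch, not Open 
MPI code, that just prints the IPv4 interfaces getifaddrs() reports on a node.  
On my setup I'd expect the labeled aliases to show up as separate eth0:1 / 
eth0:5 entries next to eth0, even though they all sit on the same physical 
device.)

#include <stdio.h>
#include <ifaddrs.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* List the IPv4 interfaces the kernel reports, including labeled aliases
 * such as eth0:1 -- they appear as separate entries here even though they
 * belong to the same physical device. */
int main(void) {
    struct ifaddrs *ifap, *ifa;
    if (getifaddrs(&ifap) != 0) {
        perror("getifaddrs");
        return 1;
    }
    for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
            continue;
        char buf[INET_ADDRSTRLEN];
        struct sockaddr_in *sin = (struct sockaddr_in *)ifa->ifa_addr;
        inet_ntop(AF_INET, &sin->sin_addr, buf, sizeof(buf));
        printf("%-8s %s\n", ifa->ifa_name, buf);
    }
    freeifaddrs(ifap);
    return 0;
}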

I'm looking into using tun/tap interfaces for IPMI and system management 
instead, but I'm not sure that's feasible.  The bug report mentions using 
tun/tap for MPI, but I don't know what overhead that would add.  I was hoping 
that someone might have come up with some other solution.

Thanks,
Kris

