Hi,

I am using OpenMPI 1.8.4 on an Ubuntu 14.04 machine and five Ubuntu 12.04 machines. I am using ssh to launch MPI jobs, and I'm able to run simple commands like 'mpirun -np 8 --host localhost,pachy1 hostname' and get the expected output (pachy1 being an entry in my /etc/hosts file).

I started using MPI_Comm_spawn in my app with the intent of NOT calling mpirun to launch the program that calls MPI_Comm_spawn (my attempt at the singleton MPI_INIT pattern described in Section 10.5.2 of the MPI 3.0 standard). The app needs to launch an MPI job of a given size from a given hostfile, and that job needs to report some info back to the app, so MPI_Comm_spawn seemed like my best bet. The app will only rarely be used this way, which is why mpirun is not used to launch the parent of the MPI_Comm_spawn operation. This pattern works fine if the only entries in the hostfile are 'localhost'. However, if I add a host that isn't local, I get a segmentation fault from the orted process.
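
To make the launch mode concrete, this is roughly what the singleton case looks like on my end (the path is from my setup; MPI_Init in the directly-launched process is what forks the orted HNP you can see in the output below):

# No mpirun involved: the parent is started directly, MPI_Init initializes
# as a singleton, and MPI_Comm_spawn then launches the job described by the
# hostfile passed on the command line.
./master ~/mpi/test_distributed.hostfile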

In any case, I distilled my example down to as small as I could make it. I've attached the C code of the master and the hostfile I'm using. Here's the output:

evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master ~/mpi/test_distributed.hostfile
[lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid --report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca ess_base_jobid 1377173504
[lasarti:32022] *** Process received signal ***
[lasarti:32022] Signal: Segmentation fault (11)
[lasarti:32022] Signal code: Address not mapped (1)
[lasarti:32022] Failing at address: (nil)
[lasarti:32022] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
[lasarti:32022] [ 1] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
[lasarti:32022] [ 2] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
[lasarti:32022] [ 3] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
[lasarti:32022] [ 4] /opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
[lasarti:32022] [ 5] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
[lasarti:32022] [ 6] /opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
[lasarti:32022] [ 7] /opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
[lasarti:32022] [ 8] orted(main+0x47)[0x400877]
[lasarti:32022] [ 9] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
[lasarti:32022] [10] orted[0x4008cb]
[lasarti:32022] *** End of error message ***

If I launch 'master' using mpirun, I don't get a segmentation fault, but it doesn't seem to launch the spawned processes on anything other than localhost, no matter what hostfile I give it.
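
For reference, these are the kinds of mpirun invocations I mean (representative examples, not an exhaustive list of every variant I tried):

# Parent launched under mpirun instead of as a singleton; the hostfile is
# still passed to MPI_Comm_spawn via argv[1].
mpirun -np 1 ./master ~/mpi/test_distributed.hostfile

# Also tried handing the hostfile to mpirun itself.
mpirun -np 1 --hostfile ~/mpi/test_distributed.hostfile ./master ~/mpi/test_distributed.hostfile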

For what it's worth, I fully expected to have to debug some path issues for the binary I want to launch with MPI_Comm_spawn once I ran this distributed, but at first glance this error doesn't appear to have anything to do with that. I'm sure this is something silly I'm doing wrong, but given this error I don't really know how to debug it further.

Evan

P.S. I'm only including the zipped config.log, since the "ompi_info -v ompi full --parsable" command I got from http://www.open-mpi.org/community/help/ doesn't seem to work anymore.


#include "mpi.h"
#include <assert.h>

int main(int argc, char **argv) {
  int rc;
  MPI_Init(&argc, &argv);

  MPI_Info the_info;
  rc = MPI_Info_create(&the_info);
  assert(rc == MPI_SUCCESS);

  // I tried both (with appropriately different argv[1])...same result.
#if 1
  rc = MPI_Info_set(the_info, "hostfile", argv[1]);
  assert(rc == MPI_SUCCESS);
#else
  rc = MPI_Info_set(the_info, "host", argv[1]);
  assert(rc == MPI_SUCCESS);
#endif

  MPI_Comm the_group;
  rc = MPI_Comm_spawn("hostname",
                 MPI_ARGV_NULL,
                 8,
                 the_info,
                 0,
                 MPI_COMM_WORLD,
                 &the_group,
                 MPI_ERRCODES_IGNORE);
  assert(rc == MPI_SUCCESS);

  MPI_Finalize();
  return 0;
}
test_distributed.hostfile:

localhost
pachy1
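
In case it helps reproduce this: there's nothing unusual in how I build the parent. This is just my normal compile line, not anything specific to the problem:

mpicc -o master master.c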

Attachment: config.log.tar.bz2
