Hi,
I am using OpenMPI 1.8.4 on an Ubuntu 14.04 machine and 5 Ubuntu 12.04
machines. I am using ssh to launch MPI jobs and I'm able to run simple
programs like 'mpirun -np 8 --host localhost,pachy1 hostname' and get
the expected output (pachy1 being an entry in my /etc/hosts file).
I started using MPI_Comm_spawn in my app with the intent of NOT calling
mpirun to launch the program that calls MPI_Comm_spawn (my attempt at
the Singleton MPI_INIT pattern described in section 10.5.2 of the MPI 3.0
standard). The app needs to launch an MPI job of a given size from a
given hostfile, and the job needs to report some info back to the app,
so MPI_Comm_spawn seemed like my best bet. The app will only rarely be
used this way, which is why mpirun isn't used to launch the app that is
the parent in the MPI_Comm_spawn operation. This pattern works fine if
the only entry in the hostfile is 'localhost'. However, if I add a host
that isn't local, I get a segmentation fault from the orted process.
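(For context, the spawned job in the real app would be an MPI program that
reports back over the parent intercommunicator, roughly like the sketch
below; the names there are illustrative, not part of the reproducer, which
just spawns plain 'hostname' to keep things minimal.)

/* worker.c -- rough sketch of the kind of spawned program I eventually want;
 * illustrative only, not attached. */
#include "mpi.h"
#include <assert.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);   /* intercommunicator back to the app */
    assert(parent != MPI_COMM_NULL);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* "report some info back": here just the rank, sent to root 0 on the parent side */
    MPI_Send(&rank, 1, MPI_INT, 0, 0, parent);

    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}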
In any case, I distilled my example down as far as I could. I've
attached the C code of the master and the hostfile I'm using. Here's the
output:
evan@lasarti:~/devel/toy_progs/mpi_spawn$ ./master
~/mpi/test_distributed.hostfile
[lasarti:32020] [[21014,1],0] FORKING HNP: orted --hnp --set-sid
--report-uri 14 --singleton-died-pipe 15 -mca state_novm_select 1 -mca
ess_base_jobid 1377173504
[lasarti:32022] *** Process received signal ***
[lasarti:32022] Signal: Segmentation fault (11)
[lasarti:32022] Signal code: Address not mapped (1)
[lasarti:32022] Failing at address: (nil)
[lasarti:32022] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f07af039340]
[lasarti:32022] [ 1]
/opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc191_hwloc_get_obj_by_depth+0x32)[0x7f07aea227c2]
[lasarti:32022] [ 2]
/opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_hwloc_base_get_nbobjs_by_type+0x90)[0x7f07ae9f5430]
[lasarti:32022] [ 3]
/opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(orte_rmaps_rr_byobj+0x134)[0x7f07ab2fb154]
[lasarti:32022] [ 4]
/opt/openmpi-1.8.4/lib/openmpi/mca_rmaps_round_robin.so(+0x12c6)[0x7f07ab2fa2c6]
[lasarti:32022] [ 5]
/opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_rmaps_base_map_job+0x21a)[0x7f07af299f7a]
[lasarti:32022] [ 6]
/opt/openmpi-1.8.4/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0x6e4)[0x7f07ae9e7034]
[lasarti:32022] [ 7]
/opt/openmpi-1.8.4/lib/libopen-rte.so.7(orte_daemon+0xdff)[0x7f07af27a86f]
[lasarti:32022] [ 8] orted(main+0x47)[0x400877]
[lasarti:32022] [ 9]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f07aec84ec5]
[lasarti:32022] [10] orted[0x4008cb]
[lasarti:32022] *** End of error message ***
If I launch the master (compiled from master.c) using mpirun instead, I
don't get a segmentation fault, but it doesn't seem to launch processes
on anything other than localhost, no matter what hostfile I give it.
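(I.e., something along the lines of 'mpirun -np 1 --hostfile
~/mpi/test_distributed.hostfile ./master ~/mpi/test_distributed.hostfile';
the exact invocation varies but the behavior is the same.)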
For what it's worth, I fully expected to have to debug some path issues
with the binary I want to launch via MPI_Comm_spawn once I ran this
distributed, but at first glance this error doesn't appear to have
anything to do with that. I'm sure this is something silly I'm doing
wrong, but given this error I don't really know how to debug it further.
Evan
P.S. I'm only including the zipped config.log, since the "ompi_info -v
ompi full --parsable" command I got from
http://www.open-mpi.org/community/help/ doesn't seem to work anymore.
#include "mpi.h"
#include
int main(int argc, char **argv) {
int rc;
MPI_Init(&argc, &argv);
MPI_Info the_info;
rc = MPI_Info_create(&the_info);
assert(rc == MPI_SUCCESS);
// I tried both (with appropriately different argv[1])...same result.
#if 1
rc = MPI_Info_set(the_info, "hostfile", argv[1]);
assert(rc == MPI_SUCCESS);
#else
rc = MPI_Info_set(the_info, "host", argv[1]);
assert(rc == MPI_SUCCESS);
#endif
MPI_Comm the_group;
rc = MPI_Comm_spawn("hostname",
MPI_ARGV_NULL,
8,
the_info,
0,
MPI_COMM_WORLD,
&the_group,
MPI_ERRCODES_IGNORE);
assert(rc == MPI_SUCCESS);
MPI_Finalize();
return 0;
}
test_distributed.hostfile:

localhost
pachy1