Dear colleagues,
I need some help controlling where a process spawned with
MPI_Comm_spawn goes.  I am using openmpi-1.10 under CentOS 6.7.
My application is written in C, and I am running it on a RedBarn
system with a master node (hardware box) that connects to the
outside world and two other nodes connected to it via ethernet and
Infiniband.  There are two executable files, one (I'll call it
"Rank0Pgm") that expects to be rank 0 and does all the I/O and
the other ("RanknPgm") that only communicates via MPI messages.
There are two MPI_Comm_spawns that run just after MPI_Init and
an initial broadcast that shares some setup info, like this:
MPI_Comm_spawn("andmsg", argv, 1, MPI_INFO_NULL,
   hostid, commc, &commd, &sperr);
where "andmsg" is a program that needs to communicate with the
internet and with all the other processes via a new communicator
that will be called commd (the second spawn gets its own communicator
under a different name).
   When I run this program with no hostfile and an mpirun line
something like this on a node with 32 cores:
/usr/lib64/openmpi-1.10/bin/mpirun -n 1 Rank0Pgm : -n 28 RanknPgm \
   < InputFile
everything works fine.  I assume the spawns use 2 of the 3 available
cores that I did not ask the program to use.

Now I want to run on the full network, so I make a hostfile like this
(call it "nodes120"):
node0 slots=22 max-slots=22
n0003 slots=40 max-slots=40
n0004 slots=56 max-slots=56
where node0 has 24 cores and I am trying to leave room for my two
spawned processes.  The spawned processes have to be able to contact
the internet, so I make an MPI_INFO with MPI_Info_create and
MPI_Info_set(mpinfo, "host", "node0")
and change the MPI_INFO_NULL in the spawn calls to point to this
new MPI_Info.  (If I leave the MPI_INFO_NULL I get a different
error that is probably not of interest here.)

I then run mpirun as above, except with "--hostfile nodes120" and
"-n 116" after the colon, and I get this error:

"There are not enough slots available in the system to satisfy the 1
slots that were requested by the application:
  andmsg
Either request fewer slots for your application, or make more slots
available for use."

I get the same error with "max-slots=24" on the first line of the
hosts file.

Sorry for the length of all that.  Request for help:  How do I set
things up to run my rank 0 program and enough copies of RanknPgm to fill
all but some number of cores on the master hardware node, and all the
other rank n programs on the other hardware "nodes" (boxes of CPUs)?
[My application will do best with the default "by slot" scheduling.]

Suggestions much appreciated.  I am quite convinced my code is OK
in that it runs OK as shown above on one hardware box.  It also runs
on my laptop with 4 cores and "-n 3 RanknPgm", so I guess I don't
even really need to reserve cores for the two spawned processes.
I thought of using old-fashioned 'fork' but I really want the
extra communicators to keep asynchronous messages separated.
The documentation says overloading is OK by default, so maybe
something else is wrong here.

George Reeke



