Dear colleagues,

I need some help controlling where a process spawned with MPI_Comm_spawn goes. I am using openmpi-1.10 under CentOS 6.7. My application is written in C and runs on a RedBarn system with a master node (hardware box) that connects to the outside world and two other nodes connected to it via Ethernet and InfiniBand.

There are two executable files: one (I'll call it "Rank0Pgm") that expects to be rank 0 and does all the I/O, and the other ("RanknPgm") that only communicates via MPI messages. There are two MPI_Comm_spawn calls that run just after MPI_Init and an initial broadcast that shares some setup info, like this:

    MPI_Comm_spawn("andmsg", argv, 1, MPI_INFO_NULL, hostid, commc, &commd, &sperr);

where "andmsg" is a program that needs to communicate with the internet and with all the other processes via a new communicator that will be called commd (and another name for the other one).

When I run this program with no hostfile and an mpirun line something like this on a node with 32 cores:

    /usr/lib64/openmpi-1.10/bin/mpirun -n 1 Rank0Pgm : -n 28 RanknPgm \
        < InputFile

everything works fine. I assume the spawns use 2 of the 3 available cores that I did not ask the program to use.
Now I want to run on the full network, so I make a hostfile like this (call it "nodes120"):

    node0 slots=22 max-slots=22
    n0003 slots=40 max-slots=40
    n0004 slots=56 max-slots=56

where node0 has 24 cores and I am trying to leave room for my two spawned processes. The spawned processes have to be able to contact the internet, so I make an MPI_Info with MPI_Info_create and MPI_Info_set(mpinfo, "host", "node0") and change the MPI_INFO_NULL in the spawn calls to point to this new MPI_Info. (If I leave the MPI_INFO_NULL I get a different error that is probably not of interest here.)

Now I run mpirun as above, except with "--hostfile nodes120" and "-n 116" after the colon, and I get this error:

    There are not enough slots available in the system to satisfy the 1 slots
    that were requested by the application:
      andmsg

    Either request fewer slots for your application, or make more slots available
    for use.

I get the same error with "max-slots=24" on the first line of the hostfile.

Sorry for the length of all that. Request for help: how do I set things up to run my rank 0 program and enough copies of RanknPgm to fill all but some number of cores on the master hardware node, and all the other rank n programs on the other hardware "nodes" (boxes of CPUs)? [My application will do best with the default "by slot" scheduling.] Suggestions much appreciated.

I am quite convinced my code is OK, in that it runs as shown above on one hardware box. It also runs on my laptop with 4 cores and "-n 3 RanknPgm", so I guess I don't even really need to reserve cores for the two spawned processes. I thought of using old-fashioned 'fork', but I really want the extra communicators to keep asynchronous messages separated. The documentation says overloading is OK by default, so maybe something else is wrong here.

George Reeke
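For reference, the slot arithmetic behind the "nodes120" hostfile can be checked with a small standalone sketch in C. It is not Open MPI's actual mapper: it only assumes that by-slot placement fills each host in hostfile order up to its "slots=" value, and it ignores max-slots and oversubscription entirely. The names host_entry, fill_by_slot and report_free_slots are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One hostfile entry: the host name and its "slots=" value (hypothetical type). */
struct host_entry {
    const char *name;
    int slots;
};

/*
 * Simplified by-slot placement (an assumption, not Open MPI code): fill each
 * host in hostfile order up to its slot count before moving to the next one.
 * Stores the unused slots per host in free_slots[] and returns the number of
 * processes that did not fit (0 if all were placed).
 */
int fill_by_slot(const struct host_entry *hosts, int nhosts,
                 int nprocs, int *free_slots)
{
    assert(nhosts >= 0 && nprocs >= 0);
    int remaining = nprocs;
    for (int i = 0; i < nhosts; i++) {
        int used = remaining < hosts[i].slots ? remaining : hosts[i].slots;
        free_slots[i] = hosts[i].slots - used;
        remaining -= used;
    }
    return remaining;
}

/* Print how many slots stay free on each host for a given process count. */
void report_free_slots(const struct host_entry *hosts, int nhosts, int nprocs)
{
    int free_slots[64];
    assert(nhosts <= 64);
    int unplaced = fill_by_slot(hosts, nhosts, nprocs, free_slots);
    for (int i = 0; i < nhosts; i++)
        printf("%s: %d free slot(s)\n", hosts[i].name, free_slots[i]);
    printf("unplaced processes: %d\n", unplaced);
}
```

Calling report_free_slots with the three hostfile entries (22, 40 and 56 slots) and 117 processes (1 Rank0Pgm plus 116 RanknPgm) gives, under these assumptions, 0 free slots on node0, 0 on n0003 and 1 on n0004. If the real mapper behaves the same way, node0 would already be full when the spawn restricted to host "node0" is attempted, which may be related to the error above.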