On 08/16/2011 11:15 PM, Ralph Castain wrote:
I'm not finding a bug - the code looks clean. If I send you a patch, could you 
apply it, rebuild, and send me the resulting debug output?
Yes, I could do that. No problem.

thanks again, Simone

On Aug 16, 2011, at 10:18 AM, Ralph Castain wrote:

Smells like a bug - I'll take a look.


On Aug 16, 2011, at 9:10 AM, Simone Pellegrini wrote:

On 08/16/2011 02:11 PM, Ralph Castain wrote:
That should work, then. When you set the "host" property, did you give the same 
name as was in your machine file?

Debug options that might help:

-mca plm_base_verbose 5 -mca rmaps_base_verbose 5

You'll need to configure --enable-debug to get the output, but that should help 
tell us what is happening.
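For example, folded into the launch command from your earlier mail (same machinefile and scheduler, just as an illustration):

mpirun --np 2 --machinefile machinefile -mca plm_base_verbose 5 -mca rmaps_base_verbose 5 ./scheduler ...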
To be clear, here is the code I am using to spawn the MPI job:

// create the info object and request placement on 'hostname'
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", const_cast<char*>(hostname.c_str()));
LOG(ERROR) << hostname;
LOG(DEBUG) << "Invoking task ID '" << task_id << "': '" << exec_name << "'";

// spawn num_procs instances of exec_name; the new processes are reachable through intercomm
MPI_Comm_spawn(const_cast<char*>(exec_name.c_str()), cargs, num_procs,
               info, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

delete[] cargs;
MPI_Info_free(&info);
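(For reference, a minimal sketch of how the hostname string could be obtained from MPI itself; the helper name local_hostname is hypothetical, and whether the resulting name matches the entries in the machinefile is exactly the question raised above:)

#include <mpi.h>
#include <string>

// Sketch only: take the "host" value from the local processor name.
// MPI_Get_processor_name may return a fully qualified name, which would
// need to match the machinefile entries (e.g. "b01") to be usable here.
std::string local_hostname() {
    char name[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(name, &len);
    return std::string(name, len);
}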

Here is the corresponding log output. In this case MPI_Comm_spawn creates a job with 3 MPI processes. As you can see, it doesn't care about my "host" setting: it just goes ahead and maps the processes to node b05 and node b06, which are in my machinefile (the same one as before).

Is there any way to override this behaviour?

DEBUG 14628:R<0>  17:00:13] Spawning new MPI processes...
DEBUG 14628:R<0>  17:00:13] Serving event 'TASK_CREATED', (number of registered 
handlers: 1)
ERROR 14628:R<0>  17:00:13] b01
DEBUG 14628:R<0>  17:00:13] Invoking task ID '4': './simulator'
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive got message 
from [[34621,1],0]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job launch 
command
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: setting up job 
[34621,4]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:setup_job for job 
[34621,4]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created 
new proc [[34621,4],INVALID]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping 
proc in job [34621,4] to node b02
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b02 to 
map
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for 
job [34621,4] to node b02
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created 
new proc [[34621,4],INVALID]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping 
proc in job [34621,4] to node b01
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b01 to 
map
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for 
job [34621,4] to node b01
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created 
new proc [[34621,4],INVALID]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping 
proc in job [34621,4] to node b02
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for 
job [34621,4] to node b02
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:compute_usage
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons 
existing daemon [[34621,0],2] already launched
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons 
existing daemon [[34621,0],1] already launched
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: no new daemons to 
launch
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch_apps for job 
[34621,4]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched for 
job [34621,4]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
from daemon [[34621,0],0]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
completed processing
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
reissuing non-blocking recv
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
from daemon [[34621,0],1]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched 
for proc [[34621,4],1] from daemon [[34621,0],1]: pid 14646 state 2 exit 0
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
completed processing
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
reissuing non-blocking recv
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
from daemon [[34621,0],2]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched 
for proc [[34621,4],0] from daemon [[34621,0],2]: pid 9803 state 2 exit 0
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched 
for proc [[34621,4],2] from daemon [[34621,0],2]: pid 9804 state 2 exit 0
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
completed processing
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched all 
apps reported
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch wiring up iof
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch completed for 
job [34621,4]
[kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job [34621,4] 
launched

cheers, Simone P.

On Aug 16, 2011, at 5:09 AM, Simone Pellegrini wrote:

On 08/16/2011 12:30 PM, Ralph Castain wrote:
What version are you using?
Open MPI 1.4.3

On Aug 16, 2011, at 3:19 AM, Simone Pellegrini wrote:

Dear all,
I am developing a system to manage MPI tasks on top of MPI. The architecture is 
rather simple: I have a set of scheduler processes, each of which manages the 
resources of one node. The idea is to have one (or more) of these schedulers 
allocated on each node of a cluster and then create new MPI processes on demand 
as computation is needed. Allocation of processes is done using MPI_Comm_spawn.
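(For illustration, the spawned side of this scheme can reach its scheduler through the parent intercommunicator returned by MPI_Comm_get_parent; the sketch below is purely illustrative and is not my actual worker code:)

#include <mpi.h>
#include <cstdio>

// Illustrative worker sketch: a process created via MPI_Comm_spawn can
// retrieve the intercommunicator to its parent (the scheduler) and use it
// to receive work and report results.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm parent = MPI_COMM_NULL;
    MPI_Comm_get_parent(&parent);

    if (parent != MPI_COMM_NULL) {
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        std::printf("worker %d: spawned by a scheduler\n", rank);
        // ... receive a task over 'parent', compute, send results back ...
    }

    MPI_Finalize();
    return 0;
}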

The system now works fine on a single node, with the main scheduler launched by 
the following mpirun command:
mpirun --np 1 ./scheduler ...

Now, when I scale to multiple nodes, the problems with the default MPI behaviour 
start. For example, let's assume I have 2 nodes with 8 CPU cores each. I 
therefore set up a machinefile in the following way:

s01 slots=1
s02 slots=1

and start the node schedulers in the following way:
mpirun --np 2 --machinefile machinefile ./scheduler ...

This allocates the processes correctly; the problem starts when I invoke 
MPI_Comm_spawn. Basically, MPI_Comm_spawn also uses the information from the 
machinefile, so if 4 MPI processes are spawned, 2 are allocated on s01 and 2 on 
s02. What I want is for the spawned processes to always be allocated on the same 
node as the scheduler that spawns them.

I tried to do this by passing an MPI_Info object to the MPI_Comm_spawn routine, 
setting its "host" key to the hostname of the machine where the scheduler is 
running, but this didn't help.

Unfortunately there is very little documentation on this.

Thanks for the help,
Simone