I'm not finding a bug - the code looks clean. If I send you a patch, could you 
apply it, rebuild, and send me the resulting debug output?
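
For reference, the rebuild step would look roughly like this (a sketch only - 
the patch file name and install prefix are placeholders, and -p0 vs. -p1 
depends on how the diff is generated):

  cd openmpi-1.4.3
  patch -p1 < spawn-host-fix.patch
  ./configure --enable-debug --prefix=$HOME/ompi-debug
  make -j4 && make install
  # then rerun your scheduler with the plm/rmaps verbose flags as before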


On Aug 16, 2011, at 10:18 AM, Ralph Castain wrote:

> Smells like a bug - I'll take a look.
> 
> 
> On Aug 16, 2011, at 9:10 AM, Simone Pellegrini wrote:
> 
>> On 08/16/2011 02:11 PM, Ralph Castain wrote:
>>> That should work, then. When you set the "host" property, did you give the 
>>> same name as was in your machine file?
>>> 
>>> Debug options that might help:
>>> 
>>> -mca plm_base_verbose 5 -mca rmaps_base_verbose 5
>>> 
>>> You'll need to configure --enable-debug to get the output, but that should 
>>> help tell us what is happening.
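>>> 
>>> For example, adapting the mpirun command from your mail (just a sketch, 
>>> assuming the same machinefile):
>>> 
>>> mpirun --np 2 --machinefile machinefile \
>>>     -mca plm_base_verbose 5 -mca rmaps_base_verbose 5 ./scheduler ...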
>> To be clear, here is the code I am using to spawn the MPI job:
>> // create the info object
>> MPI_Info info;
>> MPI_Info_create(&info);
>> MPI_Info_set(info, "host", const_cast<char*>(hostname.c_str()));
>> LOG(ERROR) << hostname;
>> LOG(DEBUG) << "Invoking task ID '" << task_id <<"': '" << exec_name << "'";
>> 
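>> // spawn num_procs copies of exec_name; the info object above carries the
>> // "host" key that should constrain where the new processes are placed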
>> MPI_Comm_spawn( const_cast<char*>(exec_name.c_str()), cargs, num_procs,
>>                   info, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE );
>> 
>> delete[] cargs;
>> MPI_Info_free(&info);
>> 
>> and here is the log output. In this case MPI_Spawn creates a job with 3 MPI 
>> processes. As you can see, it doesn't care about my "host" setting; it just 
>> goes ahead and maps the processes to node b05 and node b06, which are in my 
>> machinefile (the same one as before).
>> 
>> Is there any way to override this behaviour?
>> 
>> DEBUG 14628:R<0> 17:00:13] Spawning new MPI processes...
>> DEBUG 14628:R<0> 17:00:13] Serving event 'TASK_CREATED', (number of 
>> registered handlers: 1)
>> ERROR 14628:R<0> 17:00:13] b01
>> DEBUG 14628:R<0> 17:00:13] Invoking task ID '4': './simulator'
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive got 
>> message from [[34621,1],0]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job launch 
>> command
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: setting up job 
>> [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:setup_job for job 
>> [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: 
>> created new proc [[34621,4],INVALID]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot 
>> mapping proc in job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b02 
>> to map
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc 
>> for job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: 
>> created new proc [[34621,4],INVALID]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot 
>> mapping proc in job [34621,4] to node b01
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b01 
>> to map
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc 
>> for job [34621,4] to node b01
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: 
>> created new proc [[34621,4],INVALID]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot 
>> mapping proc in job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc 
>> for job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:compute_usage
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons 
>> existing daemon [[34621,0],2] already launched
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons 
>> existing daemon [[34621,0],1] already launched
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: no new daemons to 
>> launch
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch_apps for 
>> job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched 
>> for job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> from daemon [[34621,0],0]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> completed processing
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> reissuing non-blocking recv
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> from daemon [[34621,0],1]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] 
>> plm:base:app_report_launched for proc [[34621,4],1] from daemon 
>> [[34621,0],1]: pid 14646 state 2 exit 0
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> completed processing
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> reissuing non-blocking recv
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> from daemon [[34621,0],2]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] 
>> plm:base:app_report_launched for proc [[34621,4],0] from daemon 
>> [[34621,0],2]: pid 9803 state 2 exit 0
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] 
>> plm:base:app_report_launched for proc [[34621,4],2] from daemon 
>> [[34621,0],2]: pid 9804 state 2 exit 0
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch 
>> completed processing
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched 
>> all apps reported
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch wiring up 
>> iof
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch completed 
>> for job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job 
>> [34621,4] launched
>> 
>> cheers, Simone P.
>>> 
>>> 
>>> On Aug 16, 2011, at 5:09 AM, Simone Pellegrini wrote:
>>> 
>>>> On 08/16/2011 12:30 PM, Ralph Castain wrote:
>>>>> What version are you using?
>>>> Open MPI 1.4.3
>>>> 
>>>>> 
>>>>> On Aug 16, 2011, at 3:19 AM, Simone Pellegrini wrote:
>>>>> 
>>>>>> Dear all,
>>>>>> I am developing a system to manage MPI tasks on top of MPI. The 
>>>>>> architecture is rather simple: I have a set of scheduler processes, each 
>>>>>> of which takes care of managing the resources of a node. The idea is to 
>>>>>> have 1 (or more) of those schedulers allocated on each node of a cluster 
>>>>>> and then create new MPI processes on demand as computation is needed. 
>>>>>> Allocation of processes is done using MPI_Spawn.
>>>>>> 
>>>>>> The system currently works fine on a single node, where I allocate the 
>>>>>> main scheduler using the following mpirun command:
>>>>>> mpirun --np 1 ./scheduler ...
>>>>>> 
>>>>>> Now, when I scale to multiple nodes, problems with the default MPI 
>>>>>> behaviour start. For example, let's assume I have 2 nodes with 8 CPU 
>>>>>> cores each. I therefore set up a machinefile in the following way:
>>>>>> 
>>>>>> s01 slots=1
>>>>>> s02 slots=1
>>>>>> 
>>>>>> and start the node schedulers in the following way:
>>>>>> mpirun --np 2 --machinefile machinefile ./scheduler ...
>>>>>> 
>>>>>> This allocates the processes correctly. The problem starts when I invoke 
>>>>>> MPI_Spawn: it also uses the information from the machinefile, so if 4 MPI 
>>>>>> processes are spawned, 2 are allocated on s01 and 2 on s02. What I want 
>>>>>> is to always allocate the spawned processes on the same node as the 
>>>>>> scheduler that spawns them.
>>>>>> 
>>>>>> I tried to do this by passing an MPI_Info object to the MPI_Spawn 
>>>>>> routine, with the "host" key set to the hostname of the machine where 
>>>>>> the scheduler is running, but this didn't help.
>>>>>> 
>>>>>> Unfortunately there is very little documentation on this.
>>>>>> 
>>>>>> Thanks for the help,
>>>>>> Simone
>> 
> 

