I'm not finding a bug - the code looks clean. If I send you a patch, could you apply it, rebuild, and send me the resulting debug output?
On Aug 16, 2011, at 10:18 AM, Ralph Castain wrote:
> Smells like a bug - I'll take a look.
> 
> 
> On Aug 16, 2011, at 9:10 AM, Simone Pellegrini wrote:
> 
>> On 08/16/2011 02:11 PM, Ralph Castain wrote:
>>> That should work, then. When you set the "host" property, did you give the
>>> same name as was in your machine file?
>>> 
>>> Debug options that might help:
>>> 
>>> -mca plm_base_verbose 5 -mca rmaps_base_verbose 5
>>> 
>>> You'll need to configure --enable-debug to get the output, but that should
>>> help tell us what is happening.
>> To be clear, here is the code I am using to spawn the MPI job:
>> 
>> // create the info object
>> MPI_Info info;
>> MPI_Info_create(&info);
>> MPI_Info_set(info, "host", const_cast<char*>(hostname.c_str()));
>> LOG(ERROR) << hostname;
>> LOG(DEBUG) << "Invoking task ID '" << task_id << "': '" << exec_name << "'";
>> 
>> MPI_Comm_spawn(const_cast<char*>(exec_name.c_str()), cargs, num_procs,
>>                info, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>> 
>> delete[] cargs;
>> MPI_Info_free(&info);
>> 
>> and here is the log message. In this case MPI_Spawn creates a job with 3 MPI
>> processes. As you can see, MPI_Spawn doesn't care about my "host" setting; it
>> just goes ahead and maps the processes to nodes b05 and b06, which are in my
>> machinefile (which is the same as before).
>> 
>> Is there any way to override this behaviour?
>> 
>> DEBUG 14628:R<0> 17:00:13] Spawning new MPI processes...
>> DEBUG 14628:R<0> 17:00:13] Serving event 'TASK_CREATED', (number of registered handlers: 1)
>> ERROR 14628:R<0> 17:00:13] b01
>> DEBUG 14628:R<0> 17:00:13] Invoking task ID '4': './simulator'
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive got message from [[34621,1],0]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job launch command
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: setting up job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:setup_job for job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created new proc [[34621,4],INVALID]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping proc in job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b02 to map
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created new proc [[34621,4],INVALID]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping proc in job [34621,4] to node b01
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b01 to map
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for job [34621,4] to node b01
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created new proc [[34621,4],INVALID]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping proc in job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for job [34621,4] to node b02
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:compute_usage
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons existing daemon [[34621,0],2] already launched
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons existing daemon [[34621,0],1] already launched
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: no new daemons to launch
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch_apps for job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched for job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch from daemon [[34621,0],0]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch completed processing
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch reissuing non-blocking recv
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch from daemon [[34621,0],1]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched for proc [[34621,4],1] from daemon [[34621,0],1]: pid 14646 state 2 exit 0
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch completed processing
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch reissuing non-blocking recv
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch from daemon [[34621,0],2]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched for proc [[34621,4],0] from daemon [[34621,0],2]: pid 9803 state 2 exit 0
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched for proc [[34621,4],2] from daemon [[34621,0],2]: pid 9804 state 2 exit 0
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch completed processing
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched all apps reported
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch wiring up iof
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch completed for job [34621,4]
>> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job [34621,4] launched
>> 
>> cheers, Simone P.
>>> 
>>> 
>>> On Aug 16, 2011, at 5:09 AM, Simone Pellegrini wrote:
>>> 
>>>> On 08/16/2011 12:30 PM, Ralph Castain wrote:
>>>>> What version are you using?
>>>> OpenMPI 1.4.3
>>>> 
>>>>> 
>>>>> On Aug 16, 2011, at 3:19 AM, Simone Pellegrini wrote:
>>>>> 
>>>>>> Dear all,
>>>>>> I am developing a system to manage MPI tasks on top of MPI. The architecture
>>>>>> is rather simple: I have a set of scheduler processes which take care of
>>>>>> managing the resources of a node. The idea is to have one (or more) of these
>>>>>> schedulers allocated on each node of a cluster and then create new MPI
>>>>>> processes (on demand) as computation is needed. Allocation of processes is
>>>>>> done using MPI_Spawn.
>>>>>> 
>>>>>> The system now works fine on a single node by allocating the main scheduler
>>>>>> using the following mpirun command:
>>>>>> 
>>>>>> mpirun --np 1 ./scheduler ...
>>>>>> 
>>>>>> Now when I scale to multiple nodes, problems with the default MPI behaviour
>>>>>> start. For example, let's assume I have 2 nodes with 8 CPU cores each. I
>>>>>> therefore set up a machine file in the following way:
>>>>>> 
>>>>>> s01 slots=1
>>>>>> s02 slots=1
>>>>>> 
>>>>>> and start the node schedulers in the following way:
>>>>>> 
>>>>>> mpirun --np 2 --machinefile machinefile ./scheduler ...
>>>>>> 
>>>>>> This allocates the processes correctly; the problem starts when I invoke
>>>>>> MPI_Spawn.
>>>>>> Basically, MPI_Spawn also uses the information from the machinefile: if 4
>>>>>> MPI processes are spawned, 2 are allocated on s01 and 2 on s02. What I want
>>>>>> is to always allocate the processes on the same node.
>>>>>> 
>>>>>> I tried to do this by specifying an MPI_Info object which is then passed to
>>>>>> the MPI_Spawn routine. I tried setting the "host" property to the hostname
>>>>>> of the machine where the scheduler is running, but this didn't help.
>>>>>> 
>>>>>> Unfortunately there is very little documentation on this.
>>>>>> 
>>>>>> Thanks for the help,
>>>>>> Simone
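
For reference, here is roughly what the spawn code quoted above boils down to as a standalone program. This is only a sketch of the pattern under discussion, not code taken from the thread: the child executable "./simulator", the fallback host "b01", and the count of 3 processes are illustrative placeholders, and whether the "host" info key is actually honoured for the spawned job is exactly the open question.

    // Minimal sketch (placeholders noted above): spawn num_procs copies of a child
    // executable and ask, via the MPI_Info "host" key, that they land on one node.
    #include <mpi.h>
    #include <iostream>
    #include <string>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        // Placeholder values; in the real scheduler these come from the task request.
        std::string hostname  = (argc > 1) ? argv[1] : "b01";  // node we want the children on
        std::string exec_name = "./simulator";                 // hypothetical child executable
        int num_procs = 3;

        // Carry the placement hint in an info object, as in the quoted code.
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", const_cast<char*>(hostname.c_str()));

        // Spawn the children from this single parent (root = 0 on MPI_COMM_SELF).
        MPI_Comm intercomm;
        MPI_Comm_spawn(const_cast<char*>(exec_name.c_str()), MPI_ARGV_NULL, num_procs,
                       info, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        std::cout << "requested host '" << hostname << "' for " << num_procs
                  << " spawned processes" << std::endl;

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }

Compiled with mpicxx and launched under the same mpirun/machinefile setup described above, this isolates the placement question from the rest of the scheduler and should make it easier to compare the rmaps verbose output with and without the "host" key.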