On 27.08.2011 at 16:35, Ralph Castain wrote:

> 
> On Aug 27, 2011, at 8:28 AM, Rayson Ho wrote:
> 
>> On Sat, Aug 27, 2011 at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> OMPI has no way of knowing that you will turn the node on at some future
>>> point. All it can do is try to launch the job on the provided node, which
>>> fails because the node doesn't respond.
>>> You'll have to come up with some scheme for telling the node to turn on in
>>> anticipation of starting a job - a resource manager is typically used for
>>> that purpose.
>> 
>> Hi Ralph,
>> 
>> Are you referring to a specific resource manager/batch system?? AFAIK,
>> no common batch systems support MPI_Spawn properly...
> 
> Usually, resource managers "turn on" nodes when allocating them for use by a 
> job - SLURM is an example that does this. Helps the cluster save energy when 
> not in use. I believe almost all the RM's out there now support this to some 
> degree.
> 
> Support for MPI_Comm_spawn (i.e., dynamically allocating new nodes as 
> required by a running MPI job and turning them on) doesn't exist (to my 
> knowledge) at this time, mostly because this MPI feature is so rarely used. 
> I've helped (integrating from the OMPI side) several groups that were adding 
> such support to various RM's (typically Torque), but I don't think that code 
> has hit a distribution yet.

Can you please point me to these projects?

I have always wondered how to phrase this in a submission request. It would
need to specify something like: I need 2 cores for 2 hrs, then 1 core for 30
minutes, and finally 4 cores for 6 hrs. That already touches on features of a
real-time queuing system.
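
To make the example concrete, here is a minimal sketch of such a phased request
as plain data, with the arithmetic for what it would cost. The names `Phase` and
`summarize` are hypothetical and not part of any batch system's API. This only
illustrates the bookkeeping a scheduler would have to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    hours: float  # wall-clock duration of this phase
    cores: int    # cores needed during this phase


def summarize(phases):
    """Return (total walltime in hours, peak cores, core-hours actually used)."""
    walltime = sum(p.hours for p in phases)
    peak = max(p.cores for p in phases)
    core_hours = sum(p.hours * p.cores for p in phases)
    return walltime, peak, core_hours


# The request from the text: 2 hrs on 2 cores, 30 minutes on 1 core, 6 hrs on 4 cores.
request = [Phase(2.0, 2), Phase(0.5, 1), Phase(6.0, 4)]
walltime, peak, used = summarize(request)

# A fixed-size allocation would have to hold the peak core count for the whole run.
static_reservation = walltime * peak

print(f"walltime={walltime} h, peak={peak} cores, used={used} core-hours, "
      f"static reservation={static_reservation} core-hours")
```

With a fixed allocation the job would hold 4 cores for the full 8.5 hrs (34
core-hours) while using only 28.5, which is the gap a phase-aware request would
close.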

-- Reuti



>> Rayson
>> 
>> 
>> 
>> 
>>> On Aug 27, 2011, at 6:58 AM, Rafael Braga wrote:
>>> 
>>> I would like to know how to add nodes during a job execution.
>>> Now my hostfile has the node 10.0.0.23 that is off,
>>> I would start this node during the execution so that the job can use it
>>> When I run the command:
>>> 
>>> mpirun -np 2 -hostfile /tmp/hosts application
>>> 
>>> the following message appears:
>>> 
>>> ssh: connect to host 10.0.0.23 port 22: No route to host
>>> --------------------------------------------------------------------------
>>> A daemon (pid 10773) died unexpectedly with status 255 while attempting
>>> to launch so we are aborting.
>>> 
>>> There may be more information reported by the environment (see above).
>>> 
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>> 
>>> thanks a lot,
>>> 
>>> --
>>> Rafael Braga
>>> 
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>> 
>> 
>> 
>> -- 
>> Rayson
>> 
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
> 
> 

