On 27 Aug 2011, at 16:35, Ralph Castain wrote:

>
> On Aug 27, 2011, at 8:28 AM, Rayson Ho wrote:
>
>> On Sat, Aug 27, 2011 at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> OMPI has no way of knowing that you will turn the node on at some future
>>> point. All it can do is try to launch the job on the provided node, which
>>> fails because the node doesn't respond.
>>> You'll have to come up with some scheme for telling the node to turn on in
>>> anticipation of starting a job - a resource manager is typically used for
>>> that purpose.
>>
>> Hi Ralph,
>>
>> Are you referring to a specific resource manager/batch system?? AFAIK,
>> no common batch systems support MPI_Spawn properly...
>
> Usually, resource managers "turn on" nodes when allocating them for use by a
> job - SLURM is an example that does this. Helps the cluster save energy when
> not in use. I believe almost all the RM's out there now support this to some
> degree.
>
> Support for MPI_Comm_spawn (i.e., dynamically allocating new nodes as
> required by a running MPI job and turning them on) doesn't exist (to my
> knowledge) at this time, mostly because this MPI feature is so rarely used.
> I've helped (integrating from the OMPI side) several groups that were adding
> such support to various RM's (typically Torque), but I don't think that code
> has hit a distribution yet.

Can you please point me to these projects? I was always wondering how to
phrase it in a submission request. It would need to specify: I need 2 cores
for 2 hrs, then 1 core for 30 minutes, and finally 4 cores for 6 hrs, which
already targets features of a real-time queuing system.

-- Reuti

>> Rayson
>>
>>
>>> On Aug 27, 2011, at 6:58 AM, Rafael Braga wrote:
>>>
>>> I would like to know how to add nodes during a job execution.
>>> Now my hostfile has the node 10.0.0.23 that is off,
>>> I would start this node during the execution so that the job can use it
>>> When I run the command:
>>>
>>> mpirun -np 2 -hostfile /tmp/hosts application
>>>
>>> the following message appears:
>>>
>>> ssh: connect to host 10.0.0.23 port 22: No route to host
>>> --------------------------------------------------------------------------
>>> A daemon (pid 10773) died unexpectedly with status 255 while attempting
>>> to launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> thanks a lot,
>>>
>>> --
>>> Rafael Braga
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Rayson
>>
>> ==================================================
>> Open Grid Scheduler - The Official Open Source Grid Engine
>> http://gridscheduler.sourceforge.net/
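
For readers who hit the same "No route to host" abort: since mpirun only tries to launch on the hosts it is given, one workaround short of a resource manager is to probe the hostfile entries before launching and pass mpirun only the nodes that answer. The following is a minimal sketch of that idea, not anything Open MPI provides. The function names (parse_hostfile, is_reachable, split_hosts) are hypothetical, and it assumes that a successful TCP connection to port 22 is a good enough sign that ssh-based launch will work.

```python
import socket


def parse_hostfile(text):
    """Return host names from hostfile text, skipping blanks and # comments."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            # Keep only the host name; drop suffixes such as "slots=2".
            hosts.append(line.split()[0])
    return hosts


def is_reachable(host, port=22, timeout=3.0):
    """True if a TCP connection to host:port succeeds within the timeout.

    Assumption: an open ssh port means the node is up and launchable.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def split_hosts(hosts, probe=is_reachable):
    """Partition hosts into (up, down) lists using the probe callable."""
    up, down = [], []
    for host in hosts:
        (up if probe(host) else down).append(host)
    return up, down
```

One possible use is to read /tmp/hosts, write the "up" list to a second file (say /tmp/hosts.up, a name made up here), and run "mpirun -np 2 -hostfile /tmp/hosts.up application". This only avoids the launch failure; it does not add 10.0.0.23 to a job that is already running, which is the MPI_Comm_spawn case discussed above.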