On Sun, Apr 3, 2011 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Apr 3, 2011, at 9:34 AM, Laurence Marks wrote:
>
>> On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote:
>>>
>>>> Let me expand on this slightly (in response to Ralph Castain's posting
>>>> -- I had digest mode set). As currently constructed a shellscript in
>>>> Wien2k (www.wien2k.at) launches a series of tasks using
>>>>
>>>> ($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >> .time1_$loop &
>>>>
>>>> where the standard setting for "remote" is "ssh", remotemachine is the
>>>> appropriate host, "t" is "time" and "ttt" is a concatenation of
>>>> commands, for instance when using 2 cores on one node for Task1, 2
>>>> cores on 2 nodes for Task2 and 2 cores on 1 node for Task3
>>>>
>>>> Task1:
>>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1
>>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>>> Task2:
>>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2
>>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
>>>> Task3:
>>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3
>>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def
>>>>
>>>> This is a stable script; it works under SGI, Linux, mvapich and many
>>>> other environments using ssh or rsh (although I have never used it
>>>> with rsh myself). It is general purpose, i.e. it will run just 1 task
>>>> on 8x8 nodes/cores, or 8 parallel tasks on 8 nodes each with 8 cores,
>>>> or any other scatter of nodes/cores.
>>>>
>>>> According to some, ssh is becoming obsolete within supercomputers, and
>>>> the "replacement" is pbsdsh, at least under Torque.
>>>
>>> Somebody is playing an April Fools joke on you. The majority of 
>>> supercomputers use ssh as their sole launch mechanism, and I have seen no 
>>> indication that anyone intends to change that situation. That said, Torque 
>>> is certainly popular and a good environment.
>>
>> Alas, it is not an April Fools joke; to quote from
>> http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml
>> "pbsdsh can be used as a replacement for an ssh or rsh command which
>> invokes a user command on a worker machine. Some applications expect
>> the availability of rsh or ssh in order to invoke parts of the
>> computation on the sister worker nodes of the main worker. Using
>> pbsdsh instead is necessary on this cluster because direct use of ssh
>> or rsh is not allowed, for accounting and security reasons."
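
For what it is worth, the substitution they have in mind would look roughly
like the sketch below. This is untested; it assumes Torque's pbsdsh accepts
-h <hostname> and that the command has to be wrapped in a shell, since pbsdsh
spawns the program directly rather than through one:

  # Untested sketch: pbsdsh in place of "ssh $remotemachine ..." inside a
  # Torque job, using the same variables as the launch line quoted earlier.
  (pbsdsh -h $remotemachine bash -c \
      "cd $PWD; $t $ttt; rm -f .lock_$lockfile") >> .time1_$loop &

Whether LD_LIBRARY_PATH and PATH propagate the same way as with ssh would
need checking.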
>
> Ah, but that is an administrative decision by a single organization - not the 
> global supercomputer industry. :-)
>
>>
>> I am not using that computer. A scenario that I have come across is
>> that when an msub job is killed because it has exceeded its walltime,
>> MPI tasks spawned by ssh may not be terminated because (so I am told)
>> Torque does not know about them.
>
> Not true with OMPI. Torque will kill mpirun, which will in turn cause all MPI 
> procs to die. Yes, it's true that Torque won't know about the MPI procs 
> itself. However, OMPI is designed such that termination of mpirun by the 
> resource manager will cause all apps to die.
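
(I follow that for the simple case where mpirun itself runs inside the
Torque job, e.g. a hypothetical job script like the sketch below; the batch
system can then signal mpirun directly and OMPI tears down its remote ranks.
The Wien2k case above is different.)

  #!/bin/bash
  #PBS -l nodes=2:ppn=2,walltime=00:10:00
  # Hypothetical plain Torque job script: mpirun runs under Torque's control,
  # so when the walltime expires Torque signals mpirun, and Open MPI then
  # terminates the remote ranks it started. ./my_mpi_app is a stand-in name.
  cd $PBS_O_WORKDIR
  mpirun -np 4 ./my_mpi_app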

How does Torque on NodeA know that an mpirun launched on NodeB by ssh
should be killed? As far as I can see, OMPI is designed for all mpirun
invocations to be started from the same node, not for distributed MPI
jobs launched independently from multiple nodes. I am not certain that
killing the ssh on NodeA will in fact terminate an mpirun launched on
NodeB (e.g. by "ssh NodeB mpirun AAA...") with OMPI. Certainly qdel
does not do this; under Moab, canceljob might (but I suspect not).
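
A crude way to check what actually happens is something along these lines
(the hostname and test binary are made up); if the last command still lists
an mpirun, it was orphaned:

  # Untested sketch: launch an mpirun on NodeB via ssh, kill the local ssh
  # (roughly what walltime expiry or qdel does to our processes), then look
  # for survivors on NodeB.
  ssh NodeB "cd $PWD; mpirun -np 2 ./hello_mpi" &   # ./hello_mpi: any MPI test
  sshpid=$!
  sleep 30
  kill $sshpid
  sleep 5
  ssh NodeB "pgrep -fl mpirun"    # anything listed here survived the kill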

>
>>
>>>

-- 
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Gyorgi
