On Sun, Apr 3, 2011 at 11:41 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Apr 3, 2011, at 9:34 AM, Laurence Marks wrote:
>
>> On Sun, Apr 3, 2011 at 9:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> On Apr 3, 2011, at 8:14 AM, Laurence Marks wrote:
>>>
>>>> Let me expand on this slightly (in response to Ralph Castain's posting
>>>> -- I had digest mode set). As currently constructed a shellscript in
>>>> Wien2k (www.wien2k.at) launches a series of tasks using
>>>>
>>>> ($remote $remotemachine "cd $PWD;$t $ttt;rm -f .lock_$lockfile[$p]") >> .time1_$loop &
>>>>
>>>> where the standard setting for "remote" is "ssh", remotemachine is the
>>>> appropriate host, "t" is "time" and "ttt" is a concatenation of
>>>> commands, for instance when using 2 cores on one node for Task1, 2
>>>> cores on 2 nodes for Task2 and 2 cores on 1 node for Task3
>>>>
>>>> Task1:
>>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine1
>>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_1.def
>>>> Task2:
>>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 4 -machinefile .machine2
>>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_2.def
>>>> Task3:
>>>> mpirun -v -x LD_LIBRARY_PATH -x PATH -np 2 -machinefile .machine3
>>>> /home/lma712/src/Virgin_10.1/lapw1Q_mpi lapw1Q_3.def
>>>>
>>>> This is a stable script, works under SGI, linux, mvapich and many
>>>> others using ssh or rsh (although I've never myself used it with rsh).
>>>> It is general purpose, i.e. will work to run just 1 task on 8x8
>>>> nodes/cores or 8 parallel tasks on 8 nodes all with 8 cores or any
>>>> scatter of nodes/cores.
>>>>
>>>> According to some, ssh is becoming obsolete within supercomputers and
>>>> the "replacement" is pbsdsh at least under Torque.
>>>
>>> Somebody is playing an April Fools joke on you. The majority of
>>> supercomputers use ssh as their sole launch mechanism, and I have seen no
>>> indication that anyone intends to change that situation.
>>> That said, Torque is certainly popular and a good environment.
>>
>> Alas, it is not an April Fools joke, to quote from
>> http://www.bear.bham.ac.uk/bluebear/pbsdsh.shtml
>> "pbsdsh can be used as a replacement for an ssh or rsh command which
>> invokes a user command on a worker machine. Some applications expect
>> the availability of rsh or ssh in order to invoke parts of the
>> computation on the sister worker nodes of the main worker. Using
>> pbsdsh instead is necessary on this cluster because direct use of ssh
>> or rsh is not allowed, for accounting and security reasons."
>
> Ah, but that is an administrative decision by a single organization - not the
> global supercomputer industry. :-)
>
>>
>> I am not using that computer. A scenario that I have come across is
>> that when a msub job is killed because it has exceeded its walltime
>> MPI tasks spawned by ssh may not be terminated because (so I am told)
>> Torque does not know about them.
>
> Not true with OMPI. Torque will kill mpirun, which will in turn cause all MPI
> procs to die. Yes, it's true that Torque won't know about the MPI procs
> itself. However, OMPI is designed such that termination of mpirun by the
> resource manager will cause all apps to die.

How does Torque on NodeA know that an MPI job launched on NodeB by ssh
should be killed? OMPI is designed (from what I can see) for all mpirun
commands to be started from the same node, not for distributed MPI jobs
launched independently from multiple nodes. I am not certain that killing
the ssh on NodeA will in fact terminate an MPI job launched on NodeB (i.e.
by ssh NodeB mpirun AAA...) with OMPI. Certainly qdel does not do this; in
Moab, canceljob might (but I suspect not).

--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Chair, Commission on Electron Crystallography of IUCR
www.numis.northwestern.edu/
Research is to see what everybody else has seen, and to think what
nobody else has thought
Albert Szent-Györgyi
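
To make the NodeA/NodeB concern concrete, here is a minimal bash sketch of
one script-level workaround. It is not something Torque or OMPI does on
its own, and it is not part of Wien2k. It assumes the batch system sends
SIGTERM to the job script before SIGKILL, that the machine files list one
host per line, and that pkill is available on the remote nodes. The names
REMOTE, PATTERN, DRY_RUN, unique_hosts and cleanup_remote are hypothetical,
chosen only for illustration.

```shell
#!/bin/bash
# Sketch only: forward a termination request to tasks started on other nodes.
# REMOTE, PATTERN and DRY_RUN are illustrative names, not Wien2k settings.

REMOTE=${REMOTE:-ssh}            # remote launcher, analogous to Wien2k's $remote
PATTERN=${PATTERN:-lapw1Q_mpi}   # assumed process name, taken from the example tasks

# Print each distinct host named in the given machine files, one per line.
# Blank lines and lines starting with '#' are skipped.
unique_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$@" | sort -u
}

# Ask every host in the machine files to kill processes named exactly PATTERN.
# With DRY_RUN set, only print the commands that would be run.
cleanup_remote() {
    local host
    for host in $(unique_hosts "$@"); do
        if [ -n "${DRY_RUN:-}" ]; then
            echo "$REMOTE $host pkill -x $PATTERN"
        else
            # </dev/null stops the remote launcher from reading the script's stdin
            $REMOTE "$host" "pkill -x $PATTERN" </dev/null
        fi
    done
}

# Possible use in a job script, registered before the tasks are launched:
#   trap 'cleanup_remote .machine1 .machine2 .machine3; exit 143' TERM
```

Note that this kills every process of that name owned by the user on each
listed host, so it is only safe if no other job of the same user shares
those nodes. Running with DRY_RUN=1 first shows what would be executed.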