No, I haven't seen that - if you can provide an example, we can take a look at it.

Thanks
Ralph

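For anyone chasing the same symptom, a quick way to confirm that a captured output file really contains the stray NUL bytes described in the quoted report below, and to strip them before grepping - a sketch using only standard POSIX tools; the file name run.out is just a placeholder:

# report whether the captured output changes once NUL bytes are removed
tr -d '\000' < run.out | cmp -s - run.out || echo "run.out contains NUL bytes"

# strip the NULs so greps on the output are no longer disturbed
tr -d '\000' < run.out > run.clean.out
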
On 6/19/08 8:15 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote:

> Ralph, another issue perhaps you can shed some light on.
>
> When launching with orterun, we sometimes see null characters in the stdout output. These do not show up on a terminal, but when piped to a file they are visible in an editor. They can also show up in the middle of a line, and so can interfere with greps on the output, etc.
>
> Have you seen this before? I am working on a simple test case, but unfortunately have not found one that is deterministic so far.
>
> Thanks,
> Federico
>
> -----Original Message-----
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
>
> I can believe 1.2.x has problems in that regard. Some of that has nothing to do with slurm and reflects internal issues with 1.2.
>
> We have made it much more resistant to those problems in the upcoming 1.3 release, but there is no plan to retrofit those changes to 1.2. Part of the problem was that we weren't using the --kill-on-bad-exit flag when we called srun internally, which has been fixed for 1.3.
>
> BTW: we actually do use srun to launch the daemons - we just call it internally from inside orterun. The only real difference is that we use orterun to set up the cmd line and then tell the daemons what they need to do. The issues you are seeing relate to our ability to detect that srun has failed, and/or that one or more daemons have failed to launch or do something they were supposed to do. The 1.2 system has problems in that regard, which was one motivation for the 1.3 overhaul.
>
> I would argue that slurm allowing us to attempt to launch on a no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we use srun to launch the daemons - the only reason we hang is that srun is not returning with an error. I've seen this on other systems as well, but have no real answer - if slurm doesn't indicate an error has occurred, I'm not sure what I can do about it.
>
> We are unlikely to use srun to directly launch jobs (i.e., to have slurm directly launch the job from an srun cmd line without mpirun) anytime soon. It isn't clear there is enough benefit to justify the rather large effort, especially considering what would be required to maintain scalability. Decisions on all that are still pending, though, which means any significant change in that regard wouldn't be released until sometime next year.
>
> Ralph
>
> On 6/17/08 10:39 AM, "Sacerdoti, Federico" <federico.sacerd...@deshawresearch.com> wrote:
>
>> Ralph,
>>
>> I was wondering what the status of this feature was (using srun to launch orted daemons)? I have two new bug reports to add from our experience using orterun from 1.2.6 on our 4000-CPU infiniband cluster.
>>
>> 1. Orterun will happily hang if it is asked to run on an invalid slurm job, e.g. if the job has exceeded its timelimit. This would be trivially fixed if you used srun to launch, as they would fail with non-zero exit codes.
>>
>> 2. A very simple orterun invocation hangs instead of exiting with an error. In this case the executable does not exist, and we would expect orterun to exit non-zero. This has caused headaches with some workflow management scripts that automatically start jobs.
>>
>> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
>> [hang]
>>
>> orterun dummy-binary-I-dont-exist
>> [hang]
>>
>> Thanks,
>> Federico
>>
>> -----Original Message-----
>> From: Sacerdoti, Federico
>> Sent: Friday, March 21, 2008 5:41 PM
>> To: 'Open MPI Users'
>> Subject: RE: [OMPI users] SLURM and OpenMPI
>>
>> Ralph wrote:
>> "I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility."
>>
>> Ralph, we wrote a launcher for mvapich that uses srun to launch but keeps tight control of where processes are started. The way we did it was to force srun to launch a single process on a particular node.
>>
>> The launcher calls many of these:
>> srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
>>
>> Hope this helps (and we are looking forward to a tighter orterun/slurm integration, as you know).
>>
>> Regards,
>> Federico
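
For concreteness, a minimal sketch of that per-process launching pattern, assuming a plain rank-to-host map file (one hostname per line, line N naming the host for rank N). The file name, the a.out binary, and the use of SLURM_JOB_ID are illustrative only; a real launcher would also have to hand each process its rank and MPI wire-up information, which is omitted here:

#!/bin/bash
# Launch one single-task srun per line of rankmap.txt, pinning each task to its host,
# exactly in the style described above: srun --jobid ... -N 1 -n 1 -w <host> CMD ARGS
while read -r host; do
  srun --jobid "$SLURM_JOB_ID" -N 1 -n 1 -w "$host" ./a.out &
done < rankmap.txt
wait   # do not exit until every launched process has finished
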
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Thursday, March 20, 2008 6:41 PM
>> To: Open MPI Users <us...@open-mpi.org>
>> Cc: Ralph Castain
>> Subject: Re: [OMPI users] SLURM and OpenMPI
>>
>> Hi there
>>
>> I am no slurm expert. However, it is our understanding that SLURM_TASKS_PER_NODE means the number of slots allocated to the job, not the number of tasks to be executed on each node. So the 4(x2) tells us that we have 4 slots on each of two nodes to work with. You got 4 slots on each node because you used the -N option, which told slurm to assign all slots on that node to this job - I assume you have 4 processors on your nodes. OpenMPI parses that string to get the allocation, then maps the number of specified processes against it.
>>
>> It is possible that the interpretation of SLURM_TASKS_PER_NODE is different when used to allocate as opposed to directly launch processes. Our typical usage is for someone to do:
>>
>> srun -N 2 -A
>> mpirun -np 2 helloworld
>>
>> In other words, we use srun to create an allocation, and then run mpirun separately within it.
>>
>> I am therefore unsure what the "-n 2" will do here. If I believe the documentation, it would seem to imply that srun will attempt to launch two copies of "mpirun -np 2 helloworld", yet your output doesn't seem to support that interpretation. It would appear that the "-n 2" is being ignored and only one copy of mpirun is being launched. I'm no slurm expert, so perhaps that interpretation is incorrect.
>>
>> Assuming that the -n 2 is ignored in this situation, your command line:
>>
>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>
>> will cause mpirun to launch two processes, mapped byslot against the slurm allocation of two nodes, each having 4 slots. Thus, both processes will be launched on the first node, which is what you observed.
>>
>> Similarly, the command line
>>
>>> srun -N 2 -n 2 -b mpirun helloworld
>>
>> doesn't specify the #procs to mpirun. In that case, mpirun will launch a process on every available slot in the allocation. Given this command, that means 4 procs will be launched on each of the 2 nodes, for a total of 8 procs. Ranks 0-3 will be placed on the first node, ranks 4-7 on the second. Again, this is what you observed.
>>
>> I don't know if I would say we "interfere" with SLURM - I would say that we are only lightly integrated with SLURM at this time. We use SLURM as a resource manager to assign nodes, and then map processes onto those nodes according to the user's wishes. We chose to do this because srun applies its own load balancing algorithms if you launch processes directly with it, which leaves the user with little flexibility to specify their desired rank/slot mapping. We chose to support the greater flexibility.
>>
>> Using the SLURM-defined mapping will require launching without our mpirun. This capability is still under development, and there are issues with doing that in slurm environments which need to be addressed. It is at a lower priority than providing such support for TM right now, so I wouldn't expect it to become available for several months at least.
>>
>> Alternatively, it may be possible for mpirun to get the SLURM-defined mapping and use it to launch the processes. If we can get it somehow, there is no problem launching it as specified - the problem is how to get the map! Unfortunately, slurm's licensing prevents us from using its internal APIs, so obtaining the map is not an easy thing to do.
>>
>> Anyone who wants to help accelerate that timetable is welcome to contact me. We know the technical issues - this is mostly a problem of (a) priorities versus my available time, and (b) similar considerations on the part of the slurm folks to do the work themselves.
>>
>> Ralph
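
To make the "4(x2)" arithmetic above concrete, here is a rough sketch of that kind of parsing and of byslot placement. It is only an illustration of the logic, not Open MPI's actual implementation; it assumes bash, that scontrol is available to expand SLURM_NODELIST, and takes the requested process count as an optional argument:

#!/bin/bash
# Expand SLURM_TASKS_PER_NODE, e.g. "4(x2)" -> 4 4, or "2,3(x2)" -> 2 3 3
slots=()
IFS=',' read -ra specs <<< "$SLURM_TASKS_PER_NODE"
for spec in "${specs[@]}"; do
  count=${spec%%(*}                      # slots per node in this group of nodes
  reps=1
  [[ $spec == *"(x"* ]] && { reps=${spec#*x}; reps=${reps%)}; }
  for ((i = 0; i < reps; i++)); do slots+=("$count"); done
done

# Pair those counts with the node names in the allocation
mapfile -t nodes < <(scontrol show hostnames "$SLURM_NODELIST")

# byslot mapping: fill every slot on a node before moving to the next node
np=${1:-2}                               # number of processes requested, e.g. mpirun -np 2
rank=0
for i in "${!nodes[@]}"; do
  for ((s = 0; s < slots[i] && rank < np; s++)); do
    echo "rank $rank -> ${nodes[i]}"
    rank=$((rank + 1))
  done
done

With SLURM_TASKS_PER_NODE=4(x2) and two requested processes, this places both ranks on the first node - the same behavior reported below.
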
>> On 3/20/08 3:48 PM, "Tim Prins" <tpr...@open-mpi.org> wrote:
>>
>>> Hi Werner,
>>>
>>> Open MPI does things a little bit differently than other MPIs when it comes to supporting SLURM. See http://www.open-mpi.org/faq/?category=slurm for general information about running with Open MPI on SLURM.
>>>
>>> After trying the commands you sent, I am actually a bit surprised by the results. I would have expected this mode of operation to work. But looking at the environment variables that SLURM is setting for us, I can see why it doesn't.
>>>
>>> On a cluster with 4 cores/node, I ran:
>>> [tprins@odin ~]$ cat mprun.sh
>>> #!/bin/sh
>>> printenv
>>> [tprins@odin ~]$ srun -N 2 -n 2 -b mprun.sh
>>> srun: jobid 55641 submitted
>>> [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
>>> SLURM_TASKS_PER_NODE=4(x2)
>>> [tprins@odin ~]$
>>>
>>> Which seems to be wrong, since the srun man page says that SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on each node". This seems to imply that the value should be "1(x2)". So maybe this is a SLURM problem? If this value were correctly reported, Open MPI should work fine for what you wanted to do.
>>>
>>> Two other things:
>>> 1. You should probably use the command line option '--npernode' for mpirun instead of setting rmaps_base_n_pernode directly.
>>> 2. In regards to your second example below, Open MPI by default maps 'by slot'. That is, it will fill all available slots on the first node before moving to the second. You can change this, see: http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
>>>
>>> I have copied Ralph on this mail to see if he has a better response.
>>>
>>> Tim
>>>
>>> Werner Augustin wrote:
>>>> Hi,
>>>>
>>>> At our site here at the University of Karlsruhe we are running two large clusters with SLURM and HP-MPI. For our new cluster we want to keep SLURM and switch to OpenMPI. While testing I got the following problem:
>>>>
>>>> with HP-MPI I do something like
>>>>
>>>> srun -N 2 -n 2 -b mpirun -srun helloworld
>>>>
>>>> and get
>>>>
>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.
>>>>
>>>> when I try the same with OpenMPI (version 1.2.4)
>>>>
>>>> srun -N 2 -n 2 -b mpirun helloworld
>>>>
>>>> I get
>>>>
>>>> Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.
>>>>
>>>> and with
>>>>
>>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>>>
>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.
>>>>
>>>> which is still wrong, because it uses only one of the two allocated nodes.
>>>>
>>>> OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment variables, starts with slurm one orted per node, and runs tasks up to the maximum number of slots on every node. So basically it also does some 'resource management' and interferes with slurm. OK, I can fix that with an mpirun wrapper script which calls mpirun with the right -np and the right rmaps_base_n_pernode setting, but it gets worse. We want to allocate computing power on a per-CPU basis instead of per node, i.e. different users might share a node. In addition, slurm allows scheduling according to memory usage. Therefore it is important that on every node there is exactly the number of tasks running that slurm wants. The only solution I came up with is to generate for every job a detailed hostfile and call mpirun --hostfile. Any suggestions for improvement?
>>>>
>>>> I've found a discussion thread "slurm and all-srun orterun" in the mailing list archive concerning the same problem, where Ralph Castain announced that he is working on two new launch methods which would fix my problems. Unfortunately his email address is deleted from the archive, so it would be really nice if the friendly elf mentioned there is still around and could forward my mail to him.
>>>>
>>>> Thanks in advance,
>>>> Werner Augustin
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
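
On the per-job hostfile workaround Werner describes above, a minimal sketch of what such a wrapper could look like when run inside the batch allocation. It uses the Open MPI "hostname slots=N" hostfile format; the helloworld binary and the hostfile name are placeholders, and counting "srun hostname" output is just one possible way to recover the per-node task counts slurm intends:

#!/bin/bash
# Build a per-job hostfile that mirrors the task layout slurm granted, then launch with it.
hostfile=hosts.$SLURM_JOB_ID

# srun starts one task per granted slot; counting lines per node gives slots per node.
srun hostname | sort | uniq -c | awk '{print $2, "slots=" $1}' > "$hostfile"

mpirun --hostfile "$hostfile" -np "$SLURM_NTASKS" helloworld

Because the hostfile slots match what slurm granted and -np equals the total task count, the default byslot mapping should then start exactly the number of tasks slurm expects on each node.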