No, I haven't seen that - if you can provide an example, we can take a look
at it.

Thanks
Ralph



On 6/19/08 8:15 AM, "Sacerdoti, Federico"
<federico.sacerd...@deshawresearch.com> wrote:

> Ralph, another issue perhaps you can shed some light on.
> 
> When launching with orterun, we sometimes see null characters in the
> stdout output. These do not show up on a terminal, but when the output
> is piped to a file they are visible in an editor. They can also appear
> in the middle of a line, and so can interfere with greps on the output,
> etc.
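> 
> A quick way to check a captured log for stray NULs (just a sketch; the
> file name is arbitrary and any byte-level comparison would do) is:
> 
>   tr -d '\0' < job.out > job.clean
>   cmp -s job.out job.clean || echo "NUL bytes present in job.out"
> 
> Grepping the stripped copy also sidesteps the mid-line NUL problem
> until the root cause is found.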
> 
> Have you seen this before? I am working on a simple test case, but
> unfortunately have not found one that is deterministic so far.
> 
> Thanks,
> Federico 
> 
> -----Original Message-----
> From: Ralph H Castain [mailto:r...@lanl.gov]
> Sent: Tuesday, June 17, 2008 1:09 PM
> To: Sacerdoti, Federico; Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] SLURM and OpenMPI
> 
> I can believe 1.2.x has problems in that regard. Some of that has
> nothing to do with slurm and reflects internal issues with 1.2.
> 
> We have made it much more resistant to those problems in the upcoming
> 1.3 release, but there is no plan to retrofit those changes to 1.2.
> Part of the problem was that we weren't using the --kill-on-bad-exit
> flag when we called srun internally, which has been fixed for 1.3.
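> 
> As a minimal illustration of what that flag buys us (illustrative only,
> assuming a live two-node allocation), a task that exits abnormally now
> causes srun to kill the rest of the step and return non-zero promptly,
> rather than waiting on the surviving tasks:
> 
>   srun -N 2 --kill-on-bad-exit sh -c '[ "$SLURM_PROCID" -eq 0 ] && exit 1; sleep 300'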
> 
> BTW: we actually do use srun to launch the daemons - we just call it
> internally from inside orterun. The only real difference is that we use
> orterun to set up the cmd line and then tell the daemons what they need
> to do. The issues you are seeing relate to our ability to detect that
> srun has failed, and/or that one or more daemons have failed to launch
> or do something they were supposed to do. The 1.2 system has problems
> in that regard, which was one motivation for the 1.3 overhaul.
> 
> I would argue that slurm allowing us to attempt to launch on a
> no-longer-valid allocation is a slurm issue, not OMPI's. As I said, we
> use srun to launch the daemons - the only reason we hang is that srun
> is not returning with an error. I've seen this on other systems as
> well, but have no real answer - if slurm doesn't indicate an error has
> occurred, I'm not sure what I can do about it.
> 
> We are unlikely to use srun to directly launch jobs (i.e., to have
> slurm directly launch the job from an srun cmd line without mpirun)
> anytime soon. It isn't clear there is enough benefit to justify the
> rather large effort, especially considering what would be required to
> maintain scalability. Decisions on all that are still pending, though,
> which means any significant change in that regard wouldn't be released
> until sometime next year.
> 
> Ralph
> 
> On 6/17/08 10:39 AM, "Sacerdoti, Federico"
> <federico.sacerd...@deshawresearch.com> wrote:
> 
>> Ralph,
>> 
>> I was wondering what the status of this feature was (using srun to
>> launch orted daemons)? I have two new bug reports to add from our
>> experience using orterun from 1.2.6 on our 4000-CPU infiniband cluster.
>> 
>> 1. Orterun will happily hang if it is asked to run on an invalid slurm
>> job, e.g. if the job has exceeded its time limit. This would be
>> trivially fixed if you used srun to launch, as srun would fail with a
>> non-zero exit code.
>> 
>> 2. A very simple orterun invocation hangs instead of exiting with an
>> error. In this case the executable does not exist, and we would expect
>> orterun to exit non-zero. This has caused headaches with some workflow
>> management scripts that automatically start jobs.
>> 
>> salloc -N2 -p swdev orterun dummy-binary-I-dont-exist
>> [hang]
>> 
>> orterun dummy-binary-I-dont-exist
>> [hang]
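>> 
>> A possible stopgap (just a sketch, assuming GNU coreutils timeout is
>> available; the 600-second limit is arbitrary) is to bound the run so
>> that workflow scripts at least get a failure back instead of a hang:
>> 
>>   timeout 600 orterun dummy-binary-I-dont-exist; echo $?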
>> 
>> Thanks,
>> Federico
>> 
>> -----Original Message-----
>> From: Sacerdoti, Federico
>> Sent: Friday, March 21, 2008 5:41 PM
>> To: 'Open MPI Users'
>> Subject: RE: [OMPI users] SLURM and OpenMPI
>> 
>> 
>> Ralph wrote:
>> "I don't know if I would say we "interfere" with SLURM - I would say
>> that we are only lightly integrated with SLURM at this time. We use
>> SLURM as a resource manager to assign nodes, and then map processes
>> onto those nodes according to the user's wishes. We chose to do this
>> because srun applies its own load balancing algorithms if you launch
>> processes directly with it, which leaves the user with little
>> flexibility to specify their desired rank/slot mapping. We chose to
>> support the greater flexibility."
>>  
>> Ralph, we wrote a launcher for mvapich that uses srun to launch but
>> keeps tight control of where processes are started. The way we did it
>> was to force srun to launch a single process on a particular node.
>> 
>> The launcher calls many of these:
>>  srun --jobid $JOBID -N 1 -n 1 -w host005 CMD ARGS
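>> 
>> In outline (a simplified sketch, not the actual launcher code; HOSTS
>> stands for the node list of the slurm allocation), the driving loop is
>> something like:
>> 
>>   for host in $HOSTS; do
>>     srun --jobid $JOBID -N 1 -n 1 -w $host CMD ARGS &
>>   done
>>   wait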
>> 
>> Hope this helps (and we are looking forward to a tighter orterun/slurm
>> integration as you know).
>> 
>> Regards,
>> Federico
>> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Ralph Castain
>> Sent: Thursday, March 20, 2008 6:41 PM
>> To: Open MPI Users <us...@open-mpi.org>
>> Cc: Ralph Castain
>> Subject: Re: [OMPI users] SLURM and OpenMPI
>> 
>> Hi there
>> 
>> I am no slurm expert. However, it is our understanding that
>> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
>> not the number of tasks to be executed on each node. So the 4(x2)
>> tells us that we have 4 slots on each of two nodes to work with. You
>> got 4 slots on each node because you used the -N option, which told
>> slurm to assign all slots on that node to this job - I assume you have
>> 4 processors on your nodes. OpenMPI parses that string to get the
>> allocation, then maps the number of specified processes against it.
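>> 
>> As an illustration of that format (a sketch only, not the actual OMPI
>> parsing code), a value such as "4(x2),2" expands to one slot count per
>> node like so:
>> 
>>   echo "4(x2),2" | tr ',' '\n' | \
>>     awk -F'[(x)]' '{ n = ($3 == "" ? 1 : $3); for (i = 0; i < n; i++) print $1 }'
>>   # prints 4, 4, 2 - one line per node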
>> 
>> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
>> different when used to allocate as opposed to directly launch
>> processes. Our typical usage is for someone to do:
>> 
>> srun -N 2 -A
>> mpirun -np 2 helloworld
>> 
>> In other words, we use srun to create an allocation, and then run
>> mpirun separately within it.
>> 
>> 
>> I am therefore unsure what the "-n 2" will do here. If I believe the
>> documentation, it would seem to imply that srun will attempt to launch
>> two copies of "mpirun -np 2 helloworld", yet your output doesn't seem
>> to support that interpretation. It would appear that the "-n 2" is
>> being ignored and only one copy of mpirun is being launched. I'm no
>> slurm expert, so perhaps that interpretation is incorrect.
>> 
>> Assuming that the -n 2 is ignored in this situation, your command line:
>> 
>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>> 
>> will cause mpirun to launch two processes, mapped by slot against the
>> slurm allocation of two nodes, each having 4 slots. Thus, both
>> processes will be launched on the first node, which is what you
>> observed.
>> 
>> Similarly, the command line
>> 
>>> srun -N 2 -n 2 -b mpirun helloworld
>> 
>> doesn't specify the #procs to mpirun. In that case, mpirun will launch
>> a process on every available slot in the allocation. Given this
>> command, that means 4 procs will be launched on each of the 2 nodes,
>> for a total of 8 procs. Ranks 0-3 will be placed on the first node,
>> ranks 4-7 on the second. Again, this is what you observed.
>> 
>> I don't know if I would say we "interfere" with SLURM - I would say
>> that we are only lightly integrated with SLURM at this time. We use
>> SLURM as a resource manager to assign nodes, and then map processes
>> onto those nodes according to the user's wishes. We chose to do this
>> because srun applies its own load balancing algorithms if you launch
>> processes directly with it, which leaves the user with little
>> flexibility to specify their desired rank/slot mapping. We chose to
>> support the greater flexibility.
>> 
>> Using the SLURM-defined mapping will require launching without our
>> mpirun. This capability is still under development, and there are
>> issues with doing that in slurm environments which need to be
>> addressed. It is at a lower priority than providing such support for
>> TM right now, so I wouldn't expect it to become available for several
>> months at least.
>> 
>> Alternatively, it may be possible for mpirun to get the SLURM-defined
>> mapping and use it to launch the processes. If we can get it somehow,
>> there is no problem launching it as specified - the problem is how to
>> get the map! Unfortunately, slurm's licensing prevents us from using
>> its internal APIs, so obtaining the map is not an easy thing to do.
>> 
>> Anyone who wants to help accelerate that timetable is welcome to
>> contact me. We know the technical issues - this is mostly a problem of
>> (a) priorities versus my available time, and (b) similar
>> considerations on the part of the slurm folks to do the work
>> themselves.
>> 
>> Ralph
>> 
>> 
>> On 3/20/08 3:48 PM, "Tim Prins" <tpr...@open-mpi.org> wrote:
>> 
>>> Hi Werner,
>>> 
>>> Open MPI does things a little bit differently than other MPIs when it
>>> comes to supporting SLURM. See
>>> http://www.open-mpi.org/faq/?category=slurm
>>> for general information about running with Open MPI on SLURM.
>>> 
>>> After trying the commands you sent, I am actually a bit surprised by
>>> the results. I would have expected this mode of operation to work. But
>>> looking at the environment variables that SLURM is setting for us, I
>>> can see why it doesn't.
>>> 
>>> On a cluster with 4 cores/node, I ran:
>>> [tprins@odin ~]$ cat mprun.sh
>>> #!/bin/sh
>>> printenv
>>> [tprins@odin ~]$  srun -N 2 -n 2 -b mprun.sh
>>> srun: jobid 55641 submitted
>>> [tprins@odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
>>> SLURM_TASKS_PER_NODE=4(x2)
>>> [tprins@odin ~]$
>>> 
>>> Which seems to be wrong, since the srun man page says that
>>> SLURM_TASKS_PER_NODE is the "Number of tasks to be initiated on each
>>> node". This seems to imply that the value should be "1(x2)". So maybe
>>> this is a SLURM problem? If this value were correctly reported, Open
>>> MPI should work fine for what you wanted to do.
>>> 
>>> Two other things:
>>> 1. You should probably use the command line option '--npernode' for
>>> mpirun instead of setting the rmaps_base_n_pernode directly.
>>> 2. Regarding your second example below, Open MPI by default maps 'by
>>> slot'. That is, it will fill all available slots on the first node
>>> before moving to the second. You can change this (see the example
>>> after this list, and the FAQ):
>>> http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
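>>> 
>>> For instance (illustrative only; assumes the same two-node, 4-slot
>>> allocation), either of these should place one process on each node:
>>> 
>>>   mpirun -np 2 --npernode 1 helloworld
>>>   mpirun -np 2 --bynode helloworld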
>>> 
>>> I have copied Ralph on this mail to see if he has a better response.
>>> 
>>> Tim
>>> 
>>> Werner Augustin wrote:
>>>> Hi,
>>>> 
>>>> At our site here at the University of Karlsruhe we are running two
>>>> large clusters with SLURM and HP-MPI. For our new cluster we want to
>>>> keep SLURM and switch to OpenMPI. While testing I got the following
>>>> problem:
>>>> 
>>>> with HP-MPI I do something like
>>>> 
>>>> srun -N 2 -n 2 -b mpirun -srun helloworld
>>>> 
>>>> and get 
>>>> 
>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.
>>>> 
>>>> when I try the same with OpenMPI (version 1.2.4)
>>>> 
>>>> srun -N 2 -n 2 -b mpirun helloworld
>>>> 
>>>> I get
>>>> 
>>>> Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
>>>> Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.
>>>> 
>>>> and with 
>>>> 
>>>> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>>>> 
>>>> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
>>>> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.
>>>> 
>>>> which is still wrong, because it uses only one of the two allocated
>>>> nodes.
>>>> 
>>>> OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
>>>> variables, uses slurm to start one orted per node, and launches tasks
>>>> up to the maximum number of slots on every node. So basically it also
>>>> does some 'resource management' and interferes with slurm. OK, I can
>>>> fix that with an mpirun wrapper script which calls mpirun with the
>>>> right -np and the right rmaps_base_n_pernode setting, but it gets
>>>> worse. We want to allocate computing power on a per-CPU basis instead
>>>> of per node, i.e. different users might share a node. In addition,
>>>> slurm allows scheduling according to memory usage. Therefore it is
>>>> important that each node runs exactly the number of tasks that slurm
>>>> wants. The only solution I came up with is to generate a detailed
>>>> hostfile for every job and call mpirun --hostfile (roughly as
>>>> sketched below). Any suggestions for improvement?
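>>>> 
>>>> (The hostfile generation is roughly this sketch - assuming scontrol
>>>> is available to expand the nodelist, and glossing over the per-node
>>>> slot counts from SLURM_TASKS_PER_NODE, which would still have to be
>>>> turned into "slots=" entries:)
>>>> 
>>>>   scontrol show hostnames "$SLURM_NODELIST" > hosts.$SLURM_JOBID
>>>>   mpirun --hostfile hosts.$SLURM_JOBID -np $SLURM_NPROCS helloworld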
>>>> 
>>>> I've found a discussion thread, "slurm and all-srun orterun", in the
>>>> mailing list archive concerning the same problem, where Ralph Castain
>>>> announced that he is working on two new launch methods which would
>>>> fix my problems. Unfortunately his email address is deleted from the
>>>> archive, so it would be really nice if the friendly elf mentioned
>>>> there is still around and could forward my mail to him.
>>>> 
>>>> Thanks in advance,
>>>> Werner Augustin
> 
> 

