Hi Paul,

Thanks for your pointers.

I'll look into QOS and MCS after my paper deadline (Sept 5).

Re QOS, as expressed to Peter in the reply I just sent, I wonder whether
the QOS of a job can be changed while it's pending (submitted but not yet
running).

Regards,
Guillaume.

On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:
> A QoS is probably your best bet. Another variant might be MCS, which
> you can use to help reduce resource fragmentation. For limits though
> QoS will be your best bet.
>
> -Paul Edmon-
>
> On 8/30/19 7:33 AM, Steven Dick wrote:
> > It would still be possible to use job arrays in this situation, it's
> > just slightly messy.
> > So the way a job array works is that you submit a single script, and
> > that script is provided an integer for each subjob. The integer is in
> > a range, with a possible step (default=1).
> >
> > To run the situation you describe, you would have to predetermine how
> > many of each test you want to run (i.e., you couldn't dynamically
> > change the number of jobs that run within one array), and a master
> > script would map the integer range to the job that was to be started.
> >
> > The most trivial way to do it would be to put the list of regressions
> > in a text file and the master script would index it by line number and
> > then run the appropriate command.
> > A more complex way would be to do some math (a divide?) to get the
> > script name and subindex (modulus?) for each regression.
> >
> > Both of these would require some semi-advanced scripting, but nothing
> > that couldn't be cut and pasted with some trivial modifications for
> > each job set.
> >
> > As to the unavailability of the admin ...
> > An alternate approach that would require the admin's help would be to
> > come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100
> > gpus, etc.)
> > and make a QOS for each one with a gpu limit (e.g.,
> > maxtrespu=gpu=40). Then the user would assign that QOS to the job when
> > starting it to set the overall allocation for all the jobs. The admin
> > wouldn't need to tweak this except once, you just pick which tweak to
> > use.
> >
> > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
> > <gperr...@uottawa.ca> wrote:
> >> Hi Steven,
> >>
> >> Thanks for taking the time to reply to my post.
> >>
> >> Setting a limit on the number of jobs for a single array isn't
> >> sufficient because regression tests need to launch multiple arrays,
> >> and I would need a job limit that takes effect over all launched jobs.
> >>
> >> It's very possible I'm not understanding something. I'll lay out a
> >> very specific example in the hope that you can correct me if I've gone
> >> wrong somewhere.
> >>
> >> Let's take the small cluster with 140 GPUs and no fairshare as an
> >> example, because it's easier for me to explain.
> >>
> >> The users, who all know each other personally and interact via chat,
> >> decide on a daily basis how many jobs each user can run at a time.
> >>
> >> Let's say today is Sunday (hypothetically). Nobody is actively
> >> developing today, except that user 1 has 10 jobs running for the
> >> entire weekend. That leaves 130 GPUs unused.
> >>
> >> User 2, whose jobs all run on 1 GPU, decides to run a regression test.
> >> The regression test consists of 9 different scripts, each run 40
> >> times, for a grand total of 360 jobs. The scripts take between 1 and 5
> >> hours to complete, and the jobs take 4 hours on average.
> >>
> >> User 2 gets the user group's approval (via chat) to use 90 GPUs (so
> >> that 40 GPUs will remain for anyone else wanting to work that day).
> >>
> >> The problem I'm trying to solve is this: how do I ensure that user 2
> >> launches his 360 jobs in such a way that 90 jobs are in the run state
> >> consistently until the regression test is finished?
> >>
> >> Keep in mind that:
> >>
> >> - limiting each job array to 10 jobs is inefficient: when the first
> >>   job array finishes (long before the last one), only 80 GPUs will be
> >>   used, and so on as other arrays finish
> >> - the admin is not available; he cannot be asked to set a hard limit
> >>   of 90 jobs for user 2 just for today
> >>
> >> I would be happy to use job arrays if they allow me to set an
> >> overarching job limit across multiple arrays. Perhaps this is doable.
> >> Admittedly I'm working on a paper to be submitted in a few days, so I
> >> don't have time to test job arrays thoroughly right now, but I will
> >> try them out properly once I've submitted my paper (i.e., after
> >> Sept 5).
> >>
> >> My solution, for now, is to not use job arrays. Instead, I launch each
> >> job individually, and I use singleton (by launching all jobs under the
> >> same set of 90 unique names) to ensure that exactly 90 jobs run at a
> >> time (in this case, corresponding to 90 GPUs in use).
> >>
> >> Side note: the unavailability of the admin might sound contrived
> >> because I picked Sunday as an example, but it's in fact very typical.
> >> The admin is not available:
> >>
> >> - on weekends (the present example)
> >> - at any time outside of 9am to 5pm (keep in mind, this is a cluster
> >>   used by students in different time zones)
> >> - any time he is on vacation
> >> - any time he is looking after his many other responsibilities.
> >>   Constantly setting user limits that change on a daily basis would be
> >>   too much to ask.
> >>
> >> I'd be happy if you corrected my misunderstandings, especially if you
> >> could show me how to set a job limit that takes effect over multiple
> >> job arrays.
> >>
> >> I may have very glaring oversights, as I don't necessarily have a
> >> big-picture view of things (most notably, I've never been an admin),
> >> so feel free to poke holes in the way I've constructed things.
> >>
> >> Regards,
> >> Guillaume.
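
The singleton workaround described in the message above could look roughly like the sketch below. It is only an illustration under stated assumptions, not Guillaume's actual launcher: the script names (`./test_0.sh` and so on), the `regr_slot` name prefix, and passing the run index as a script argument are all hypothetical. The sbatch options `--dependency=singleton`, `--job-name`, and `--gres=gpu:1` are real; singleton makes a job wait until every earlier job with the same name and user has finished, so cycling through 90 names caps the number of running jobs at 90. The code only builds the commands and does not submit anything.

```python
def singleton_commands(scripts, runs_per_script, slots, prefix="regr_slot"):
    """Build one sbatch command per job, cycling through `slots` job names.

    With --dependency=singleton, Slurm runs at most one job per
    (job name, user) pair at a time, so at most `slots` jobs run at once.
    """
    commands = []
    job_number = 0
    for script in scripts:
        for run in range(runs_per_script):
            # Round-robin over the slot names: 0, 1, ..., slots-1, 0, 1, ...
            name = f"{prefix}_{job_number % slots}"
            commands.append([
                "sbatch",
                "--dependency=singleton",
                f"--job-name={name}",
                "--gres=gpu:1",      # each job uses a single GPU, as in the example
                script,
                str(run),            # hypothetical: run index passed to the script
            ])
            job_number += 1
    return commands


if __name__ == "__main__":
    # Hypothetical script names standing in for the 9 regression scripts.
    scripts = [f"./test_{i}.sh" for i in range(9)]
    commands = singleton_commands(scripts, runs_per_script=40, slots=90)
    for command in commands[:3]:
        print(" ".join(command))
    print(f"... {len(commands)} commands in total")
```

One caveat of this scheme: each name forms its own queue of 4 jobs, so a slot that happens to receive long jobs drains more slowly, and fewer than 90 jobs may be running toward the end of the test.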
> >>
> >> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick <kg4...@gmail.com> wrote:
> >>> This makes no sense and seems backwards to me.
> >>>
> >>> When you submit an array job, you can specify how many jobs from the
> >>> array you want to run at once.
> >>> So, an administrator can create a QOS that explicitly limits the user.
> >>> However, you keep saying that they probably won't modify the system
> >>> for just you...
> >>>
> >>> That seems to me to be the perfect case to use array jobs and tell it
> >>> how many elements of the array to run at once.
> >>> You're not using array jobs for exactly the wrong reason.
> >>>
> >>> On Tue, Aug 27, 2019 at 1:19 PM Guillaume Perrault Archambault
> >>> <gperr...@uottawa.ca> wrote:
> >>>> The reason I don't use job arrays is to be able to limit the number
> >>>> of jobs per user
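
For comparison, the single-array approach Steven describes (a master script that maps the array index to a script name with a divide and to a sub-index with a modulus) might be sketched as follows. This is a minimal illustration, assuming the 9 scripts x 40 runs example from the thread; the script names and the idea of passing the run index as an argument are hypothetical. `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task, and the `%` suffix of `--array` limits how many tasks run at once, so something like `sbatch --array=0-359%90 master.sh` would cover all 360 jobs under one limit of 90.

```python
import os


def split_task_id(task_id, runs_per_script):
    """Map a flat array index to (script_index, run_index).

    The divide selects the script, the modulus selects the run within it.
    """
    return divmod(task_id, runs_per_script)


def command_for_task(task_id, scripts, runs_per_script):
    """Return the command (as an argument list) that this array task should run."""
    script_index, run_index = split_task_id(task_id, runs_per_script)
    return [scripts[script_index], str(run_index)]


if __name__ == "__main__":
    # Hypothetical script names standing in for the 9 regression scripts.
    scripts = [f"./test_{i}.sh" for i in range(9)]
    # Slurm sets SLURM_ARRAY_TASK_ID for each task; default to 0 outside Slurm.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    print(" ".join(command_for_task(task_id, scripts, runs_per_script=40)))
```

In a real master script the returned command would be executed (for example with `subprocess.run`) rather than printed. The text-file variant Steven mentions works the same way, with the task ID used as a line number into a list of commands instead of being split arithmetically.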