We use a scenario analogous to yours, built on features. Features are
defined in slurm.conf and are associated with the nodes from which a
job may be submitted, so the configuration-managed slurm.conf acts as
the administrative, authoritative source (NodeName=xx-login State=FUTURE
AvailableFeatures=<short-name-of-zone>, where <short-name-of-zone> is
one of {green, blue, orange, etc.}).

The job prolog sets the node's features to those specified by the
<short-name-of-zone> tag (slurm.conf has PrologFlags=Alloc set). It
also runs whatever configuration scripts are needed to implement that
set of features. These scripts must either succeed or fail fast within
the prolog timeout, so we pre-prime the configurations and simply flip
the node from one zone to another.
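
In rough outline (the zone lookup and the config script below are
placeholders, not our actual tooling), the prolog amounts to something
like:

    #!/bin/bash
    # Prolog, run at allocation time because PrologFlags=Alloc is set.
    # Placeholder: work out which zone this job wants, e.g. from the
    # feature carried by the submitting node.
    zone=$(/usr/local/sbin/zone-for-job "$SLURM_JOB_ID") || exit 1

    # Apply the pre-primed configuration for that zone; it must succeed
    # or fail fast within the prolog timeout (a failed prolog drains the node).
    /usr/local/sbin/apply-zone-config "$zone" || exit 1

    # Flip the node's active feature to the zone it now serves.
    scontrol update NodeName="$SLURMD_NODENAME" ActiveFeatures="$zone"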

There is also an older version which required a minimal job-submit
plugin to populate the job's AdminComment with the feature from the
'AllocNode'. However, that is now used mainly for accounting and
user-interface convenience.

That said, if I were to reimplement this, I would look seriously at
the interfaces and hooks used to connect to dynamically provisioned
nodes, such as Slurm's ability to provision Google Cloud nodes
(https://github.com/SchedMD/slurm-gcp,
https://cloud.google.com/solutions/deploying-slurm-cluster-compute-engine,
https://slurm.schedmd.com/SLUG19/Slurm_+_GCP.pdf). Cloud-bursting onto
a freshly or dynamically provisioned node matches your use case; the
major difference is that your pool of nodes is local and your own.

Hope this helps,
-Steve

On Thu, Aug 13, 2020 at 7:19 PM Thomas M. Payerle <paye...@umd.edu> wrote:
>
> I have not had a chance to look at your code, but find it intriguing,
> although I am not sure about use cases.  Do you do anything to lock out other
> jobs from the affected node?
> E.g., you submit a job with unsatisfiable constraint foo.
> The tool scanning the cluster detects a job queued with the foo constraint,
> sees node7 is idle, and does something to it so it can satisfy foo.
> However, before your job starts, my queued job starts running on node7 (maybe
> my job needs 2 nodes and only one was free at the time the scanning tool chose
> node7).
> If the change needed for the foo feature is harmless to my job, then it is
> not a big deal, other than that your job is queued longer (and maybe the
> scanning tool makes another foo node) ---
> but in that case, why not make all nodes able to satisfy foo all the time?
>
> Maybe add a feature "generic" and have a job submit plugin that adds the
> generic feature if no other feature is requested, and have the scanning tool
> remove generic when it adds foo.
> (And presumably the scanning tool will detect when there are no more pending
> jobs with the foo feature set, remove foo from any idle nodes, both in the
> actual node modification and in Slurm, and then add the generic feature back;
> see the rough sketch below.)
> Though I can foresee possible abuses (I have a string of jobs and the cluster
> is busy.  My jobs don't mind a foo node, so I submit them requesting foo.
> Once idle nodes are converted to foo nodes, I get an almost de facto
> reservation on the foo nodes).
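>
> Purely as a rough, untested sketch of such a scanning tool (the
> reconfigure command is a placeholder for whatever actually changes the
> node):
>
>     #!/bin/bash
>     # Count pending jobs that request the foo feature.
>     pending_foo=$(squeue -h -t PD -o "%f" | grep -cw foo)
>
>     if [ "$pending_foo" -gt 0 ]; then
>         # Convert one idle "generic" node into a "foo" node.
>         node=$(sinfo -h -N -t idle -o "%N %f" | awk '/generic/ {print $1; exit}')
>         if [ -n "$node" ]; then
>             /usr/local/sbin/reconfigure-node "$node" foo   # placeholder
>             scontrol update NodeName="$node" \
>                 AvailableFeatures=foo ActiveFeatures=foo
>         fi
>     else
>         # No pending foo jobs: return idle foo nodes to generic.
>         for node in $(sinfo -h -N -t idle -o "%N %f" | awk '/foo/ {print $1}'); do
>             /usr/local/sbin/reconfigure-node "$node" generic   # placeholder
>             scontrol update NodeName="$node" \
>                 AvailableFeatures=generic ActiveFeatures=generic
>         done
>     fi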
>
> But again, I am having trouble seeing real use cases.  Only one I can think 
> of is maybe if you want to make different OS versions available; e.g. the cluster
> is normally all CentOS, but if a job has a ubuntu20 flag, then the scanning 
> tool can take an idle node, drain it, reimage as ubuntu20, add ubuntu20 flag, 
> and undrain.
> I
>
> On Thu, Aug 13, 2020 at 7:05 PM Raj Sahae <rsa...@tesla.com> wrote:
>>
>> Hi All,
>>
>>
>>
>> I have developed a first solution to this issue that I brought up back in 
>> early July. I don't think it is complete enough to be the final solution for 
>> everyone but it does work and I think it's a good starting place to showcase 
>> the value of this feature and iterate for improvement. I wanted to let the 
>> list know in case anyone was interested in trying it themselves.
>>
>>
>>
>> In short, I was able to make minimal code changes to the slurmctld config 
>> and job scheduler such that I can:
>>
>> 1. Submit HELD jobs into the queue with sbatch, with invalid constraints,
>> release the job with scontrol, and have it stay in the queue but not
>> allocated.
>> 2. Scan the queue with some other tool, make changes to the cluster as
>> needed, update features, and the scheduler will pick up the new feature
>> changes and schedule the job appropriately.
>>
>>
>>
>> The patch of my code changes is attached 
>> (0001-Add-a-config-option-allowing-unavailable-constraints.patch). I 
>> branched from the tip of 20.02 at the time, commit 34c96f1a2d.
>>
>>
>>
>> I did attempt to do this with plugins at first but after creating skeleton 
>> plugins for a node_feature plugin and a scheduler plugin, I realized that 
>> the constraint check that occurs in the job scheduler happens before any of 
>> those plugins are called.
>>
>>
>>
>> According to the job launch logic flow
>> (https://slurm.schedmd.com/job_launch.html), perhaps I could do something in
>> the job submit plugin but at that point I had spent 4 days playing with the 
>> plugin code and I wanted to prototype a bit faster, so I chose to make 
>> changes directly in the job scheduler.
>>
>>
>>
>> If anyone cares to read through the patch and try out my changes, I would be 
>> interested to know your thoughts on the following:
>>
>>
>>
>> 1. How could this be done with a plugin? Should it be?
>>
>> 2. This feature is incredibly valuable to me. Would it be valuable to you?
>>
>> 3. What general changes need to be made to the code to make it appropriate 
>> to submit a patch to SchedMD?
>>
>>
>>
>> To develop and test (I'm on MacOS), I was using a modified version of this
>> docker compose setup (https://github.com/giovtorres/slurm-docker-cluster) 
>> and I would rsync my repo into the `slurm` subfolder before building the 
>> docker image. I have attached that patch as well 
>> (slurm-docker-cluster.patch).
>>
>>
>>
>> To see the feature work, build with the attached slurm patch and enable the
>> appropriate config option in your slurm.conf. For example, if you want
>> feature prefixes `branch-` and `commit-`, you would add the following entry:
>>
>>
>>
>>     SchedulerDynamicFeatures=branch-,commit-
>>
>>
>>
>> Launch the cluster (in my case with docker-compose) and exec into any of the 
>> nodes. Then set features on the nodes:
>>
>>
>>
>>     scontrol update NodeName=c[1-2] Features=property-1,property-2,branch-A,commit-X
>>
>>
>>
>> You should be able to submit a batch job as normal:
>>
>>
>>
>>     sbatch -p normal -C branch-A  -D /data test.sh
>>
>>
>>
>> Now queue a job with an undefined dynamic feature; it will fail to allocate
>> (expected):
>>
>>
>>
>>     sbatch -p normal -C branch-B -D /data test.sh
>>
>>
>>
>> Now queue a HELD job with an undefined dynamic feature, then release it.
>>
>>
>>
>>     sbatch -p normal -C branch-B -D /data -H test.sh
>>
>>     scontrol release <job_id>
>>
>>
>>
>> This should place an unallocated job into the queue with a reason of 
>> BadConstraints.
>>
>> You can then update a node with the new feature and it should get scheduled 
>> to run.
>>
>>
>>
>>     scontrol update NodeName=c1 AvailableFeatures=platform-test,branch-B ActiveFeatures=platform-test,branch-B
>>
>>
>>
>>
>>
>> Hopefully that little demo works for you. We have been running with this 
>> change in a small test cluster for about 2 weeks and so far no known issues.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Raj Sahae | m. +1 (408) 230-8531
>>
>>
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Alex 
>> Chekholko <a...@calicolabs.com>
>> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Date: Friday, July 10, 2020 at 11:37 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] How to queue jobs based on non-existent features
>>
>>
>>
>> Hey Raj,
>>
>>
>>
>> To me this all sounds, at a high level, like a job for some kind of lightweight
>> middleware on top of SLURM.  E.g. makefiles or something like that.  Where
>> each pipeline would be managed outside of slurm and would maybe submit a job 
>> to install some software, then submit a job to run something on that node, 
>> then run a third job to clean up / remove software.  And it would have to 
>> interact with the several slurm features that have been mentioned in this 
>> thread, such as features or licenses or job dependencies, or gres.
>>
>>
>>
>> snakemake might be an example, but there are many others.
>>
>>
>>
>> Regards,
>>
>> Alex
>>
>>
>>
>> On Fri, Jul 10, 2020 at 11:14 AM Raj Sahae <rsa...@tesla.com> wrote:
>>
>> Hi Paddy,
>>
>>
>>
>> Yes, this is a CI/CD pipeline. We currently use Jenkins pipelines, but they have
>> some significant drawbacks that Slurm solves out of the box, which makes Slurm
>> an attractive alternative.
>>
>> You noted some of them already, like good real time queue management, 
>> pre-emption, node weighting, high resolution priority queueing.
>>
>> Jenkins also doesn’t scale as well w.r.t. node management; it’s quite
>> resource heavy.
>>
>>
>>
>> My original email was a bit wordy but I should emphasize that if we want 
>> Slurm to do the exact same thing as our current Jenkins pipeline, we can 
>> already do that and it works reasonably well.
>>
>> Now I’m trying to move beyond feature parity and am having trouble doing so.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Raj Sahae | m. +1 (408) 230-8531
>>
>>
>>
>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Paddy 
>> Doyle <pa...@tchpc.tcd.ie>
>> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Date: Friday, July 10, 2020 at 10:31 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] How to queue jobs based on non-existent features
>>
>>
>>
>> Hi Raj,
>>
>> It sounds like you might be coming from a CI/CD pipeline setup, but just in
>> case you're not, would you consider something like Jenkins or Gitlab CI
>> instead of Slurm?
>>
>> The users could create multi-stage pipelines, with the 'build' stage
>> installing the required software version, and then multiple 'test' stages
>> to run the tests.
>>
>> It's not the same idea as queuing up multiple jobs. Nor do you get queue
>> priorities or weighting and all of that good stuff from Slurm that you are
>> looking for.
>>
>> Within Slurm, yeah writing custom JobSubmitPlugins and NodeFeaturesPlugins
>> might be required.
>>
>> Paddy
>>
>> On Thu, Jul 09, 2020 at 11:15:57PM +0000, Raj Sahae wrote:
>>
>> > Hi all,
>> >
>> > My apologies if this is sent twice. The first time I sent it without my 
>> > subscription to the list being complete.
>> >
>> > I am attempting to use Slurm as a test automation system for its fairly 
>> > advanced queueing and job control abilities, and also because it scales 
>> > very well.
>> > However, since our use case is a bit outside the standard usage of Slurm, 
>> > we are hitting some issues that don’t appear to have obvious solutions.
>> >
>> > In our current setup, the Slurm nodes are hosts attached to a test system. 
>> > Our pipeline (greatly simplified) would be to install some software on the 
>> > test system and then run sets of tests against it.
>> > In our old pipeline, this was done in a single job, however with Slurm I 
>> > was hoping to decouple these two actions as it makes the entire pipeline 
>> > more robust to update failures and would give us more finely grained job 
>> > control for the actual test run.
>> >
>> > I would like to allow users to queue jobs with constraints indicating 
>> > which software version they need. Then separately some automated job would 
>> > scan the queue, see jobs that are not being allocated due to missing 
>> > resources, and queue software installs appropriately. We attempted to do 
>> > this using the Active/Available Features configuration. We use HealthCheck 
>> > and Epilog scripts to scrape the test system for software properties 
>> > (version, commit, etc.) and assign them as Features. Once an install is 
>> > complete and the Features are updated, queued jobs would start to be 
>> > allocated on those nodes.
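>> >
>> > As a rough illustration (the version-query command and feature names
>> > here are placeholders, not our actual tooling), the scraping script
>> > boils down to something like:
>> >
>> >     #!/bin/bash
>> >     # Run from the node's HealthCheckProgram / Epilog: query the attached
>> >     # test system and publish what was found as node features.
>> >     version=$(query-test-system --version)    # placeholder
>> >     commit=$(query-test-system --commit)      # placeholder
>> >     features="version-${version},commit-${commit}"
>> >     node=${SLURMD_NODENAME:-$(hostname -s)}
>> >     scontrol update NodeName="$node" \
>> >         AvailableFeatures="$features" ActiveFeatures="$features"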
>> >
>> > Herein lies the conundrum. If a user submits a job constrained to run on
>> > Version A, but all nodes in the cluster are currently configured with 
>> > Features=Version-B, Slurm will fail to queue the job, indicating an 
>> > invalid feature specification. I completely understand why Features are 
>> > implemented this way, so my question is, is there some workaround or other 
>> > Slurm capabilities that I could use to achieve this behavior? Otherwise my 
>> > options seem to be:
>> >
>> > 1. Go back to how we did it before. The pipeline would have the same level 
>> > of robustness as before but at least we would still be able to leverage 
>> > other queueing capabilities of Slurm.
>> > 2. Write our own Feature or Job Submit plugin that customizes this 
>> > behavior just for us. Seems possible but adds lead time and complexity to 
>> > the situation.
>> >
>> > It's not feasible to update the config for all branches/versions/commits 
>> > to be AvailableFeatures, as our branch ecosystem is quite large and the 
>> > maintenance of that approach would not scale well.
>> >
>> > Thanks,
>> >
>> > Raj Sahae | Manager, Software QA
>> > 3500 Deer Creek Rd, Palo Alto, CA 94304
>> > m. +1 (408) 230-8531 | 
>> > rsa...@tesla.com
>> >
>>
>>
>>
>> --
>> Paddy Doyle
>> Research IT / Trinity Centre for High Performance Computing,
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> https://www.tchpc.tcd.ie/
>
>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads        paye...@umd.edu
> 5825 University Research Park               (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
