We use a scenario analogous to yours, built on features. Features are defined in slurm.conf and are associated with the nodes from which a job may be submitted, so slurm.conf acts as the administratively managed, configuration-controlled authoritative source:

    NodeName=xx-login State=FUTURE AvailableFeatures=<short-name-of-zone>

(i.e. <short-name-of-zone> is one of {green, blue, orange, etc.}).

The job prolog sets the node's features to those specified by the <short-name-of-zone> tag; slurm.conf has PrologFlags=Alloc set. The prolog also runs whatever configuration scripts are needed to implement that set of features. These scripts must either succeed or fail fast within the prolog timeout, so we pre-prime the configurations and just flip a node from one zone to another.
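For concreteness, a rough sketch of the shape of this. Zone names, paths, and the way the prolog learns the zone (here, naively, from SLURM_JOB_CONSTRAINTS) are purely illustrative; the actual provisioning scripts are site-specific.

    # slurm.conf (illustrative)
    PrologFlags=Alloc
    Prolog=/etc/slurm/zone-prolog.sh
    NodeName=green-login  State=FUTURE AvailableFeatures=green
    NodeName=blue-login   State=FUTURE AvailableFeatures=blue
    NodeName=orange-login State=FUTURE AvailableFeatures=orange

    #!/bin/bash
    # /etc/slurm/zone-prolog.sh (hypothetical path), run at allocation time via PrologFlags=Alloc.
    # Assumes the requested zone is visible in SLURM_JOB_CONSTRAINTS; take the first recognized zone.
    zone=$(tr ',&' '\n\n' <<<"${SLURM_JOB_CONSTRAINTS:-}" | grep -xE 'green|blue|orange' | head -n 1)
    [ -n "$zone" ] || exit 0                        # no zone requested: nothing to do
    # Flip the node onto the pre-primed configuration; fail fast so Slurm drains the node on error.
    /usr/local/sbin/activate-zone "$zone" || exit 1 # activate-zone is a placeholder for site scripts
    scontrol update NodeName="$SLURMD_NODENAME" AvailableFeatures="$zone" ActiveFeatures="$zone"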
There is also an older version that required a minimal job_submit plugin to populate the job's AdminComment with the feature from the 'AllocNode'. However, that is now used mainly for accounting and user-interface convenience.

That said, if I were to reimplement this, I would look seriously at the interfaces and hooks used to connect to dynamically provisioned nodes, such as Slurm's ability to provision Google Cloud nodes (https://github.com/SchedMD/slurm-gcp, https://cloud.google.com/solutions/deploying-slurm-cluster-compute-engine, https://slurm.schedmd.com/SLUG19/Slurm_+_GCP.pdf). Cloud-bursting into a freshly or dynamically provisioned node matches your use case; the major difference is that your pool of nodes is nearby and yours.

Hope this helps,
-Steve

On Thu, Aug 13, 2020 at 7:19 PM Thomas M. Payerle <paye...@umd.edu> wrote:

> I have not had a chance to look at your code, but I find it intriguing, although I am not sure about use cases. Do you do anything to lock out other jobs from the affected node?

> E.g., you submit a job with unsatisfiable constraint foo. The tool scanning the cluster detects a job queued with the foo constraint, sees node7 is idle, and so does something to it so it can satisfy foo. However, before your job starts, my queued job starts running on node7 (maybe my job needs 2 nodes and only one was free at the time the scanning tool chose node7). If the change needed for the foo feature is harmless to my job, then it is not a big deal, other than your job being queued longer (and maybe the scanning tool makes another foo node) --- but in that case, why not make all nodes able to satisfy foo all the time?

> Maybe add a feature "generic" and have a job_submit plugin add the generic feature if no other feature is requested, and have the scanning tool remove generic when it adds foo. (And presumably the scanning tool will detect when there are no more pending jobs with the foo feature set and remove it from any idle nodes, both in the actual node modification and in Slurm, and then add the generic feature back.)

> Though I can foresee possible abuses (I have a string of jobs and the cluster is busy. My jobs don't mind a foo node, so I submit them requesting foo. Once idle nodes are converted to foo nodes, I get an almost de facto reservation on the foo nodes.)

> But again, I am having trouble seeing real use cases. The only one I can think of is maybe if you want to make different OS versions available; e.g. the cluster is normally all CentOS, but if a job has a ubuntu20 flag, then the scanning tool can take an idle node, drain it, reimage it as ubuntu20, add the ubuntu20 flag, and undrain it.

> On Thu, Aug 13, 2020 at 7:05 PM Raj Sahae <rsa...@tesla.com> wrote:

>> Hi All,

>> I have developed a first solution to this issue that I brought up back in early July. I don't think it is complete enough to be the final solution for everyone, but it does work, and I think it's a good starting place to showcase the value of this feature and iterate for improvement. I wanted to let the list know in case anyone was interested in trying it themselves.
>> In short, I was able to make minimal code changes to the slurmctld config and job scheduler such that I can:

>> - Submit HELD jobs into the queue with sbatch, with invalid constraints, release the job with scontrol, and have it stay in the queue but not allocated.
>> - Scan the queue with some other tool, make changes to the cluster as needed, update features, and the scheduler will pick up the new feature changes and schedule the job appropriately (a rough sketch of such a tool is at the end of this mail).

>> A patch with my code changes is attached (0001-Add-a-config-option-allowing-unavailable-constraints.patch). I branched from the tip of 20.02 at the time, commit 34c96f1a2d.

>> I did attempt to do this with plugins at first, but after creating skeleton plugins for a node_features plugin and a scheduler plugin, I realized that the constraint check in the job scheduler happens before any of those plugins are called. According to the job launch logic flow (https://slurm.schedmd.com/job_launch.html), perhaps I could do something in the job_submit plugin, but at that point I had spent 4 days playing with the plugin code and I wanted to prototype a bit faster, so I chose to make changes directly in the job scheduler.

>> If anyone cares to read through the patch and try out my changes, I would be interested to know your thoughts on the following:

>> 1. How could this be done with a plugin? Should it be?
>> 2. This feature is incredibly valuable to me. Would it be valuable to you?
>> 3. What general changes would need to be made to the code to make it appropriate to submit as a patch to SchedMD?

>> To develop and test (I'm on macOS), I was using a modified version of this docker-compose setup (https://github.com/giovtorres/slurm-docker-cluster), and I would rsync my repo into the `slurm` subfolder before building the docker image. I have attached that patch as well (slurm-docker-cluster.patch).

>> To see the feature work, build with the attached Slurm patch and enable the appropriate config option in your slurm.conf. For example, if you want feature prefixes `branch-` and `commit-`, you would add the following entry:

>>     SchedulerDynamicFeatures=branch-,commit-

>> Launch the cluster (in my case with docker-compose) and exec into any of the nodes. Then set features on the nodes:

>>     scontrol update NodeName=c[1-2] Features=property-1,property-2,branch-A,commit-X

>> You should be able to submit a batch job as normal:

>>     sbatch -p normal -C branch-A -D /data test.sh

>> Now queue a job with an undefined dynamic feature; it will fail to allocate (expected):

>>     sbatch -p normal -C branch-B -D /data test.sh

>> Now queue a HELD job with an undefined dynamic feature, then release it:

>>     sbatch -p normal -C branch-B -D /data -H test.sh
>>     scontrol release <job_id>

>> This should place an unallocated job into the queue with a reason of BadConstraints. You can then update a node with the new feature and the job should get scheduled to run:

>>     scontrol update NodeName=c1 AvailableFeatures=platform-test,branch-B ActiveFeatures=platform-test,branch-B

>> Hopefully that little demo works for you. We have been running with this change in a small test cluster for about 2 weeks and so far no known issues.
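>> As a very rough illustration of what I mean by "scan the queue with some other tool" (this is not part of the patch; the squeue/sinfo field handling is simplified, it assumes a single requested feature per job, and the actual provisioning step is a placeholder):

>>     #!/bin/bash
>>     # Find pending jobs stuck on BadConstraints and re-provision an idle node to satisfy them.
>>     squeue -h -t PD -o "%i %r %f" | while read -r jobid reason features; do
>>         case "$reason" in *BadConstraints*) ;; *) continue ;; esac
>>         node=$(sinfo -h -t idle -o "%n" | head -n 1)
>>         [ -n "$node" ] || continue
>>         # Site-specific step: install/configure whatever "$features" requires on "$node",
>>         # e.g. /usr/local/sbin/provision-node "$node" "$features"   (placeholder)
>>         scontrol update NodeName="$node" AvailableFeatures="$features" ActiveFeatures="$features"
>>     done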
>> Thanks,

>> Raj Sahae | m. +1 (408) 230-8531

>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Alex Chekholko <a...@calicolabs.com>
>> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Date: Friday, July 10, 2020 at 11:37 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] How to queue jobs based on non-existent features

>> Hey Raj,

>> To me this all sounds, at a high level, like a job for some kind of lightweight middleware on top of Slurm, e.g. makefiles or something like that, where each pipeline would be managed outside of Slurm and would maybe submit a job to install some software, then submit a job to run something on that node, then run a third job to clean up / remove the software. It would have to interact with the several Slurm capabilities that have been mentioned in this thread, such as features, licenses, job dependencies, or GRES.

>> snakemake might be an example, but there are many others.

>> Regards,
>> Alex

>> On Fri, Jul 10, 2020 at 11:14 AM Raj Sahae <rsa...@tesla.com> wrote:

>> Hi Paddy,

>> Yes, this is a CI/CD pipeline. We currently use Jenkins pipelines, but Jenkins has some significant drawbacks that Slurm solves out of the box, which makes Slurm an attractive alternative. You noted some of them already: good real-time queue management, preemption, node weighting, high-resolution priority queueing. Jenkins also doesn't scale as well with respect to node management; it's quite resource-heavy.

>> My original email was a bit wordy, but I should emphasize that if we want Slurm to do the exact same thing as our current Jenkins pipeline, we can already do that and it works reasonably well. Now I'm trying to move beyond feature parity and am having trouble doing so.

>> Thanks,

>> Raj Sahae | m. +1 (408) 230-8531

>> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Paddy Doyle <pa...@tchpc.tcd.ie>
>> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Date: Friday, July 10, 2020 at 10:31 AM
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: Re: [slurm-users] How to queue jobs based on non-existent features

>> Hi Raj,

>> It sounds like you might be coming from a CI/CD pipeline setup, but just in case you're not, would you consider something like Jenkins or GitLab CI instead of Slurm? The users could create multi-stage pipelines, with the 'build' stage installing the required software version, and then multiple 'test' stages to run the tests.

>> It's not the same idea as queuing up multiple jobs, nor do you get the queue priorities or weighting and all of that good stuff from Slurm that you are looking for.

>> Within Slurm, yeah, writing custom JobSubmitPlugins and NodeFeaturesPlugins might be required.

>> Paddy

>> On Thu, Jul 09, 2020 at 11:15:57PM +0000, Raj Sahae wrote:

>> > Hi all,
>> >
>> > My apologies if this is sent twice. The first time I sent it without my subscription to the list being complete.
>> >
>> > I am attempting to use Slurm as a test automation system for its fairly advanced queueing and job control abilities, and also because it scales very well. However, since our use case is a bit outside the standard usage of Slurm, we are hitting some issues that don't appear to have obvious solutions.
>> > In our current setup, the Slurm nodes are hosts attached to a test system. Our pipeline (greatly simplified) would be to install some software on the test system and then run sets of tests against it. In our old pipeline this was done in a single job; with Slurm I was hoping to decouple these two actions, as that makes the entire pipeline more robust to update failures and would give us more finely grained job control for the actual test run.
>> >
>> > I would like to allow users to queue jobs with constraints indicating which software version they need. Then, separately, some automated job would scan the queue, see jobs that are not being allocated due to missing resources, and queue software installs appropriately. We attempted to do this using the Active/Available Features configuration: we use HealthCheck and Epilog scripts to scrape the test system for software properties (version, commit, etc.) and assign them as Features. Once an install is complete and the Features are updated, queued jobs would start to be allocated on those nodes.
>> >
>> > Herein lies the conundrum. If a user submits a job constrained to run on Version A, but all nodes in the cluster are currently configured with Features=Version-B, Slurm will fail to queue the job, indicating an invalid feature specification. I completely understand why Features are implemented this way, so my question is: is there some workaround or other Slurm capability that I could use to achieve this behavior? Otherwise my options seem to be:
>> >
>> > 1. Go back to how we did it before. The pipeline would have the same level of robustness as before, but at least we would still be able to leverage the other queueing capabilities of Slurm.
>> > 2. Write our own Feature or Job Submit plugin that customizes this behavior just for us. That seems possible but adds lead time and complexity to the situation.
>> >
>> > It's not feasible to update the config so that all branches/versions/commits are AvailableFeatures, as our branch ecosystem is quite large and the maintenance of that approach would not scale well.
>> >
>> > Thanks,
>> >
>> > Raj Sahae | Manager, Software QA
>> > 3500 Deer Creek Rd, Palo Alto, CA 94304
>> > m. +1 (408) 230-8531 | rsa...@tesla.com

>> --
>> Paddy Doyle
>> Research IT / Trinity Centre for High Performance Computing,
>> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
>> Phone: +353-1-896-3725
>> https://www.tchpc.tcd.ie/

> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads    paye...@umd.edu
> 5825 University Research Park        (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831