Another option would be to use the license feature and just set licenses
to 0 when they aren't available.
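A minimal sketch of that approach, assuming each software version is modeled as a license in slurm.conf (the names and counts are illustrative, and whether a count of 0 is accepted, versus tracking the licenses as sacctmgr resources, would need to be verified):

    # slurm.conf: one "license" per supported software version (hypothetical names)
    Licenses=sw-version-A:1,sw-version-B:0

    # Jobs request the version they need and stay pending while none are free
    sbatch -L sw-version-A run_tests.sh

    # After Version B is installed, raise its count in slurm.conf and apply it
    scontrol reconfigure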
-Paul Edmon-
On 7/10/2020 12:42 PM, Raj Sahae wrote:
Hi Brian and Paul,
You both sent me suggestions about using an offline dummy node with
all features set. Thanks for your ideas, but that approach isn't
practical for us. We want to allow users to queue for every supported
software version, and those easily number in the thousands or tens of
thousands (every branch, every commit). If I could make this solution
work, I would simply set the AvailableFeatures on all nodes, but that
won't scale well and feels like an improper use of the Feature
capability.
Thanks,
*Raj Sahae* | m. +1 (408) 230-8531
*From: *Raj Sahae <rsa...@tesla.com>
*Date: *Thursday, July 9, 2020 at 4:15 PM
*To: *"slurm-us...@schedmd.com" <slurm-us...@schedmd.com>
*Subject: *How to queue jobs based on non-existent features
Hi all,
My apologies if this is sent twice. The first time, I sent it before
my subscription to the list was complete.
I am attempting to use Slurm as a test automation system because of
its fairly advanced queueing and job control abilities, and because it
scales very well.
However, since our use case is a bit outside the standard usage of
Slurm, we are hitting some issues that don’t appear to have obvious
solutions.
In our current setup, the Slurm nodes are hosts attached to a test
system. Our pipeline (greatly simplified) is to install some software
on the test system and then run sets of tests against it.
In our old pipeline this was done in a single job; with Slurm, I was
hoping to decouple these two actions, since that makes the entire
pipeline more robust to update failures and gives us finer-grained
control over the actual test run.
I would like to allow users to queue jobs with constraints indicating
which software version they need. Then, separately, an automated job
would scan the queue, see jobs that are not being allocated due to
missing resources, and queue software installs appropriately. We
attempted to do this using the Active/Available Features
configuration: we use HealthCheck and Epilog scripts to scrape the
test system for software properties (version, commit, etc.) and assign
them to the node as Features. Once an install is complete and the
Features are updated, queued jobs would start to be allocated on those
nodes.
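A simplified sketch of that update step (the version probe and the feature naming here are placeholders, not the actual script):

    #!/bin/bash
    # Hypothetical HealthCheck/Epilog fragment: detect the software version
    # installed on the attached test system and advertise it as a node feature.
    node=$(hostname -s)
    version=$(get_test_system_version)   # placeholder for the real probe
    scontrol update NodeName="$node" \
        AvailableFeatures="version-${version}" \
        ActiveFeatures="version-${version}"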
Herein lies the conundrum. If a user submits a job constrained to run
on Version A, but all nodes in the cluster are currently configured
with Features=Version-B, Slurm will fail to queue the job, indicating
an invalid feature specification (an illustrative example appears at
the end of this message). I completely understand why Features are
implemented this way, so my question is: is there some workaround or
other Slurm capability that I could use to achieve this behavior?
Otherwise my options seem to be:
1. Go back to how we did it before. The pipeline would have the same
   level of robustness as before, but at least we would still be able
   to leverage Slurm's other queueing capabilities.
2. Write our own Feature or Job Submit plugin that customizes this
   behavior just for us. This seems possible, but it adds lead time
   and complexity.
It's not feasible to update the config for all
branches/versions/commits to be AvailableFeatures, as our branch
ecosystem is quite large and the maintenance of that approach would
not scale well.
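To illustrate the failure concretely (the feature names are made up and the exact error text may vary by Slurm version):

    # Every node currently advertises only Version-B:
    #   NodeName=testhost01 AvailableFeatures=Version-B ActiveFeatures=Version-B
    # A job constrained to a not-yet-installed version is rejected at submit time:
    sbatch --constraint=Version-A run_tests.sh
    # sbatch: error: Batch job submission failed: Invalid feature specification
    # What we want instead is for the job to pend until some node gains Version-A.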
Thanks,
*Raj Sahae | Manager, Software QA*
3500 Deer Creek Rd, Palo Alto, CA 94304
m. +1 (408) 230-8531 | rsa...@tesla.com