Re: [slurm-users] Current status of checkpointing

2020-08-14 Thread Christopher Samuel
On 8/14/20 6:17 am, Stefan Staeglich wrote: what's the current status of the checkpointing support in SLURM? There isn't any these days; there used to be support for BLCR, but that's been dropped as BLCR is no more. I know from talking with SchedMD they are of the opinion that any current c

Re: [slurm-users] How to queue jobs based on non-existent features

2020-08-14 Thread Steven Senator (slurm-dev-list)
We use a scenario that is analogous to yours using features. Features are defined in slurm.conf and are associated with nodes from which a job may be submitted, as an administrative, configuration-managed authoritative source. (NodeName=xx-login State=FUTURE AvailableFeatures=) (i.e. ={green,blue,

Re: [slurm-users] How to queue jobs based on non-existent features

2020-08-14 Thread Raj Sahae
Hi Thomas, We do not need to lock out jobs from the other nodes. All our jobs specify constraints and will be scheduled on nodes accordingly. To follow your example: * a job with unsatisfiable constraint foo is submitted * the scanning tool detects the job queued and schedules another j

Re: [slurm-users] Restricting job submissions

2020-08-14 Thread Paul Edmon
Probably your best bet would be to use the job_submit.lua script and block using that. -Paul Edmon- On 8/14/2020 11:05 AM, rapier wrote: Hi, I'm relatively new to slurm and I'm trying to deal with something I don't know how to address. I have a reservation set up that users can submit jo

[slurm-users] Restricting job submissions

2020-08-14 Thread rapier
Hi, I'm relatively new to slurm and I'm trying to deal with something I don't know how to address. I have a reservation set up that users can submit jobs to. However, I don't want them to be able to submit any job at all to this reservation. I want to restrict them to only running jobs that c

[slurm-users] Current status of checkpointing

2020-08-14 Thread Stefan Staeglich
Hi, what's the current status of the checkpointing support in SLURM? There was a CRIU plugin mentioned: https://slurm.schedmd.com/SLUG16/ciemat-cr.pdf But it doesn't exist in SLURM 19.05.5 on Ubuntu 20.04. And the manual page mentions an OpenMPI plugin only. Best, Stefan -- Stefan Stäglich,

Re: [slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Jeffrey T Frey
Making the certificate globally-available on the host may not always be permissible. If I were you, I'd write/suggest a modification to the plugin to make the CA path (CURLOPT_CAPATH) and verification itself (CURLOPT_SSL_VERIFYPEER) configurable in Slurm. They are both straightforward options
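The idea of making the CA path and peer verification configurable can be illustrated outside the plugin. The sketch below is not libcurl and not Slurm code: it uses Python's standard `ssl` module as an analogy, and the function name `build_tls_context` and its parameters `ca_path` and `verify_peer` are hypothetical names chosen for this example. The two settings loosely correspond to what CURLOPT_CAPATH and CURLOPT_SSL_VERIFYPEER control in libcurl.

```python
import ssl
from typing import Optional


def build_tls_context(ca_path: Optional[str] = None,
                      verify_peer: bool = True) -> ssl.SSLContext:
    """Build a client TLS context from two configurable settings.

    ca_path     -- hypothetical setting: directory of trusted CA certificates
                   (analogous in spirit to libcurl's CURLOPT_CAPATH).
    verify_peer -- hypothetical setting: whether to verify the server
                   certificate (analogous to CURLOPT_SSL_VERIFYPEER).
    """
    # Start from the platform defaults: verification and hostname checks on.
    ctx = ssl.create_default_context()

    if ca_path is not None:
        # Trust the CA certificates found in this directory in addition
        # to the system store.
        ctx.load_verify_locations(capath=ca_path)

    if not verify_peer:
        # check_hostname must be disabled before verify_mode can be relaxed.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE

    return ctx


# Default behaviour: full verification against the system trust store.
strict = build_tls_context()

# Explicit opt-out, e.g. for a test server with a self-signed certificate.
relaxed = build_tls_context(verify_peer=False)
```

Keeping verification on by default and requiring an explicit opt-out mirrors the usual recommendation: pointing at a private CA directory is preferable to switching verification off altogether.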

Re: [slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Stefan Staeglich
Hi, everything except /etc/ssl/certs/ca-certificates.crt is ignored. So I've copied it to /usr/local/share/ca-certificates/ and run update-ca-certificates. Now it's working :) Best, Stefan Am Freitag, 14. August 2020, 11:42:04 CEST schrieb Stefan Staeglich: > Hi, > > I try to set up the acct_gathe

Re: [slurm-users] [External] Re: openmpi / UCX / srun

2020-08-14 Thread Stijn De Weirdt
hi max, > I have set: 'UCX_TLS=tcp,self,sm' on the slurmd's. > Is it better to build slurm without UCX support or should I simply install > rdma-core? i would look into using mellanox ofed with rdma-core, as it is what mellanox is shifting towards or has already done (not sure what 4.9 has tbh). o

Re: [slurm-users] scheduling issue

2020-08-14 Thread Renfro, Michael
We’ve run a similar setup since I moved to Slurm 3 years ago, with no issues. Could you share partition definitions from your slurm.conf? When you see a bunch of jobs pending, which ones have a reason of “Resources”? Those should be the next ones to run, and ones with a reason of “Priority” are
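To see which pending jobs report “Resources” and which report “Priority”, the queue listing can be grouped by reason. The following is a minimal sketch, assuming the input is text with one job per line in the form "<jobid> <reason>", such as `squeue --states=PENDING --noheader --format="%i %r"` would print; the function name `group_pending_by_reason` and the sample job IDs are made up for illustration.

```python
from collections import defaultdict


def group_pending_by_reason(squeue_text):
    """Group job IDs by their pending reason.

    Expects one job per line: "<jobid> <reason>". The reason is taken as
    everything after the first space, since some reasons contain spaces.
    """
    groups = defaultdict(list)
    for line in squeue_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, _, reason = line.partition(" ")
        groups[reason.strip()].append(job_id)
    return dict(groups)


# Hypothetical sample input; real input would come from squeue.
sample = """\
1001 Resources
1002 Priority
1003 Priority
"""

groups = group_pending_by_reason(sample)
print("Waiting on resources:", groups.get("Resources", []))
print("Waiting on priority: ", groups.get("Priority", []))
```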

[slurm-users] ProfileInfluxDB: Influxdb server with self-signed certificate

2020-08-14 Thread Stefan Staeglich
Hi, I am trying to set up the acct_gather plugin ProfileInfluxDB. Unfortunately, our InfluxDB server has only a self-signed certificate: [2020-08-14T09:54:30.007] [46.0] error: acct_gather_profile/influxdb _send_data: curl_easy_perform failed to send data (discarded). Reason: SSL peer certificate or SS

[slurm-users] scheduling issue

2020-08-14 Thread Erik Eisold
Hello all, we are experiencing an issue in our cluster where entire nodes sometimes remain idle while jobs that could run on those nodes are pending in the queue. Our node topology is a bit special: almost all our nodes are in one common partition, a subset of all those nodes a