Re: [slurm-users] Question on how to make slurm aware of a CVMFS revision

2020-02-27 Thread Bjørn-Helge Mevik
"Klein, Dennis" writes: > * Can I (and if yes, how can I) update the GRES count dynamically > (The idea would be to monitor the revision changes on all cvmfs > mountpoints with a simple daemon process on each worker node which > then notifies slurm on a revision change)? Perhaps the daemon pro

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread mercan
Hi; At your partition definition, there is "Shared=NO". This means "do not share nodes between jobs". This parameter conflicts with the "OverSubscribe=FORCE:12" parameter. According to the slurm documentation, the Shared parameter has been replaced by the OverSubscribe parameter. But, I suppose
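The conflict described above comes down to a single partition line; a minimal sketch (partition and node names are placeholders) with the deprecated `Shared` spelling removed:

```
# slurm.conf -- set OverSubscribe only; "Shared" is the old spelling of
# the same option, so configuring both invites conflicts like this one
PartitionName=batch Nodes=node[01-04] OverSubscribe=FORCE:12 State=UP
```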

Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-27 Thread Sean Crosby
I remember that we had to add this to our /etc/ssh/sshd_config to get X11 to work with Slurm 19.05 X11UseLocalhost no We added this to our login nodes (where users ssh to), and then restarted the ssh server. You would then need to log out and log back in with X11 forwarding again. Sean -- Se
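For clarity, the change described is one line in the SSH daemon configuration on the login nodes:

```
# /etc/ssh/sshd_config on the login nodes, then restart sshd
X11UseLocalhost no
```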

Re: [slurm-users] Question on how to make slurm aware of a CVMFS revision

2020-02-27 Thread Gennaro Oliva
Hi Dennis, I don't know how cvmfs works. On Wed, Feb 26, 2020 at 06:40:23PM +, Klein, Dennis wrote: > our slurm worker nodes mount several read-only software repositories > via the cvmfs filesystem [1]. Each repository is versioned and each > cvmfs mountpoint automatically switches to serving

[slurm-users] memory in job_submit.lua

2020-02-27 Thread Marcus Wagner
Hi folks, does anyone know how to detect in the lua submission script, if the user used --mem or --mem-per-cpu? And also, if it is possible to "unset" this setting? The reason is, we want to remove all memory thingies set by the user for exclusive jobs. Best Marcus -- Marcus Wagner, Dipl.

Re: [slurm-users] RHEL8 support - Missing Symbols in SelectType libraries

2020-02-27 Thread Stephan Walter
Hi James, Slurm 19.05.5 faces the same problem on CentOS 8 (a hardened environment), and the same fix helps. I will test slurm 20.02 as soon as possible. https://bugs.schedmd.com/show_bug.cgi?id=8414 Best Regards, Stephan -Original Message- From: slurm-users [mailto:slurm-users-boun.

Re: [slurm-users] memory in job_submit.lua

2020-02-27 Thread Bjørn-Helge Mevik
Marcus Wagner writes: > does anyone know how to detect in the lua submission script, if the > user used --mem or --mem-per-cpu? > > And also, if it is possible to "unset" this setting? Something like this should work: if job_desc.pn_min_memory ~= slurm.NO_VAL64 then -- --mem or --mem-per-cpu
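Fleshing out the truncated snippet, a minimal `job_submit.lua` along these lines should work (a sketch: the function skeleton and the reset-to-`NO_VAL64` step are assumptions beyond the quoted fragment, and field behaviour can differ between Slurm versions):

```lua
-- Sketch of a job_submit.lua that detects and clears user memory requests.
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.pn_min_memory ~= slurm.NO_VAL64 then
        -- user passed --mem or --mem-per-cpu; "unset" it by restoring NO_VAL64
        job_desc.pn_min_memory = slurm.NO_VAL64
    end
    return slurm.SUCCESS
end
```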

Re: [slurm-users] Slurm 19.05 X11-forwarding

2020-02-27 Thread Pär Lundö
Hi Sean, Thank you for your reply. I will test it asap. Best regards, Pär Lundö From: slurm-users On Behalf Of Sean Crosby Sent: den 27 februari 2020 10:26 To: Slurm User Community List Subject: Re: [slurm-users] Slurm 19.05 X11-forwarding I remember that we had to add this to our /etc/ssh/ss

[slurm-users] Dynamically change JobFileAppend?

2020-02-27 Thread Eric Berquist
Is it possible to dynamically change JobFileAppend/open-mode behavior? I’m using EpilogSlurmctld to automatically requeue jobs that exit with a certain code, and would like to have those append rather than overwrite, but it seems blunt to set `JobFileAppend=1` and force people who want the defau
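As a per-job alternative to flipping `JobFileAppend` cluster-wide, the same behaviour can be requested at submission time (a sketch; whether requeued jobs can have this injected for them automatically is exactly the open question here):

```
#SBATCH --open-mode=append    # per-job equivalent of JobFileAppend=1
```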

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Robert Kudyba
We figured out the issue. All of our jobs are requesting 1 GPU. Each node only has 1 GPU. Thus, the jobs that are pending are pending based on: resources - meaning "no resources are available for these jobs", meaning "I want a GPU, but there are no GPUs that I can use until a job on a node finish

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Renfro, Michael
If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU nodes are over-provisioned in terms of both RAM and CPU, we end up using the excess resources for non-GPU jobs. If that 32 GB is GPU RAM, then I have no experience with that, but I suspect MPS would be required. > On Fe

Re: [slurm-users] Slurm version 20.02.0 is now available

2020-02-27 Thread Angelos Ching
Hi all, Looks like using --config-server limits you to one config server, if I'm not mistaken? Specifying multiple --config-server will cause slurmd to consider only the last one. (A quick glance at the source seems to agree) Any plan on accepting a second server via command line options? Thanks & r

[slurm-users] Is there a select plugin API that gets called when a job is or has been queued?

2020-02-27 Thread Dean Schulze
This is a code level question. I'm writing a select plugin and I want the plugin to take some action when a job is going to be or has been queued instead of run immediately. Does one of the select plugin APIs get called in either case? I was trying to check for this in select_p_job_test() but it
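One place to look: `select_p_job_test()` is invoked with different `mode` values, which is how a plugin can distinguish an immediate-start attempt from a queue/backfill query. A sketch follows (the signature is abridged from memory of the select plugin API and varies between Slurm versions; check `src/plugins/select/` in your tree):

```c
/* Sketch only -- verify the prototype against your Slurm source tree. */
extern int select_p_job_test(job_record_t *job_ptr, bitstr_t *bitmap,
                             uint32_t min_nodes, uint32_t max_nodes,
                             uint32_t req_nodes, uint16_t mode,
                             List preemptee_candidates,
                             List *preemptee_job_list,
                             bitstr_t *exc_core_bitmap)
{
    if (mode == SELECT_MODE_WILL_RUN) {
        /* scheduler is asking when the job could run, i.e. it is being
         * (or staying) queued rather than started right now */
    } else if (mode == SELECT_MODE_RUN_NOW) {
        /* immediate-start attempt; failure here sends the job to the queue */
    }
    return SLURM_SUCCESS;
}
```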

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Robert Kudyba
> > If that 32 GB is main system RAM, and not GPU RAM, then yes. Since our GPU > nodes are over-provisioned in terms of both RAM and CPU, we end up using > the excess resources for non-GPU jobs. > No it's GPU RAM > If that 32 GB is GPU RAM, then I have no experience with that, but I > suspect MP

Re: [slurm-users] Slurm 17.11 and configuring backfill and oversubscribe to allow concurrent processes

2020-02-27 Thread Christopher Samuel
On 2/27/20 11:23 AM, Robert Kudyba wrote: OK so does SLURM support MPS and if so what version? Would we need to enable cons_tres and use, e.g., --mem-per-gpu? Slurm 19.05 (and later) supports MPS - here's the docs from the most recent release of 19.05: https://slurm.schedmd.com/archive/slur
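The 19.05 docs referenced above configure MPS as its own GRES; a minimal sketch (the node name and the share count of 100 are placeholders along the lines of the documentation's examples):

```
# slurm.conf
GresTypes=gpu,mps
NodeName=node001 Gres=gpu:1,mps:100

# gres.conf on the node
Name=gpu File=/dev/nvidia0
Name=mps Count=100
```

A job would then request a share of the GPU with e.g. `--gres=mps:50` instead of `--gres=gpu:1`.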

[slurm-users] How to show state of CLOUD nodes

2020-02-27 Thread Carter, Allan
I'm setting up an EC2 SLURM cluster and when an instance doesn't resume fast enough I get an error like: node c7-c5-24xl-464 not resumed by ResumeTimeout(600) - marking down and power_save I keep running into issues where my cloud nodes do not show up in sinfo and I can't display their informa
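One thing worth checking for the sinfo issue: powered-down CLOUD nodes are hidden from `sinfo` by default, and the slurm.conf docs describe a flag to make them visible (a sketch; verify against your Slurm version):

```
# slurm.conf -- show powered-down cloud nodes in sinfo/scontrol output
PrivateData=cloud
```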

[slurm-users] Problem with configuration CPU/GPU partitions

2020-02-27 Thread Pavel Vashchenkov
Hello, I have a hybrid cluster with two GPUs and two 20-core CPUs on each node. I created two partitions: - "cpu" for CPU-only jobs which are allowed to allocate up to 38 cores per node - "gpu" for GPU-only jobs which are allowed to allocate up to 2 GPUs and 2 CPU cores. Respective sections in slur
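The documented way to carve nodes up like this is `MaxCPUsPerNode` on each partition; a sketch of the slurm.conf sections described (node names and ranges are placeholders matching the 40-core, 2-GPU nodes):

```
# slurm.conf
GresTypes=gpu
NodeName=node[01-16] CPUs=40 Gres=gpu:2

# CPU jobs may use at most 38 cores/node, leaving 2 for the gpu partition
PartitionName=cpu Nodes=node[01-16] MaxCPUsPerNode=38
PartitionName=gpu Nodes=node[01-16] MaxCPUsPerNode=2
```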