[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread wdennis--- via slurm-users
Thanks for the logical explanation, Paul. So when I rewrite my user documentation, I'll mention using `salloc` instead of `srun`. Yes, we do have `LaunchParameters=use_interactive_step` set on our cluster, so salloc gives a shell on the allocated host. Best, Will
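For reference, the behavior Will describes comes from a single slurm.conf line; a minimal sketch (the rest of the file is site-specific):

    # slurm.conf -- make a bare salloc open a shell on the first allocated
    # node instead of leaving the user on the login node
    LaunchParameters=use_interactive_step

With that set, `salloc -N1 --time=1:00:00` lands the user on the compute node as soon as the allocation is granted.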

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Most of my stuff is in the cloud, so I use their load balancing services. HAProxy does have sticky sessions, which you can enable based on IP, so it works with other protocols: 2 Ways to Enable Sticky Sessions in HAProxy (Guide)
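For the on-prem case, a minimal HAProxy sketch of IP-based stickiness (hostnames, addresses, and table sizing are illustrative, not a tested config):

    # haproxy.cfg -- pin each client IP to the same login node
    frontend ssh_in
        mode tcp
        bind *:22
        default_backend login_nodes

    backend login_nodes
        mode tcp
        balance roundrobin
        # remember which server each source IP was sent to, for up to 8 hours
        stick-table type ip size 200k expire 8h
        stick on src
        server login1 10.0.0.11:22 check
        server login2 10.0.0.12:22 check

Keying the stickiness on the source IP rather than an HTTP cookie is what makes it work for SSH and other non-HTTP protocols.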

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Cutts, Tim via slurm-users
HAProxy, for on-prem things. In the cloud I just use their load balancers rather than implement my own. Tim -- Tim Cutts Scientific Computing Platform Lead AstraZeneca Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Dan Healy via slurm-users
Are most of us using HAProxy or something else? On Wed, Feb 28, 2024 at 3:38 PM Brian Andrus via slurm-users < slurm-users@lists.schedmd.com> wrote: > Magnus, > > That is a feature of the load balancer. Most of them have that these days. > > Brian Andrus > > On 2/28/2024 12:10 AM, Hagdorn, Magnus

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Brian Andrus via slurm-users
Magnus, That is a feature of the load balancer. Most of them have that these days. Brian Andrus On 2/28/2024 12:10 AM, Hagdorn, Magnus Karl Moritz via slurm-users wrote: On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote: for us, we put a load balancer in front of the login

[slurm-users] Re: Enforcing relative resource restrictions in submission script

2024-02-28 Thread Jason Simms via slurm-users
Hello Matthew, You may be aware of this already, but most sites would make these kinds of checks/validations using job_submit.lua. I'm not an expert in that - though plenty of others on this list are - but I'm positive you could implement this type of validation logic. I'd like to say that I've co
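To make that concrete, a minimal job_submit.lua sketch (the partition name, limit, and message are hypothetical placeholders; slurm_job_submit/slurm_job_modify, slurm.log_user, and the return codes are the standard Lua plugin interface):

    -- job_submit.lua: reject jobs on a partition when they ask for too many CPUs
    function slurm_job_submit(job_desc, part_list, submit_uid)
        local max_cpus = 4  -- hypothetical site limit
        if job_desc.partition == "interactive" and job_desc.min_cpus > max_cpus then
            slurm.log_user(string.format(
                "interactive jobs are limited to %d CPUs", max_cpus))
            return slurm.ERROR
        end
        return slurm.SUCCESS
    end

    -- required companion hook; accept all modifications unchanged
    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
        return slurm.SUCCESS
    end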

[slurm-users] Re: pty jobs are killed when another job on the same node terminates

2024-02-28 Thread Jason Simms via slurm-users
Hello Thomas, I know I'm a few days late to this, so I'm wondering whether you've made any progress. We experience this, too, but in a different way. First, though, you may be aware, but you should use salloc rather than srun --pty for an interactive session. That's been the preferred method for

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
He's talking about recent versions of Slurm which now have this option: https://slurm.schedmd.com/slurm.conf.html#OPT_use_interactive_step -Paul Edmon- On 2/28/2024 10:46 AM, Paul Raines wrote: What do you mean "operate via the normal command line"?  When you salloc, you are still on the logi

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Raines via slurm-users
What do you mean "operate via the normal command line"? When you salloc, you are still on the login node. $ salloc -p rtx6000 -A sysadm -N 1 --ntasks-per-node=1 --mem=20G --time=1-10:00:00 --gpus=2 --cpus-per-task=2 /bin/bash salloc: Pending job allocation 3798364 salloc: job 3798364 queued

[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users
> I'm running slurm 22.05.11 which is available with OpenHPC 3.x > Do you think an upgrade is needed? I feel that a lot of Slurm operators tend not to use 3rd-party sources of Slurm binaries, as you do not have the build environment fully in your hands. But before making such a complex decision

[slurm-users] Re: salloc+srun vs just srun

2024-02-28 Thread Paul Edmon via slurm-users
salloc is the currently recommended way for interactive sessions. srun is now intended for launching steps or MPI applications. So properly you would salloc and then srun inside the salloc. As you've noticed, with srun you tend to lose control of your shell as it takes over, so you have background
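A minimal example of that pattern (resource numbers are placeholders):

    # get an allocation; with use_interactive_step this opens a shell on the node
    salloc -N1 -n4 --time=2:00:00

    # inside the allocation, launch steps / MPI ranks with srun
    srun -n4 ./my_mpi_app

Each srun inside the allocation becomes its own job step, so sacct accounts for them separately and your shell stays responsive between steps.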

[slurm-users] salloc+srun vs just srun

2024-02-28 Thread wdennis--- via slurm-users
Hi list, In our institution, our instructions to users who want to spawn an interactive job (for us, a bash shell) have always been to do "srun ..." from the login node, which has always been working well for us. But when we had a recent Slurm training, the SchedMD folks advised us to use "sall

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users
I think installing/upgrading the "slurm" rpm will replace this shared lib. Indeed, as always, test it first on a not-so-critical system and use VM snapshots to be able to travel back in time ... as once you upgrade the DB schema (if that is part of the upgrade), you AFAIK cannot go back. josef On 28. 02. 24 15:51
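A minimal pre-upgrade routine along those lines (database name and package names are the usual defaults; adjust to your site):

    # stop the daemon, then snapshot the accounting DB before touching packages
    systemctl stop slurmdbd
    mysqldump --single-transaction slurm_acct_db > slurm_acct_db.pre-upgrade.sql

    # upgrade, then let slurmdbd convert the schema on first start
    dnf upgrade slurm slurm-slurmdbd
    systemctl start slurmdbd

The dump is the only way back once slurmdbd has converted the schema, so keep it until the upgrade has proven itself.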

[slurm-users] How to get usage data for a QOS

2024-02-28 Thread thomas.hartmann--- via slurm-users
Hi, so, I figured out that I can give some users priority access to a specific amount of TRES by creating a QOS with the GrpTRESMins property and the DenyOnLimit,NoDecay flags. This works nicely. However, I would like to know how much of this has already been consumed, and I have not yet found
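One place that does expose the running totals is the association manager state; a sketch (the QOS name and TRES are placeholders):

    # create the QOS as described above
    sacctmgr add qos prio_gpu GrpTRESMins=gres/gpu=10000 Flags=DenyOnLimit,NoDecay

    # show live usage counted against the QOS limits
    scontrol show assoc_mgr flags=qos qos=prio_gpu

The assoc_mgr output prints each limit as limit(consumed), e.g. GrpTRESMins=gres/gpu=10000(532), which should be the number in question.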

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Miriam Olmi via slurm-users
Hi Josef, thanks a lot for your reply! I just checked and you are right!!! My library comes from the old version of slurm: $ rpm -q --whatprovides /usr/lib64/slurm/libslurmfull.so slurm-23.02.3-1.el8.x86_64 I installed the new version of slurm 23.11.0-1 via rpm. How can I fix this? Many thanks

[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users
Hi, I'm running slurm 22.05.11 which is available with OpenHPC 3.x Do you think an upgrade is needed? Best Dietmar On 2/28/24 14:55, Josef Dvoracek via slurm-users wrote: Hi Dietmar; I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently.. I must say that on my setup it l

[slurm-users] Re: [EXTERN] Re: sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users
Hi Hermann, I get: Cpus_allowed: ffffffff,ffffffff,ffffffff Cpus_allowed_list: 0-95 Best Dietmar p.s.: best regards from the CCB On 2/28/24 15:01, Hermann Schwärzler via slurm-users wrote: Hi Dietmar, what do you find in the output-file of this job sbatch --time 5 --cpus-per-task=1 --wrap '

[slurm-users] Re: slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Josef Dvoracek via slurm-users
I see this question unanswered so far.. so I'll give you my 2 cents: Quick check reveals that mentioned symbol is in libslurmfull.so : [root@slurmserver2 ~]# nm -gD /usr/lib64/slurm/libslurmfull.so | grep "slurm_conf$" 000d2c06 T free_slurm_conf 000d3345 T init_slurm_conf 0

[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Hermann Schwärzler via slurm-users
Hi Dietmar, what do you find in the output-file of this job sbatch --time 5 --cpus-per-task=1 --wrap 'grep Cpus /proc/$$/status' On our 64-core machines with hyperthreading enabled I see e.g. Cpus_allowed: 04000000,00000000,04000000,00000000 Cpus_allowed_list: 58,122 Greetings Hermann

[slurm-users] Re: sbatch and cgroup v2

2024-02-28 Thread Josef Dvoracek via slurm-users
Hi Dietmar; I tried this on ${my cluster}, as I switched to cgroupsv2 quite recently.. I must say that on my setup it looks like it works as expected, see the grepped stdout from your reproducer below. I use recent Slurm 23.11.4. Wild guess.. Has your build machine the bpf and dbus devel packages in

[slurm-users] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Josef Dvoracek via slurm-users
For some unclear reason "--wrap" was not part of my /repertoire/ so far. Thanks. On 26. 02. 24 9:47, Ward Poelmans via slurm-users wrote: sbatch --wrap 'screen -D -m' srun --jobid --pty screen -rd
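Spelled out end to end (the jobid plumbing and time limit are illustrative):

    # start a detached screen session inside a batch allocation
    jobid=$(sbatch --parsable --time=8:00:00 --wrap 'screen -D -m')

    # later, from the login node: attach to it inside the job's allocation
    srun --jobid=$jobid --overlap --pty screen -rd

The screen session then lives under Slurm's control: it survives login-node reboots and is cleaned up when the job ends.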

[slurm-users] Re: GPU shards not exclusive

2024-02-28 Thread wdennis--- via slurm-users
Hi Reed, Unfortunately, we had the same issue with 22.05.9; SchedMD advice was to upgrade to 23.11.x, and this appears to have resolved this issue for us. SchedMD support said to us, "We did a lot of work regarding shards in the 23.11 release." HTH, Will

[slurm-users] sbatch and cgroup v2

2024-02-28 Thread Dietmar Rieder via slurm-users
Hi, I'm new to Slurm, but maybe someone can help me: I'm trying to restrict the CPU usage to the actually requested/allocated resources using cgroup v2. For this I made the following settings in slurm.conf: ProctrackType=proctrack/cgroup TaskPlugin=task/cgroup,task/affinity And in cgroup
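For comparison, the full set of files involved in a typical cgroup v2 setup (a sketch, not Dietmar's exact configuration):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    CgroupPlugin=cgroup/v2
    ConstrainCores=yes
    ConstrainRAMSpace=yes

ConstrainCores=yes in cgroup.conf is what actually fences a job into its allocated CPUs; task/affinity alone only sets an affinity mask, which a process can change back.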

[slurm-users] User-facing documentation on shard use

2024-02-28 Thread wdennis--- via slurm-users
Hello list, We have just enabled "gres/shard" in order to enable sharing of GPUs on our cluster. I am now looking for examples of user-facing documentation on this feature. If anyone has something, and can send a URL or other example, I'd appreciate it. Thanks, Will
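In case it helps as a seed for such documentation, the user-facing side is small (counts and script names are illustrative):

    # request one shard (a fraction) of a GPU for a batch job
    sbatch --gres=shard:1 train.sh

    # interactive session holding two shards
    srun --gres=shard:2 --pty bash

We plan to tell users to request shards for light inference-style work and --gres=gpu:1 only when they genuinely need a whole card.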

[slurm-users] slurmdbd error - Symbol `slurm_conf' has different size in shared object

2024-02-28 Thread Miriam Olmi via slurm-users
Hi all, I am having some issues with the new version of slurm 23.11.0-1. I had already installed and configured slurm 23.02.3-1 on my cluster and all the services were active and running properly. Following the instructions on the official Slurm webpage, for the moment I upgraded only the slur

[slurm-users] Partition, Qos Limits & Scheduling of large jobs

2024-02-28 Thread Muck, Katrin via slurm-users
Hi everyone! I have read the Slurm documentation about QOS, resource limits, scheduling and priority multiple times now and even looked into the Slurm source, but I'm still not sure if I got everything correctly, so this is why I decided to ask here ... The problem: we see the effect that sometim

[slurm-users] Re: [ext] Re: canonical way to run longer shell/bash interactive job (instead of srun inside of screen/tmux at front-end)?

2024-02-28 Thread Hagdorn, Magnus Karl Moritz via slurm-users
On Tue, 2024-02-27 at 08:21 -0800, Brian Andrus via slurm-users wrote: > for us, we put a load balancer in front of the login nodes with > session > affinity enabled. This makes them land on the same backend node each > time. Hi Brian, that sounds interesting - how did you implement session affin