[slurm-users] Re: Node CPU Allocation and Effective CPU Allocation for Running Jobs

2024-03-17 Thread Mark Hahn via slurm-users
You appear to have HT/SMT enabled, so I would guess Slurm is treating the node as 256 threads, 128 cpus. In other words, it'll depend on how jobs request resources (by thread or by core). You can force Slurm to ignore this distinction, if that's what you really want. regards, mark ha

Re: [slurm-users] how to know the real utilization of a node when oversubscribe is set to FORCE

2020-07-16 Thread Mark Hahn
srun -N 1 -n 1 -p testA sleep 10 then the cpurawtime of this job recorded by slurm is 640s, but actually this job only used 10s; so, I want to know whether there is any way to get the real cputime used by this job in slurm. if you really mean cpu time (compute-bound, not elapsed), then don't you just w
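
The 640s figure in the question is consistent with how sacct documents CPUTimeRAW: elapsed time multiplied by the allocated CPU count, not the CPU time actually consumed. A minimal sketch of that arithmetic follows. The value of 64 allocated CPUs is an inference from 640s / 10s, not something stated in the thread, and the function names are hypothetical.

```python
def cpu_time_raw(elapsed_s: int, alloc_cpus: int) -> int:
    """Accounted CPU time in seconds: elapsed seconds times allocated CPUs."""
    return elapsed_s * alloc_cpus


def implied_alloc_cpus(cpu_time_raw_s: int, elapsed_s: int) -> float:
    """Work backwards from an accounted figure to the implied CPU count."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_time_raw_s / elapsed_s


if __name__ == "__main__":
    # Numbers from the question: a 10 s sleep accounted as 640 s.
    # ASSUMPTION: the difference comes from the allocated CPU count alone.
    print(implied_alloc_cpus(640, 10))
    print(cpu_time_raw(10, 64))
```

sacct also has TotalCPU, UserCPU and SystemCPU fields, which report time the processes actually spent on the CPU; for a job that only sleeps these should be close to zero.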

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-19 Thread Mark Hahn
might want login nodes of different clusters to trust each other. The big win is that you entirely avoid the presence of private keys on the cluster. We've used this widely in ComputeCanada since about 2003. regards, mark hahn.

Re: [slurm-users] how to restrict jobs

2020-05-06 Thread Mark Hahn
Is there no way to set or define a custom variable like at node level and you could use a per-node Feature for this, but a partition would also work.

Re: [slurm-users] [External] slurmd: error: Node configuration differs from hardware: CPUs=24:48(hw) Boards=1:1(hw) SocketsPerBoard=2:2(hw)

2020-04-23 Thread Mark Hahn
bbles. (So only helps when the workload doesn't keep all resources busy.) regards, mark hahn.

Re: [slurm-users] Srun not setting DISPLAY with --x11 for one account

2020-01-27 Thread Mark Hahn
it's worth noting that host-based trust has a lot of nice properties for this kind of intra-cluster authentication. and in particular, you don't need fragile and potentially dangerous keys sitting around. regards, mark hahn -- operator may differ from spokesperson.

Re: [slurm-users] Virtual memory size requested by slurm

2020-01-26 Thread Mark Hahn
As a follow up to my last problem, I would like to know how can I tell slurm to increase the virtual memory size for a process? have you read the messages on this list? first, you can ask for the correct amount of memory. this approach assumes that it is dangerous to allow VSZ > RSS. that's c

Re: [slurm-users] blastx fails with "Error memory mapping"

2020-01-24 Thread Mark Hahn
d a large file. According to [1], " No value is provided by cgroups for virtual memory size ('vsize') " [1] https://slurm.schedmd.com/slurm.conf.html depends on whether "ConstrainSwapSpace=yes" appears in cgroup.conf. (it's yes on the system above) rega

Re: [slurm-users] blastx fails with "Error memory mapping"

2020-01-24 Thread Mark Hahn
he cgroup control to tell the kernel how much memory to permit the job-step to use. I would like to locally solve the problem for blast and I am not seeking a system wide solution right now. there's nothing unique about your system or blast (which is extremely common on many large slurm installs). regards, mark hahn

Re: [slurm-users] blastx fails with "Error memory mapping"

2020-01-24 Thread Mark Hahn
ove you're referring to the base cgroup, not the cgroup for your job.) of course, manually fighting Slurm is a Fairly Bad Idea. you should read the documentation on cgroups to understand how these work. memsw basically corresponds to VSZ in ps, whereas mem corresponds with RSS. regards, mark hahn.
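
To see the two per-process figures the message refers to, one option on Linux is to read them from /proc: the VSZ and RSS columns of ps correspond to the VmSize and VmRSS lines of /proc/<pid>/status. The sketch below only inspects a process; it does not read or change any cgroup limit, and the helper names are hypothetical.

```python
import os


def parse_vm_sizes(status_text: str) -> dict:
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    sizes = {"VmSize": None, "VmRSS": None}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in sizes and rest.split():
            # Lines look like "VmSize:\t  123456 kB"; keep the number only.
            sizes[key] = int(rest.split()[0])
    return sizes


def vm_sizes_of(pid="self") -> dict:
    """Read the status file of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vm_sizes(fh.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        print(vm_sizes_of())
```

A program that memory-maps a large file, as in this thread's subject, will typically show VmSize far above VmRSS, because mapped pages count toward virtual size before they are touched.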

Re: [slurm-users] slurm reporting

2019-11-29 Thread Mark Hahn
different, ad-hoc format), and partly to keep systems loosely coupled. regards, mark hahn

Re: [slurm-users] slurm reporting

2019-11-26 Thread Mark Hahn
that if you want, you can write a 10-line python script to generate a report (maybe joining data in a way Grafana doesn't let you.) Or if you want to create automated actions (email notice, etc), even mods to Slurm controls. regards, -- Mark Hahn | SHARCnet Sysadmin | h...@sharcn

Re: [slurm-users] Array jobs vs. many jobs

2019-11-24 Thread Mark Hahn
sorting the pending queue. to some extent, they let you view a set of jobs as a unit, but you can also organize sets of jobs via jobname. regards, mark hahn

Re: [slurm-users] Example 16 of CPU Management User and Administrator Guide does not work.

2019-11-20 Thread Mark Hahn
try CoresPerSocket=12 here, to match the provided lscpu? (normally also ThreadsPerCore=2, since HT is enabled.) regards, mark hahn.

Re: [slurm-users] How to use a pyhon virtualenv with srun?

2019-11-18 Thread Mark Hahn
indeed, the install would have to be performed via srun. regards, mark hahn.

Re: [slurm-users] Help with preemtion based on licenses

2019-11-06 Thread Mark Hahn
g and substracting is not. regards, mark hahn.

Re: [slurm-users] Help with preemtion based on licenses

2019-11-05 Thread Mark Hahn
llout to query the number of free licenses, and consider a job eligible to start if its declared usage fits (gres in Slurm terms, I think). regards, mark hahn

Re: [slurm-users] job priority keeping resources from being used?

2019-11-01 Thread Mark Hahn
In theory, these small jobs could slip in and run alongside the large jobs, what are your SelectType and SelectTypeParameters settings? ExclusiveUser=YES on partitions? regards, mark hahn.

Re: [slurm-users] calculate license tokens from cpus

2019-10-29 Thread Mark Hahn
hat script calculates and sets the --licenses. I would like to expose user to slurm as much as possible, and use scripts as little as possible. Well, that exposes it to a lot of error and possibly abuse. regards, mark hahn.

Re: [slurm-users] OverMemoryKill Not Working?

2019-10-25 Thread Mark Hahn
need. I simply want to enforce the memory limits as specified by the user at job submission time. This seems to have been the behavior in previous but cgroups (with Constrain) do that all by themselves. If someone could post just a simple slurm.conf file that forces the memory limits to be

Re: [slurm-users] How to find core count per job per node

2019-10-18 Thread Mark Hahn
this! we've had a lot of discussion on how to collect this information as well, even whether it would be worth doing in a prolog script... regards, mark hahn.

Re: [slurm-users] Slurm very rarely assigned an estimated start time to a job

2019-10-02 Thread Mark Hahn
er to work well, it needs a particular distribution of job priorities. regards, mark hahn.

Re: [slurm-users] MPI jobs via mirun vs. srun through PMIx.

2019-09-17 Thread Mark Hahn
dea to use the pam adopt-to-slurm plugin, which makes even scheduler-oblivious mpirun behave better. regards, mark hahn

Re: [slurm-users] MaxRSS not showing up in sacct

2019-09-15 Thread Mark Hahn
When I use sacct to show job stats, it always has a blank entry for the MaxRSS field. Is there something that needs enabled to get that in? missing for steps as well or only when using --allocations? regards, mark hahn.

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-29 Thread Mark Hahn
he resources, as long as it's only what's allocated to their jobs, doesn't interfere with other users, and is hopefully reasonably efficient. heck, we configure clusters with hostbased trust, so it's easy for users to ssh among nodes. regards, mark hahn.

Re: [slurm-users] Slurm 19.05 --workdir non existent?

2019-08-15 Thread Mark Hahn
ption parsing. regards, mark hahn.

Re: [slurm-users] [Long] Why are tasks started on a 30 second clock?

2019-07-25 Thread Mark Hahn
I'll be very grateful if anyone can explain where does the 30-second clock hide! how about a timeout from elsewhere? for instance, when I see a 30s delay, I normally at least check DNS, which can introduce such quantized delays. regards, mark hahn.
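
One way to act on the DNS suggestion is to time a name lookup from the affected node and see whether it takes a suspiciously round number of seconds. This is a minimal sketch using only the Python standard library; the host name in the usage lines is a placeholder, and a slow lookup is a hint to investigate, not proof that DNS is the cause of the 30-second clock.

```python
import socket
import time


def time_lookup(host: str):
    """Return (elapsed_seconds, resolved_ok) for one name lookup."""
    start = time.monotonic()
    try:
        socket.getaddrinfo(host, None)
        resolved = True
    except socket.gaierror:
        # The lookup failed; the elapsed time is still informative,
        # because resolver timeouts show up as multi-second delays.
        resolved = False
    return time.monotonic() - start, resolved


if __name__ == "__main__":
    # "localhost" is a placeholder; substitute the controller or a
    # compute node name from your own cluster.
    elapsed, ok = time_lookup("localhost")
    print(f"lookup took {elapsed:.3f}s, resolved={ok}")
```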

Re: [slurm-users] No error/output/run

2019-07-24 Thread Mark Hahn
on the compute nodes? over-quota even? I would certainly examine the slurm logs on the compute nodes. regards, mark hahn.

Re: [slurm-users] GPU machines only run a single GPU job despite resources being available.

2019-07-16 Thread Mark Hahn
#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1
could it be that sbatch is defaulting to --mem=0, meaning "all the node's memory"? regards, mark hahn.

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-15 Thread Mark Hahn
/step_extern
4:devices:/slurm/uid_3000566/job_17268219/step_extern
3:cpuset:/slurm/uid_3000566/job_17268219/step_extern
2:cpuacct,cpu:/
1:name=systemd:/system.slice/sshd.service
regards, mark hahn

Re: [slurm-users] Running pyMPI on several nodes

2019-07-12 Thread Mark Hahn
look at slurmd logs on the nodes. regards, mark hahn.

Re: [slurm-users] Specify number of cores only?

2019-07-10 Thread Mark Hahn
Is there a way to instruct SBATCH to submit a job with a certain number of cores without specifying anything else? I don't care which nodes or sockets they run on. They would only use one thread per core. not just --ntasks? regards, mark hahn.

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-13 Thread Mark Hahn
would have expected a different approach: use a unique string for the jobname, and always verify after submission. after all, squeue provides a --name parameter for this (efficient query by logical job "identity"). regards, mark hahn.
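
The approach described above (a unique job name, then verification via squeue) can be sketched as follows. The helpers only build command lines and parse output; they do not run anything. The function names, the "job" prefix and the script name are hypothetical, while `--job-name` (sbatch) and `--name`, `--noheader`, `--format=%i` (squeue) are standard options.

```python
import uuid


def make_job_name(prefix: str = "job") -> str:
    """Build a name that is unique per submission attempt."""
    return f"{prefix}-{uuid.uuid4().hex}"


def sbatch_argv(job_name: str, script: str) -> list:
    """Command line that submits the script under the unique name."""
    return ["sbatch", f"--job-name={job_name}", script]


def squeue_argv(job_name: str) -> list:
    """Command line that lists only the job IDs carrying that name."""
    return ["squeue", "--noheader", f"--name={job_name}", "--format=%i"]


def parse_job_ids(squeue_stdout: str) -> list:
    """One job ID per non-empty output line."""
    return [line.strip() for line in squeue_stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    name = make_job_name()
    print(sbatch_argv(name, "job.sh"))  # "job.sh" is a placeholder script
    print(squeue_argv(name))
```

If sbatch reports a socket timeout, running the squeue command for the same name before resubmitting shows whether the job was accepted anyway, which avoids duplicate submissions.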

Re: [slurm-users] Feature request: create a job id before job submission

2019-05-07 Thread Mark Hahn
name somewhat richer (username, account, etc) regards, mark hahn.

Re: [slurm-users] Feature request: create a job id before job submission

2019-05-07 Thread Mark Hahn
eric jobid, and why would configuring the scratch space be too slow to perform in the job prolog? regards, mark hahn.

Re: [slurm-users] Job dispatching policy

2019-04-30 Thread Mark Hahn
Also why aren't you using the Slurm commands to run things? Which command? srun or sbatch

Re: [slurm-users] job startup timeouts?

2019-04-27 Thread Mark Hahn
difference between 1 task/node and all threads/node? regards, mark hahn.

Re: [slurm-users] Priority access for a group of users

2019-03-01 Thread Mark Hahn
our). I would be interested to know whether other Slurm sites do this successfully, particularly in avoiding the victim-stays-suspended priority inversion. thanks, mark hahn.

Re: [slurm-users] Kinda Off-Topic: data management for Slurm clusters

2019-02-22 Thread Mark Hahn
tem on ssd)? https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/storage_administration_guide/ch-fscache if files are being re-read, this would be effective, fast, and convenient, and wouldn't require any staging or hooks into Slurm. regards, mark hahn

Re: [slurm-users] salloc with bash scripts problem

2019-01-02 Thread Mark Hahn
nteractive (and salloc's is). this may affect partition choice, etc. regards, mark hahn.

Re: [slurm-users] salloc with bash scripts problem

2019-01-02 Thread Mark Hahn
-requiring script. why not just: salloc --x11 srun ./whateveryourscriptwas regards, mark hahn.

Re: [slurm-users] requesting resources and afterwards launch an array of calculations

2018-12-19 Thread Mark Hahn
but are not bothering the scheduler during the job. an alternative would be to run something like GNU Parallel within the job. regards, mark hahn.

Re: [slurm-users] About x11 support

2018-11-23 Thread Mark Hahn
erver, and also on the same (trusted/routable) network as the compute node? In which case you don't want Slurm doing anything at all. Just let the X client read DISPLAY from the environment propagated by Slurm. regards, mark hahn

Re: [slurm-users] Over-subscription for a GRES type

2018-11-23 Thread Mark Hahn
We have a use-case in that the GRES being tracked on a particular partition are GPU cards, but aren't being used by applications that would require them exclusively (lightweight direct rendering rather than GP-GPU/CUDA the issue is that slurm/kernel can't arbitrate resources on the GPU, so overs