Just double checking. Can you check on your worker node
1. ls -la /etc/pam.d/*slurm*
(just checking if there's a specific pam file for slurmd on your system)
2. scontrol show config | grep -i SlurmdUser
(checking if slurmd is set up with a different user to SlurmUser)
3. grep slurm /e
slurmctld runs as the user slurm, whereas slurmd runs as root.
Make sure the permissions on /app/slurm-24.0.8/lib/slurm allow the user slurm
to read the files
e.g. you could do (as root)
sudo -u slurm ls /app/slurm-24.0.8/lib/slurm
and see if the slurm user can read the directory (as well as t
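If it can't read them, something like this (run as root) is usually enough to
fix the permissions, assuming nothing else on that node needs tighter access
to the tree:
chmod -R o+rX /app/slurm-24.0.8/lib/slurm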
On the worker node, check if cgroups are mounted
grep cgroup /proc/mounts
(normally it's in /sys/fs/cgroup )
then check if Slurm is setting up the cgroup
find /sys/fs/cgroup | grep slurm
e.g.
[root@spartan-gpgpu164 ~]# find /sys/fs/cgroup/memory | grep slurm
/sys/fs/cgroup/memory/slurm
/sys/f
Hi Willy,
sacctmgr modify account slurmaccount user=baduser set maxjobs=0
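You can confirm the limit took effect with something like
sacctmgr show assoc where user=baduser format=Account,User,MaxJobs
(a sketch; adjust the format fields to taste). Setting maxjobs back to -1
clears the limit again.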
Sean
From: slurm-users on behalf of
Markuske, William
Sent: Friday, 26 May 2023 09:16
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] Temporary Stop User Submission
Hi Jeff,
The support system is here - https://bugs.schedmd.com/
Create an account, log in, and when creating a request, select your site from
the Site selection box.
Sean
From: slurm-users on behalf of Jeffrey
R. Lang
Sent: Friday, 25 March 2022 08:48
To: slu
Did you build Slurm yourself from source? If so, when you build from source, on
that node, you need to have the munge-devel package installed (munge-devel on
EL systems, libmunge-dev on Debian)
You then need to set up munge with a shared munge key between the nodes, and
have the munge daemon ru
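A rough sketch of that munge setup, assuming EL-style paths and systemd
(adjust for your distro; "worker" stands for your compute node's hostname):
scp /etc/munge/munge.key worker:/etc/munge/munge.key
ssh worker 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
ssh worker 'systemctl enable --now munge'
munge -n | ssh worker unmunge
The last command should decode cleanly on the worker if the keys match.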
Any error in slurmd.log on the node or slurmctld.log on the ctl?
Sean
From: slurm-users on behalf of Wayne
Hendricks
Sent: Saturday, 15 January 2022 16:04
To: slurm-us...@schedmd.com
Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5
n of Mariadb are you using?
Brian Andrus
On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote:
After installation of libmariadb-dev, I have reinstalled the entire slurm with
./configure + options, make, and make install. Still,
accounting_storage_mysql.so is missing.
On Sat, Dec 4, 2021 at 12:24 A
Did you run
./configure (with any other options you normally use)
make
make install
on your DBD server after you installed the mariadb-devel package?
From: slurm-users on behalf of Giuseppe
G. A. Celano
Sent: Saturday, 4 December 2021 10:07
To: Slurm User Comm
Sent: Thursday, 21 October 2021 21:54
To: slurm-users@lists.schedmd.com ; Sean Crosby
Subject: Re: [EXT] Re: [slurm-users] Missing data in sreport for a time period
in slurm
Hi Sean,
After changing those values yesterda
; Sean Crosby
Subject: Re: [EXT] Re: [slurm-users] Missing data in sreport for a time period
in slurm
Dear All,
By checking the value of last ran table, hourly rollup shows today's
sreport keeps track of when it has done the last rollup calculations in the
database.
Open MySQL for your Slurm accounting database, do
select * from slurm_acct_db.clustername_last_ran_table;
where slurm_acct_db is your accounting database name (slurm_acct_db is
default), and clustername is
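The values in that table are unix timestamps; assuming the default column
names, something like
select from_unixtime(hourly_rollup), from_unixtime(daily_rollup), from_unixtime(monthly_rollup) from slurm_acct_db.clustername_last_ran_table;
makes them human-readable.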
s root ? Can this be an issue
Amjad
On Tue, Aug 31, 2021 at 8:22 AM Sean Crosby <scro...@unimelb.edu.au> wrote:
What does sacctmgr show for the user you added to have access to the QoS, and
what does Slurm show for the partition config?
sacctmgr show account withassoc -p
scontr
...@gmail.com>> wrote:
Hi Sean,
Thanks for the suggestion, seems to work now.
Majid
On Fri, Aug 27, 2021 at 12:56 PM Sean Crosby <scro...@unimelb.edu.au> wrote:
Hi Amjad,
Make sure you have qos in the config entry AccountingStorageEnforce
e.g.
AccountingStorageEnforce=associa
Hi Fritz,
job_submit_lua.so gets made upon compilation of Slurm if you have the lua-devel
package installed at the time of configure/make.
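A sketch of the rebuild, assuming an EL system (on Debian the package is the
liblua...-dev variant) and that you build in the extracted source tree:
yum install lua-devel
./configure <your usual options>
make && make install
Afterwards job_submit_lua.so should show up under the lib/slurm directory of
your install prefix.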
Sean
From: slurm-users on behalf of
Ratnasamy, Fritz
Sent: Tuesday, 31 August 2021 15:05
To: Slurm User Community List
S
Hi Amjad,
Make sure you have qos in the config entry AccountingStorageEnforce
e.g.
AccountingStorageEnforce=associations,limits,qos,safe
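You can confirm what the running slurmctld has picked up with
scontrol show config | grep AccountingStorageEnforce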
Sean
From: slurm-users on behalf of Amjad
Syed
Sent: Friday, 27 August 2021 20:28
To: slurm-us...@schedmd.com
Subject: [
Hi Felix,
From one of the recent Slurm user group meetings, the recommended way to
logrotate the Slurm logs is to send SIGUSR2.
My logrotate entry is
/var/log/slurm/slurmctld.log {
compress
missingok
nocopytruncate
nocreate
delaycompress
nomail
notifempty
noolddir
rotate 5
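The part that actually signals the daemon is a postrotate block; a sketch of
how the entry can end, assuming the default SlurmctldPidFile of
/var/run/slurmctld.pid (scontrol show config | grep -i pidfile will show yours):
postrotate
kill -USR2 $(cat /var/run/slurmctld.pid)
endscript
}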
Hi Mike,
To build pam_slurm_adopt, you need the pam-devel package installed on the node
you're building Slurm on.
On RHEL, it's pam-devel, and Debian it's libpam-dev
Once you have installed that, do ./configure again, and then you should be able
to make the pam_slurm_adopt
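A sketch of the full sequence on the build host, assuming you build from the
extracted source tree:
yum install pam-devel (or apt-get install libpam-dev on Debian)
./configure <your usual options>
make
cd contribs/pam_slurm_adopt
make && make install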
Sean
__
Hi Sid,
On our cluster, it performs just like your PBS cluster.
$ srun -N 1 --cpus-per-task 8 --time 01:00:00 --mem 2g --partition physicaltest
-q hpcadmin --pty python3
srun: job 27060036 queued and waiting for resources
srun: job 27060036 has been allocated resources
Python 3.6.8 (default, Aug
We use sacctmgr list stats for our Slurmdbd check
Our Nagios check is
RESULT=$(/usr/local/slurm/latest/bin/sacctmgr list stats)
if [ $? -ne 0 ]
then
echo "ERROR: cannot connect to database"
exit 2
fi
echo "$RESULT" | head -n 4
exit 0
Sean
From: sl
Hi Paul,
Try
sacctmgr modify qos gputest set flags=DenyOnLimit
Sean
From: slurm-users on behalf of Paul
Raines
Sent: Saturday, 29 May 2021 12:48
To: slurm-users@lists.schedmd.com
Subject: [EXT] [slurm-users] rejecting jobs that exceed QOS limits
ling=256
>AllocTRES=
>CapWatts=n/a
>CurrentWatts=0 AveWatts=0
>ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>Comment=(null)
>
>
>
>
> On Sun, Apr 11, 2021 at 2:03 AM Sean Crosby
> wrote:
>
>> Hi Cristobal,
>>
>> My hunch is
Hi Cristobal,
My hunch is it is due to the default memory/CPU settings.
Does it work if you do
srun --gres=gpu:A100:1 --cpus-per-task=1 --mem=10G nvidia-smi
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of
name resolution works. You have set the names in
Slurm to be wn001-wn044, so every node has to be able to resolve those
names. Hence the check using ping
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Vic
node
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Thu, 8 Apr 2021 at 16:38, Ioannis Botsis wrote:
I just checked my cluster and my spool dir is
SlurmdSpoolDir=/var/spool/slurm
(i.e. without the d at the end)
It doesn't really matter, as long as the directory exists and has the
correct permissions on all nodes
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Comp
rectory on all your
nodes. It needs to be owned by user slurm
ls -lad /var/spool/slurmd
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 20:37, Sean Cro
ow cluster
If that doesn't work, try changing AccountingStorageHost in slurm.conf to
localhost as well
For your worker nodes, your nodes are all in drain state.
Show the output of
scontrol show node wn001
It will give you the reason for why the node is drained.
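Once you've fixed whatever that reason points to, you can return the node to
service with something like
scontrol update NodeName=wn001 State=RESUME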
Sean
--
Sean Crosby | S
It looks like your attachment of sinfo -R didn't come through
It also looks like your dbd isn't set up correctly
Can you also show the output of
sacctmgr list cluster
and
scontrol show config | grep ClusterName
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lea
The other thing I notice for my slurmdbd.conf is that I have
DbdAddr=localhost
DbdHost=localhost
You can try changing your slurmdbd.conf to set those 2 values as well to
see if that gets slurmdbd to listen on port 6819
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research
Interesting. It looks like slurmdbd is not opening the 6819 port
What does
ss -lntp | grep 6819
show? Is something else using that port?
You can also stop the slurmdbd service and run it in debug mode using
slurmdbd -D -vvv
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
What's the output of
ss -lntp | grep $(pidof slurmdbd)
on your dbd host?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 6 Apr 2021 at 05:00, wrote:
&g
try connecting to port 6819 on the host 10.0.0.100, and output
nothing if the connection works, and would output Connection not working
otherwise
I would also test this on the DBD server itself
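For reference, the kind of one-liner I mean, assuming nc is installed on the
box you test from:
nc -z 10.0.0.100 6819 || echo Connection not working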
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business
out the lines
AccountingStorageUser=slurm
AccountingStoragePass=/run/munge/munge.socket.2
You shouldn't need those lines
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
O
> David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
> dw...@drexel.edu 215.571.4335 (o)
> For URCF support: urcf-supp...@drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
>
> --
> *From:* slurm
What are your Slurm settings - what are the values of
ProctrackType
JobAcctGatherType
JobAcctGatherParams
and what's the contents of cgroup.conf? Also, what version of Slurm are you
using?
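A quick way to gather those in one go (cgroup.conf sits alongside slurm.conf;
/etc/slurm/cgroup.conf is assumed here, adjust if your config dir differs):
scontrol show config | grep -E 'ProctrackType|JobAcctGatherType|JobAcctGatherParams'
cat /etc/slurm/cgroup.conf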
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services
On Sat, 13 Mar 2021 at 08:48, Prentice Bisbal wrote:
> --
>
> It sounds like you're confusing job steps and tasks. For an MPI program,
> tasks and MPI ranks are the same thin
r QoS, set the OverPartQOS flag, and get the
users to specify that QoS.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Tue, 2 Mar 2021 at 08:24, Stack Korora wrote:
&g
~]# cat
/sys/fs/cgroup/cpuset/slurm/uid_11470/job_24115684/cpuset.cpus
58
I will keep searching. I know we capture the real CPU ID as well, using
daemons running on the worker nodes, and we feed that into Ganglia.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing
Licenses=(null) Network=(null)
Note the CPU_IDs and GPU IDX in the output
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Fri, 5 Feb 2021 at 02:01, Thomas Zeiser
It shows that for this node, it has 72 cores and 1.5TB RAM (the CfgTRES
part), and currently jobs are using 72 cores, and 442GB RAM.
I would run the same command on 4 or 5 of the nodes on your cluster, and
we'll have a better idea about what's going on.
Sean
--
Sean Crosby | Senior Dev
MSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
TaskAffinity=no
CgroupMountpoint=/sys/fs/cgroup
The ConstrainDevices=yes is the key to stopping jobs from having access to
GPUs they didn't request.
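An easy sanity check: inside a job that requested a single GPU, run
nvidia-smi (e.g. srun --gres=gpu:1 nvidia-smi); with ConstrainDevices=yes it
should only list that one GPU.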
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Com
contact the
new compute node on SlurmdPort.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 16 Dec 2020 at 03:48, Olaf Gellert wrote:
Hi Loris,
We have a completely separate test system, complete with a few worker
nodes, separate slurmctld/slurmdbd, so we can test Slurm upgrades etc.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne
NO_VAL) then
slurm.user_msg("--gpus-per-task option requires --tasks specification")
return ESLURM_BAD_TASK_COUNT
end
end
end
end
end
end
end
Let me know if you improve it
nodes, try communicating with
the other slurmd daemons
e.g. from SRVGRIDSLURM01 do
nc -z SRVGRIDSLURM02 6818 || echo Cannot communicate
nc -z srvgridslurm03 6818 || echo Cannot communicate
Replace 6818 with the port you get from the scontrol show config command
earlier
Sean
--
Sean Crosby | S
Make sure slurmd on the client is stopped, and then run it in verbose mode
in the foreground
e.g.
/usr/local/slurm/latest/sbin/slurmd -D -v
Then post the output
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of
Hi Lars,
Do the regular slurm commands work from the client?
e.g.
squeue
scontrol show part
If they don't, it would be a sign of communication problems.
Is there a software firewall running on the master/client?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Res
fying timelimit
accurately) means that cores will go idle when there are jobs that could
use them. If you're happy with that, then all is fine.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Vic
$?
1
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 8 Jul 2020 at 01:14, Jason Simms wrote:
Can you see if it is set? (e.g. using scontrol show job 337475 or sacct -j
337475 -o Timelimit)
Sean
>
> Thanks again
>
> On Tue, Jul 7, 2020 at 11:39 AM Sean Crosby
> wrote:
>
>> Hi,
>>
>> What you have described is how the backfill scheduler works. If a lower
y job from
starting in its original time.
In your example job list, can you also list the requested times for each
job? That will show if it is the backfill scheduler doing what it is
designed to do.
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services
You have to install the pam-devel package on the server you use to build
Slurm. You'll then need to run ./configure and then make.
Then you'll be able to make the files in the contrib/pam_slurm_adopt folder
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research
could have different QoS names for all the partitions across
all of your clusters, and set the limits on the QoS?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Sat, 20 Jun
Do you have other limits set? The QoS is hierarchical, and especially
partition QoS can override other QoS.
What's the output of
sacctmgr show qos -p
and
scontrol show part
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Service
Hi Thomas,
That value should be
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
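You can check it has been applied with something like
sacctmgr show qos gpujobs format=Name,MaxTRESPU
(MaxTRESPU being the per-user TRES limit column).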
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 6 May 2020 at 04:53, Theis
Hi Lisa,
cons_tres is part of Slurm 19.05 and higher. As you are using Slurm 18.08,
it won't be there. The select plugin for 18.08 is cons_res.
Is there a reason why you're using an old Slurm?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computin
Who owns the munge directory and key? Is it the right uid/gid? Is the munge
daemon running?
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Thu, 16 Apr 2020 at 04:57, Dean
What happens if you change
AccountingStorageHost=localhost
to
AccountingStorageHost=192.168.1.1
i.e. same IP address as your ctl, and restart the ctld
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 26 Feb 2020 at 20:52, Pär Lundö <par.lu...@foi.se> wrote:
Hi,
Thank you for your quick replies.
Please bear with m
What services did you restart after changing the slurm.conf? Did you do an
scontrol reconfigure?
Do you have any reservations? scontrol show res
Sean
On Tue, 17 Dec. 2019, 10:35 pm Mahmood Naderan <mahmood...@gmail.com> wrote:
>Your running job is requesting 6 CPUs per node (4 nodes, 6
Hi Mahmood,
Your running job is requesting 6 CPUs per node (4 nodes, 6 CPUs per node). That
means 6 CPUs are being used on node hpc.
Your queued job is requesting 5 CPUs per node (4 nodes, 5 CPUs per node). In
total, if it was running, that would require 11 CPUs on node hpc. But hpc only
has 1
Looking at the SLURM code, it looks like it is failing with a call to
getpwuid_r on the ctld
What do these show (on slurm-master):
getent passwd turing
getent passwd 1000
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Platform Services | Business Services
CoEPP Research
Hi David,
What does:
scontrol show node orange01
scontrol show node orange02
show? Just to see if there's a default node weight hanging around, and if your
weight changes have been picked up.
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Ser
se_uid
session required pam_unix.so
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Wed, 17 Jul 2019 at 21:05, Andy Georges <andy.geor...@ugent.be> wrote:
scheduling individually. The default value is 1.
Add Weight=1000 to the serv1 line, and serv2 should be given the job first.
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Sun, 1
How did you compile SLURM? Did you add the contribs/pmi and/or contribs/pmi2
plugins to the install? Or did you use PMIx?
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Thu
Hi Andrés,
Did you recompile OpenMPI after updating to SLURM 19.05?
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne
On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz
mailto:ama
de has been revamped, and no longer relies on
libssh2 to function. However, support for --x11 alongside sbatch has
been removed, as the new forwarding code relies on the allocating
salloc or srun command to process the forwarding.
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Re
Hi Mahmood,
I've never tried using the native X11 of SLURM without being ssh'ed into the
submit node.
Can you try ssh'ing with X11 forwarding to rocks7 (i.e. ssh -X user@rocks7)
from a different machine, and then try your srun --x11 command?
Sean
--
Sean Crosby
Senior DevOpsH
Hi Mahmood,
Are you physically logged into rocks7? Or are you connecting via SSH? $DISPLAY
= :1 kind of means that you are physically logged into the machine
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of
Hi Mahmood,
To get native X11 working with SLURM, we had to add this config to sshd_config
on the login node (your rocks7 host)
X11UseLocalhost no
You'll then need to restart sshd
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Comp
Hi Eric,
Look at partition QOS - https://slurm.schedmd.com/SLUG15/Partition_QOS.pdf
The QoS options are MaxJobsPerUser and MaxSubmitPerUser (and also PerAccount
versions)
Sean
--
Sean Crosby
Senior DevOpsHPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP
Hi Alex,
What's the actual content of your gres.conf file? Seems to me that you have
a trailing comma after the location of the nvidia device
Our gres.conf has
NodeName=gpuhost[001-077] Name=gpu Type=p100 File=/dev/nvidia0
Cores=0,2,4,6,8,10,12,14,16,18,20,22
NodeName=gpuhost[001-077] Name=gpu T
Hi,
When a user requests all of the GPUs on a system, but less than the total
number of CPUs, the CPU bindings aren't ideal
[root@host ~]# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_3 mlx5_1 mlx5_2 mlx5_0 CPU Affinity
GPU0 X PHB SYS SYS SYS PHB SYS PHB 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16