Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-10 Thread Ole Holm Nielsen
On 3/10/21 12:06 PM, Reuti wrote: On 09.03.2021 at 13:37, Marcus Boden wrote: Then I have good news for you! There is the --delimiter option: https://slurm.schedmd.com/sacct.html#OPT_delimiter= Aha, perfect – thx. Maybe it should be noted in the man page for the "-p"/"-P". Good idea. I c
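A minimal sketch of how the --delimiter option mentioned above might be used together with -P to get untruncated, easily split fields. The helper name split_fields, the ";" delimiter and the sample line are assumptions for illustration only; the sacct call itself is shown as a comment and is not executed here.

```shell
#!/usr/bin/env bash
# Split one line of "sacct -P --delimiter=<delim>" style output into fields,
# printing one field per line. split_fields is a hypothetical helper name.
# Usage: split_fields <delimiter> <line>
split_fields() {
    local delim=$1 line=$2
    # index()/substr() treat the delimiter literally, so "|" needs no escaping
    awk -v d="$delim" -v s="$line" 'BEGIN {
        while ((i = index(s, d)) > 0) {
            print substr(s, 1, i - 1)
            s = substr(s, i + length(d))
        }
        print s
    }'
}

# On a real cluster one might run (not executed here):
#   sacct -P --delimiter=";" --format=JobID,JobName,State
sample='12345;my_long_job_name;COMPLETED'   # hypothetical sample line
split_fields ';' "$sample"
```

With -P the fields are not padded or cut to a column width, which is why no field length needs to be specified.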

Re: [slurm-users] [EXT] slurmctld error

2021-04-05 Thread Ole Holm Nielsen
Hi Ioannis, On 06-04-2021 07:56, Ioannis Botsis wrote: slurmctld is active and running, but on system reboot it doesn’t start automatically... I have to start it manually. Maybe you will find my Slurm Wiki pages of use for setting up your Slurm system: https://wiki.fysik.dtu.dk/niflheim/SLURM Fo

[slurm-users] Updated "pestat" tool for printing Slurm nodes status with 1 line per node including job info

2021-04-06 Thread Ole Holm Nielsen
ter options. If you use pestat, could you kindly download the latest master version and test it on your system? The output of "squeue -O" and "sinfo -O" can be challenging to parse correctly, so if you find a bug in pestat, please open an issue on GitHub or send E-mail to me.

Re: [slurm-users] slurmrestd configuration

2021-04-08 Thread Ole Holm Nielsen
On 4/8/21 9:50 AM, Simone Riggi wrote: I am writing to you about how to properly set up slurmrestd. ... 2) Installed slurm with: rpmbuild -ta slurm-20.11.5.tar.bz2 --with mysql --with slurmrestd --with jwt I don't see this "--with jwt" in the slurm.spec file: [slurm-20.11.5]# grep "# --with" slurm.s

Re: [slurm-users] derived counters

2021-04-11 Thread Ole Holm Nielsen
On 4/11/21 6:17 PM, Heckes, Frank wrote: Sorry, if this has been asked and answered before. Has someone created a script/SQL query, or can someone provide a combination of command-line flags, to create a ‘report’ for: I'm not sure my Slurm tools do what you want, but maybe you can get partial answ

Re: [slurm-users] derived counters

2021-04-12 Thread Ole Holm Nielsen
Hi Frank, On 4/12/21 9:53 AM, Heckes, Frank wrote: Hello Ole, many thanks for sharing your scripts, they cover most of the topics I was looking for. (my apologies, I noticed them already, but didn't check them carefully enough). The scripts are very cleanly coded and documented. Great work. Th

Re: [slurm-users] Why does Slurm kill one particular user's jobs after a few seconds?

2021-04-15 Thread Ole Holm Nielsen
Hi Thomas, I wonder if your problem is related to that reported in this list thread? https://lists.schedmd.com/pipermail/slurm-users/2021-April/007107.html You could try to restart the slurmctld service, and also make sure your configuration (slurm.conf etc.) has been pushed correctly to the sl

Re: [slurm-users] derived counters

2021-04-16 Thread Ole Holm Nielsen
Hi Jürgen, On 4/13/21 6:29 PM, Juergen Salk wrote: * Heckes, Frank [210413 12:04]: This results from a management question: how long do jobs have to wait (in s, min, h, day) before they get executed, and how many jobs are waiting (are queued) for each partition in a certain time interval? The f

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ole Holm Nielsen
d by the reservation. I'm thinking of a reservation something like this: scontrol create reservation starttime=... duration=12:00:00 ReservationName=migrate_physics nodes=ALL Accounts=-physics Would this work as expected? Best regards, Ole On 16/04/2021 14.23, Ole Holm Nielsen wrote: I nee

[slurm-users] Slurm reservation for migrating user home directories

2021-04-16 Thread Ole Holm Nielsen
ion I can rsync the home directories from the old NFS server to the new NFS server and update the NFS automounter links. Question: Does anyone have experiences with this type of scenario? Any good ideas or suggestions for other methods for data migration? Thanks, Ole -- Ole Holm Nielse

Re: [slurm-users] configless in Slurm, can not find the ip of ctld

2021-04-19 Thread Ole Holm Nielsen
Hi wenxia...@126.com, What is your full DNS domain name, and is /etc/resolv.conf consistent with your DNS? It seems to me that your DNS server is named "slurmctld-source": NS slurmctld-source. so you may have an error in the DNS setup. The DNS SRV record can be looked up by: $ host -t S

Re: [slurm-users] In high availability scenario, what is the best way to synchronize state files with scontrol takeover command?

2021-04-19 Thread Ole Holm Nielsen
Hi wenxia...@126.com, I think it is safer to get some experience with Slurm *without* initially using a High Availability setup for the slurmctld server. I highly recommend that you study the SchedMD presentations available on the page https://slurm.schedmd.com/publications.html. In particular

Re: [slurm-users] What is an easy way to prevent users run programs on the master/login node.

2021-04-24 Thread Ole Holm Nielsen
On 24-04-2021 04:37, Cristóbal Navarro wrote: Hi Community, I have a set of users still not so familiar with slurm, and yesterday they bypassed srun/sbatch and just ran their CPU program directly on the head/login node thinking it would still run on the compute node. I am aware that I will nee

Re: [slurm-users] Slurm reservation for migrating user home directories

2021-04-27 Thread Ole Holm Nielsen
On 4/16/21 4:21 PM, Ole Holm Nielsen wrote: I'm thinking of a reservation something like this: scontrol create reservation starttime=...  duration=12:00:00 ReservationName=migrate_physics nodes=ALL Accounts=-physics For the record: The idea of creating a Slurm reservation for excl

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-04-28 Thread Ole Holm Nielsen
On 4/28/21 2:48 AM, Sid Young wrote: I use SaltStack to push out the slurm.conf file to all nodes and do a "scontrol reconfigure" of the slurmd, this makes management much easier across the cluster. You can also do service restarts from one point etc. Avoid NFS mounts for the config, if the mou

Re: [slurm-users] [External] slurmd -C vs lscpu - which do I use to populate slurm.conf?

2021-04-28 Thread Ole Holm Nielsen
On 4/29/21 1:06 AM, Michael Robbert wrote: I think that you want to use the output of slurmd -C, but if that isn’t telling you the truth then you may not have built slurm with the correct libraries. I believe that you need to build with hwloc in order to get the most accurate details of the CPU

Re: [slurm-users] [External] Re: PropagateResourceLimits

2021-04-29 Thread Ole Holm Nielsen
On 29-04-2021 18:54, Ryan Novosielski wrote: It may not for specifically PropagateResourceLimits – as I said, the docs are a little sparse on the “how” this actually works – but you’re not correct that PAM doesn’t come into play re: user jobs. If you have “UsePam = 1” set, and have an /etc/pam

Re: [slurm-users] Questions about adding new nodes to Slurm

2021-05-04 Thread Ole Holm Nielsen
The task of adding or removing nodes from Slurm is well documented and discussed in SchedMD presentations, please see my Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes /Ole On 04-05-2021 14:47, Tina Friedrich wrote: Not sure if that's changed but aren't there cases wh

Re: [slurm-users] Cluster usage, filtered by partition

2021-05-11 Thread Ole Holm Nielsen
On 5/11/21 11:06 AM, Diego Zuccato wrote: Is it possible to extract a "partition usage summary", like the one generated by "sreport cluster usage" but limited to a single partition (or a partition set)? I believe that sreport can't make per-partition reports. Alternatively, is there some reco

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-14 Thread Ole Holm Nielsen
On 14-05-2021 08:52, Diego Zuccato wrote: On 14/05/2021 08:19, Christopher Samuel wrote: sreport -t percent -T ALL cluster utilization "sreport: fatal: No valid TRES given" :( This works correctly on our cluster: $ sreport -t percent -T ALL cluster utilization

Re: [slurm-users] Determining Cluster Usage Rate

2021-05-17 Thread Ole Holm Nielsen
On 5/17/21 8:59 AM, Diego Zuccato wrote: On 15/05/21 00:43, Christopher Samuel wrote: It just doesn't recognize 'ALL'. It works if I specify the resources. That's odd, what does this say? sreport --version slurm-wlm 18.08.5-2 That's the package from Debian stable (we don't have the manpo

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-21 Thread Ole Holm Nielsen
Hi Loris, I don't know if this would solve your problem, but I think that node SSH keys should be gathered and distributed. See my notes in https://wiki.fysik.dtu.dk/niflheim/SLURM#ssh-keys-for-password-less-access-to-cluster-nodes /Ole On 21-05-2021 14:53, Loris Bennett wrote: Hi, We hav

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Ole Holm Nielsen
Hi Loris, I think you need, as pointed out by others, either of: * SSH keys, see https://wiki.fysik.dtu.dk/niflheim/SLURM#ssh-keys-for-password-less-access-to-cluster-nodes * SSH host-based authentication, see https://wiki.fysik.dtu.dk/niflheim/SLURM#host-based-authentication /Ole On 5/25/

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Ole Holm Nielsen
On 25-05-2021 19:03, Patrick Goetz wrote: On 5/25/21 11:07 AM, Loris Bennett wrote: PS Am I wrong to be surprised that this is something one needs to roll oneself?  It seems to me that most clusters would want to implement something similar.  Is that incorrect?  If not, are people doing somethin

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Ole Holm Nielsen
On 25-05-2021 18:07, Loris Bennett wrote: PS Am I wrong to be surprised that this is something one needs to roll oneself? It seems to me that most clusters would want to implement something similar. Is that incorrect? If not, are people doing something else? Or did some vendor setting things

Re: [slurm-users] Upgrading slurm - can I do it while jobs running?

2021-05-26 Thread Ole Holm Nielsen
On 26-05-2021 20:23, Will Dennis wrote: About to embark on my first Slurm upgrade (building from source now, into a versioned path /opt/slurm// which is then symlinked to /opt/slurm/current/ for the “in-use” one…) This is a new cluster, running 20.11.5 (which we now know has a CVE that was fixe

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-26 Thread Ole Holm Nielsen
Hi Loris, On 5/27/21 8:19 AM, Loris Bennett wrote: Regarding keys vs. host-based SSH, I see that host-based would be more elegant, but would involve more configuration. What exactly are the simplification gains you see? I just have a single cluster and naively I would think dropping a script in

Re: [slurm-users] Building SLURM with X11 support

2021-05-27 Thread Ole Holm Nielsen
On 5/27/21 2:07 PM, Thekla Loizou wrote: I am trying to use X11 forwarding in SLURM with no success. We are installing SLURM using RPMs that we generate with the command "rpmbuild -ta slurm*.tar.bz2" as per the documentation. I am currently working with SLURM version 20.11.7-1. What I am mis

Re: [slurm-users] Parent accounts

2021-05-28 Thread Ole Holm Nielsen
Hi Stefan, On 5/28/21 3:31 PM, Stefan Staeglich wrote: for our monitoring system I want to query the account hierarchy. Is there a better approach than to parse the output of sacctmgr list account withasso -nP One approach is to use the Slurm sreport tool which displays the account hierarchy

Re: [slurm-users] Parent accounts

2021-05-31 Thread Ole Holm Nielsen
Hi Stefan, On 5/28/21 3:31 PM, Stefan Staeglich wrote: for our monitoring system I want to query the account hierarchy. Is there a better approach than to parse the output of sacctmgr list account withasso -nP ? Something like sacctmgr list account parent=bla withasso -nP doesn't work. He

Re: [slurm-users] Slurm stats in JSON format

2021-06-07 Thread Ole Holm Nielsen
On 6/8/21 12:27 AM, Sid Young wrote: Is there a tool that will extract the job counts in JSON format? Such as #running, #pending, #onhold, etc. I am trying to build some custom dashboards for our new cluster, and this would be a really useful set of metrics to gather and display. We have

Re: [slurm-users] Information about finished jobs

2021-06-13 Thread Ole Holm Nielsen
On 6/14/21 8:26 AM, Gestió Servidors wrote: How can I get all information about a finished job in the same way as “scontrol show jobid=” when job is pending or running? Some minutes after job completion, you can only get the information which is stored in the Slurm database. My script "showj

Re: [slurm-users] Information about finished jobs

2021-06-14 Thread Ole Holm Nielsen
On 6/14/21 9:33 AM, Arthur Gilly wrote: A related question, on my setup, scontrol show job displays the standard output, standard error redirections as well as the wd, whereas this info is lost after completion when sacct is required. Is this something that's configurable so that this info is pre

Re: [slurm-users] monitor draining/drain nodes

2021-06-14 Thread Ole Holm Nielsen
On 6/14/21 7:50 AM, Marcus Boden wrote: Slurm provides the strigger[1] utility for that. You can set it up to automatically send mails when nodes go into drain. I provide some Slurm triggers examples in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/triggers On 12.06.21 22:29, Rodr

Re: [slurm-users] Information about finished jobs

2021-06-15 Thread Ole Holm Nielsen
On 6/15/21 12:07 PM, Peter Kjellström wrote: On Mon, 14 Jun 2021 09:33:02 +0200 (CEST) Arthur Gilly wrote: Hi all, A related question, on my setup, scontrol show job displays the standard output, standard error redirections as well as the wd, whereas this info is lost after completion when sa

[slurm-users] Updated "showuserjobs" tool for summaries of Slurm node and batch job status

2021-06-28 Thread Ole Holm Nielsen
github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserjobs Best regards, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

2021-07-01 Thread Ole Holm Nielsen
flheim/Slurm_installation#upgrading-slurm Best regards, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

[slurm-users] Configless Slurm: DNS SRV record does not work without FQDN on EL8 systems

2021-07-12 Thread Ole Holm Nielsen
nd FC34) as regards the lookup of SRV records? This issue is tracked in Slurm bug https://bugs.schedmd.com/show_bug.cgi?id=11878#c2 Thanks, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Minimum requirements for Slurm daemons?

2021-07-12 Thread Ole Holm Nielsen
On 12-07-2021 20:17, Heitor wrote: Hello, I'm trying to find the minimum requirements (mainly CPU and RAM) for the slurmctld, slurmdbd, and slurmrestd daemons, but I did not find them in the docs. Maybe I missed some page? SchedMD recommends that the slurmctld server should have only a few, but

Re: [slurm-users] problem building pam_slurm_adopt

2021-07-14 Thread Ole Holm Nielsen
For CentOS, the list of all prerequisites for building Slurm is here: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-prerequisites /Ole On 7/14/21 11:51 AM, Sean Crosby wrote: Hi Mike, To build pam_slurm_adopt, you need the pam-devel package installed on the node you're buildin

Re: [slurm-users] Minimum requirements for Slurm daemons?

2021-07-14 Thread Ole Holm Nielsen
On 7/14/21 3:26 PM, Heitor wrote: On Mon, 12 Jul 2021 21:00:45 +0200 Ole Holm Nielsen wrote: SchedMD recommends that the slurmctld server should have only a few, but very fast CPU cores, in order to ensure the best responsiveness. The database server should preferably run on a physical

Re: [slurm-users] 4 sockets but "

2021-07-20 Thread Ole Holm Nielsen
Hi Diego, The Xeon Platinum 8268 is a 24-core CPU: https://ark.intel.com/content/www/us/en/ark/products/192481/intel-xeon-platinum-8268-processor-35-75m-cache-2-90-ghz.html Questions: 1. So you have 4 physical sockets in each node? 2. Did you define a Sub NUMA Cluster (SNC) BIOS setting? Then

Re: [slurm-users] 4 sockets but "

2021-07-20 Thread Ole Holm Nielsen
Hi Diego, 2. Did you define a Sub NUMA Cluster (SNC) BIOS setting?  Then each physical socket would show up as two sockets (memory controllers), for a total of 8 "sockets" in your 4-socket system. I don't think so. Unless that's the default, I didn't change anything in the BIOS. Just checked t

Re: [slurm-users] 4 sockets but "

2021-07-21 Thread Ole Holm Nielsen
Hi Diego, On 21-07-2021 11:56, Diego Zuccato wrote: I suspended testing config changes to update another machine. In the last test I added "CPUs=192" to the node definition, restarted slurmctld and nothing changed. When I returned, I checked again and slurm reported 192 CPUs! Magic? I now remo

Re: [slurm-users] 4 sockets but "

2021-07-22 Thread Ole Holm Nielsen
Hi Diego, On 7/23/21 8:16 AM, Diego Zuccato wrote: The Configless Slurm (https://slurm.schedmd.com/configless_slurm.html) from 20.02 makes distribution of slurm.conf really simple. Eager to see it in Debian :) IMHO, there ought to be a community effort to provide up-to-date Slurm packages fo

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
Hi Loris, On 7/23/21 9:05 AM, Loris Bennett wrote: We use both Zabbix and pestat. Zabbix gives us general information on the state of the nodes and file systems, and we have added some Slurm metrics, such as number of jobs pending, amount of memory pending, number of GPUs pending, etc. This ha

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
Hi Diego, On 7/23/21 12:36 PM, Diego Zuccato wrote: I believe that slurmd reports the 15 minute CPU load average to the slurmctld, only.  So you got this information already. Yup. It's just unexpected: if you don't know, you run pestat and see that an idle node does have a very high load :) My

Re: [slurm-users] slumctld don't start at boot

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 12:29 PM, Riccardo Sucapane wrote: I am using Slurm as a workload manager on a system with a master and 3 nodes. The operating system used is the recent Rocky Linux 8.4, while for Slurm, version 20.11.8 from the EPEL repository is used. Everything works correctly and when the syst

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 12:43 PM, Ole Holm Nielsen wrote: On 7/23/21 12:36 PM, Diego Zuccato wrote: I believe that slurmd reports the 15 minute CPU load average to the slurmctld, only.  So you got this information already. Yup. It's just unexpected: if you don't know, you run pestat and see th

Re: [slurm-users] slumctld don't start at boot

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:00 PM, Diego Zuccato wrote: We answered in parallel :) I usually prefer to avoid modifying system-managed files because system updates could reset 'em. Since systemd allows overrides, I chose to use 'em :) I agree with you! The permanent fix will change those Systemd files in 2

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:07 PM, Diego Zuccato wrote: Well, Slurm reports the 15-minute load average.  I guess users will have to learn that, because we can't print help information every time. They'd probably omit reading it anyway... Actually, I found a bit of unused space below the CPUload heading, so I

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:15 PM, Ole Holm Nielsen wrote: On 7/23/21 1:07 PM, Diego Zuccato wrote: Well, Slurm reports the 15-minute load average.  I guess users will have to learn that, because we can't print help information every time. They'd probably omit reading it anyway... Actually, I foun

Re: [slurm-users] 4 sockets but "

2021-07-23 Thread Ole Holm Nielsen
On 7/23/21 1:24 PM, Diego Zuccato wrote: On 23/07/2021 13:15, Ole Holm Nielsen wrote: But it's not showing jobIDs nor users :( That is really strange! The pestat obtains username and jobid from the squeue command. Do you get this information from "squeue -t running"

Re: [slurm-users] History of pending jobs

2021-07-30 Thread Ole Holm Nielsen
On 30-07-2021 20:42, Glenn (Gedaliah) Wolosh wrote: I'm interested in getting an idea of how long jobs were pending in a particular partition. Is there any magic to sreport or sacct that can generate this info? I could also use something like: "sreport cluster utilization" broken down by partitio

Re: [slurm-users] Submit time instead of Start time for sacct

2021-08-09 Thread Ole Holm Nielsen
On 09-08-2021 17:24, Amjad Syed wrote: I am trying to filter the number of jobs submitted in a month, not jobs that started. If I use sacct -S 2021-07-07 -E 2021-08-07 --format=jobID,Submit -D JobID Submit --- 7274903 2021-06-09T11:30:46 I get jobs that were submit

Re: [slurm-users] sacct output in tabular form

2021-08-25 Thread Ole Holm Nielsen
Hi Sven, On 8/25/21 7:41 AM, Sternberger, Sven wrote: this is a simple wrapper for sacct which prints the output from sacct as a table. So you can make a "sacctml -j foo --long" even without two 8k displays ;-) This script works nicely, thanks! However, instead of an extremely wide display on

Re: [slurm-users] Slurm does not start after (stupid) upgrade from 16.05.9 to 20.11.7

2021-08-25 Thread Ole Holm Nielsen
On 8/25/21 10:48 AM, Julien Tailleur wrote: We have been running a computing cluster using slurm since 2016, that I installed back then, with some help from others. I was pretty late on upgrades and decided to upgrade the cluster up to debian Bullseye, which runs slurm 20.11.7, starting from st

Re: [slurm-users] free resources

2021-08-26 Thread Ole Holm Nielsen
ndocumented! Do you have documentation for it? /Ole -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: 26 August 2021 12:41 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] free resources On 26-08-2021 08:01, Pankaj Dorlikar wrote: We are using slurm-20.11.

Re: [slurm-users] free resources

2021-08-26 Thread Ole Holm Nielsen
On 26-08-2021 08:01, Pankaj Dorlikar wrote: We are using slurm-20.11.7 on an Ubuntu system with GPUs. What is the equivalent of “showbf –S” in maui or any command in slurm for checking the free resources? Maybe "sinfo -t idle"? The showbf manual doesn't document any -S flag: https://docs.ada

Re: [slurm-users] Secondary Unix group id of users not being issued in interactive srun command

2021-09-21 Thread Ole Holm Nielsen
On 9/21/21 9:11 AM, Amjad Syed wrote: We have users who have a secondary Unix group defined on our login nodes. [vas20xhu@login01 ~]$ groups BIO_pg BIO_AFMAKAY_LAB_USERS But when we run an interactive job and go to a compute node, the user does not have the secondary group BIO_AFMAKAY_LAB_USERS vas

Re: [slurm-users] Error when upgrading to 21.08.1

2021-09-23 Thread Ole Holm Nielsen
On 23-09-2021 16:01, Hoot Thompson wrote: In upgrading to 21.08.1, slurmctld status reports: Sep 23 13:49:52 ip-10-10-7-17 systemd[1]: Started Slurm controller daemon. Sep 23 13:49:52 ip-10-10-7-17 slurmctld[1323]: fatal: Unable to find plugin: serializer/json Sep 23 13:49:52 ip-10-10-7-17 s

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-05 Thread Ole Holm Nielsen
On 10/5/21 8:05 AM, Diego Zuccato wrote: I already tried multiple times, both RESUME and IDLE, and it didn't work: it just returned to "IDLE+DRAIN" with 'Reason="low realmem"'. :( I just tried again (after an unplanned shutdown of the frontend) and it What is a "frontend"? Do you mean the slu

Re: [slurm-users] Equivalent command for showbf command of maui in slurm

2021-10-06 Thread Ole Holm Nielsen
On 06-10-2021 18:42, Pankaj Dorlikar wrote: We would like to know the free / available resources (CPUs and GPUs) in slurm. In torque/maui, the showbf –S command gives similar output. What command / command line should be used to check available / free resources including node number and its corres

Re: [slurm-users] Equivalent command for showbf command of maui in slurm

2021-10-06 Thread Ole Holm Nielsen
TY comp39.node 16 64322 1 64322 3:16:11:01 . . and so on On October 6, 2021 at 11:42 PM Ole Holm Nielsen wrote: > On 06-10-2021 18:42, Pankaj Dorlikar wrote: > > We would like to know the free / available resources (cpu and GPUs) in > > slurm. In torque/maui, showbf –S comman

Re: [slurm-users] Equivalent command for showbf command of maui in slurm

2021-10-06 Thread Ole Holm Nielsen
al Message----- From: Ole Holm Nielsen Sent: 07 October 2021 11:34 To: Slurm User Community List Cc: pankajd Subject: Re: [slurm-users] Equivalent command for showbf command of maui in slurm The "showbf --help" does not explain the meaning of the output of "showbf -S"

Re: [slurm-users] job is pending but resources are available

2021-10-13 Thread Ole Holm Nielsen
On 10/13/21 9:59 AM, Adam Xu wrote: On 2021/10/13 9:22, Brian Andrus wrote: Something is very odd when you have the node reporting: RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1 What do you get when you run ‘slurmd -C’ on the node? # slurmd -C NodeName=apollo CPUs=36 Boards=1 Socket

Re: [slurm-users] How to look for free nodes of a certain constraint efficiently

2021-10-14 Thread Ole Holm Nielsen
Hi Matt, How about this sinfo command: $ sinfo -O NodeList:30,Features:30,StateLong NODELIST AVAIL_FEATURES STATE i023 xeon2650v2,infiniband,xeon16 draining@ i[004-022,024-050] xeon2650v2,infiniband,xeon16 allocated
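As a rough sketch of how the sinfo output above could be filtered for free nodes with a given feature: the helper name nodes_with, the "idle" state and the sample rows are assumptions for illustration, and the filter expects whitespace between the three columns (a node list wider than the 30-character column would break that assumption).

```shell
#!/usr/bin/env bash
# Print the node lists whose feature list contains $1 and whose state is $2.
# Reads whitespace-separated "NODELIST FEATURES STATE" rows on stdin.
# nodes_with is a hypothetical helper, not part of Slurm.
nodes_with() {
    awk -v feat="$1" -v state="$2" '
        $1 == "NODELIST" { next }            # skip the header row
        $3 == state {
            n = split($2, f, ",")            # features are comma-separated
            for (i = 1; i <= n; i++)
                if (f[i] == feat) { print $1; break }
        }'
}

# On a real cluster one might run (not executed here):
#   sinfo -O NodeList:30,Features:30,StateLong | nodes_with xeon16 idle
# Hypothetical sample rows in the same layout:
printf '%s\n' \
    'NODELIST AVAIL_FEATURES STATE' \
    'i023 xeon2650v2,infiniband,xeon16 draining@' \
    'i[051-052] xeon2650v2,infiniband,xeon16 idle' |
    nodes_with xeon16 idle
```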

Re: [slurm-users] Missing data in sreport for a time period in slurm

2021-10-18 Thread Ole Holm Nielsen
On 10/18/21 12:41 PM, mshubham wrote: Dear all, I am facing an issue in Slurm (v19.05.1) in which data from 26 May 2020 to 14 Sept 2021 is missing in sreport, but the same data is present through the sacct command. It was working fine a few days ago. Right now, we have to get data utilization fro

Re: [slurm-users] Additional feature request/ need of how to

2021-10-22 Thread Ole Holm Nielsen
On 22-10-2021 14:45, BELLENCONTRE, FREDERIC wrote: I want to know, on each node, the job that will terminate the latest and get the planned date of completion (it corresponds to the date when we could pass from draining status to drained status if no new jobs and no premature end) -eventually f

Re: [slurm-users] Possible to get cluster utilization by partition?

2021-11-05 Thread Ole Holm Nielsen
Hi Dave, On 11/4/21 21:47, Chin,David wrote: I am running Slurm 20.02.7. I would like to generate cluster utilization report based on the billing TRES, but separated by partition. I can get full cluster utilization using:     sreport cluster utilization -T billing start=2021-01-01 end=2021-06

Re: [slurm-users] Wrong hwloc detected?

2021-11-05 Thread Ole Holm Nielsen
On 11/5/21 12:47, Diego Zuccato wrote: Some users are reporting this error: slurmstepd-str957-mtx-01: error: hwloc_get_obj_below_by_type() failing, task/affinity plugin may be required to address bug fixed in HWLOC version 1.11.5 slurmstepd-str957-mtx-01: error: task[0] unable to set taskset

Re: [slurm-users] Wrong hwloc detected?

2021-11-07 Thread Ole Holm Nielsen
/Slurm_installation#install-prerequisites /Ole On 05-11-2021 15:38, Diego Zuccato wrote: They aren't using modules so it must be something system-wide :( But not all jobs are impacted. And it seems it's a bit random (doesn't happen always). I'm out of ideas, currently :( Il 05/11/

Re: [slurm-users] How to get an estimate of job completion for planned maintenance?

2021-11-09 Thread Ole Holm Nielsen
On 11/9/21 13:55, Marcus Wagner wrote: I have written a script which loops through all running jobs to tell me when a job ends on a specific node. This can also be done for all nodes. The output would be for the longest job e.g.: ncm0430 -> 2021-12-04T15:48:35 Nonetheless, we

Re: [slurm-users] sreport question when specifying partitions=

2021-11-10 Thread Ole Holm Nielsen
On 10-11-2021 16:56, Bill Wichser wrote: I can't seem to figure out how to do a query against a partition. sreport cluster AccountUtilizationByUser user=bill cluster=della, no issues.  Works as expected. sreport cluster AccountUtilizationByUser Partitions=cpu cluster=della gives me Unknown

Re: [slurm-users] enable_configless, srun and DNS vs. hosts file

2021-11-15 Thread Ole Holm Nielsen
On 12-11-2021 15:37, Paul Brunk wrote: We run configless. If we add a node to slurm.conf and don't restart slurmd on our submit nodes, then attempts to submit to that new node will get the error you saw. Restarting slurmd on the submit node fixes it. This is the documented behavior (adding

Re: [slurm-users] Changing DefaultAccount for user

2021-11-23 Thread Ole Holm Nielsen
Hi Loris, First you add the user to one or more other Slurm accounts, something like this: $ sacctmgr add user xxx account=yyy Then you can redefine the user's default account: $ sacctmgr modify user where name=xxx set defaultaccount=yyy Here is an example from our cluster where the user is
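The two steps above can be sketched as a small dry-run helper that only prints the sacctmgr commands from the message, so they can be reviewed before being run by hand. The function name change_default_account and the user and account names are hypothetical.

```shell
#!/usr/bin/env bash
# Print (do not run) the two sacctmgr commands described above:
# first add the user to the account, then make it the default account.
# change_default_account is a hypothetical helper name.
change_default_account() {
    local user=$1 account=$2
    echo "sacctmgr add user $user account=$account"
    echo "sacctmgr modify user where name=$user set defaultaccount=$account"
}

change_default_account alice physics   # hypothetical user and account
```

The order matters in this sketch because the message adds the user to the account first and only then redefines the default.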

Re: [slurm-users] A Slurm topological scheduling question

2021-12-07 Thread Ole Holm Nielsen
Hi David, The topology.conf file groups nodes into sets such that parallel jobs will not be scheduled by Slurm across disjoint sets. Even though the topology.conf man-page refers to network switches, it's really about topology rather than network. You may use fake (non-existing) switch name

Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-13 Thread Ole Holm Nielsen
Hi Loris, Thanks for the note. I need to figure out the correct variable width printf() options. I'm working on an update... Best regards, Ole On 12/13/21 13:56, Loris Bennett wrote: Hi Ole, Ole Holm Nielsen writes: Hi Slurm users, I have updated the "pestat" tool for

Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-13 Thread Ole Holm Nielsen
Hi Loris, I fixed errors in the hostnamelength calculation and formatting. Could you grab the latest pestat and test it? Thanks, Ole On 12/13/21 13:56, Loris Bennett wrote: Hi Ole, Ole Holm Nielsen writes: Hi Slurm users, I have updated the "pestat" tool for printing Slurm no

Re: [slurm-users] Add new compute node without interruption

2021-12-13 Thread Ole Holm Nielsen
On 13-12-2021 18:55, Microbiome Studio wrote: We would like to know if it is planned to add this feature: Adding new compute node without interruption. Indeed, currently we have to stop computation, declare new nodes and resume the computation. Such a feature would be really helpful with the growth of

Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-13 Thread Ole Holm Nielsen
12/13/21 15:31, Loris Bennett wrote: Hi Ole, The new version looks good to me. Cheers, Loris Ole Holm Nielsen writes: Hi Loris, I fixed errors in the hostnamelength calculation and formatting. Could you grab the latest pestat and test it? Thanks, Ole On 12/13/21 13:56, Loris Bennett wrote

Re: [slurm-users] Updated "pestat" tool for printing Slurm nodes status including GRES/GPU

2021-12-14 Thread Ole Holm Nielsen
21 14:16, Loris Bennett wrote: Hi Ole, Ole Holm Nielsen writes: The latest pestat version now adds a red color highlight if the GRES GPU is the (null) value. We use this to highlight jobs on GPU nodes which didn't request any GPU resources, thereby possibly wasting resources. Could you te

Re: [slurm-users] How to limit # of execution slots for a given node

2022-01-06 Thread Ole Holm Nielsen
Hi David, On 1/6/22 22:39, David Henkemeyer wrote: When my team used PBS, we had several nodes that had a TON of CPUs, so many, in fact, that we ended up setting np to a smaller value, in order to not starve the system of memory. What is the best way to do this with Slurm?  I tried modifying

Re: [slurm-users] memory per node default

2022-01-20 Thread Ole Holm Nielsen
On 1/20/22 22:22, Hoot Thompson wrote: How do you change the default memory per node from the current 1MB to something much higher? Thanks in advance. ubuntu@node:/shared$ sinfo -o "%20N%10c%10m%25f%10G " NODELIST CPUS MEMORY AVAIL_FEATURES GRES hpc-demand-dy-c5n18x 36 1 dynamic,c5n.18xlarge

Re: [slurm-users] memory per node default

2022-01-21 Thread Ole Holm Nielsen
On 1/21/22 10:05, Diego Zuccato wrote: Il 21/01/2022 07:51, Ole Holm Nielsen ha scritto: There's a nice command to run on any given node which tells you slurmd's view of the node: $ slurmd -C NodeName=i004 CPUs=16 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMem

Re: [slurm-users] List only available and up partitions

2022-01-26 Thread Ole Holm Nielsen
A similar tool is "showpartitions" from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/partitions /Ole On 1/27/22 00:13, mercan wrote: You can use the spart to list only partitions a user has access to that are in the 'UP' state (and with other limiting factors such as partition li

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-27 Thread Ole Holm Nielsen
Maybe my Slurm Wiki pages will help you get started: https://wiki.fysik.dtu.dk/niflheim/SLURM Best regards, Ole On 1/27/22 10:53, Nousheen wrote: I am installing slurm on Centos 7 following tutorial: https://www.slothparadise.com/how-to-install-slurm-on-centos-7-cluster/

Re: [slurm-users] systemctl enable slurmd.service Failed to execute operation: No such file or directory

2022-01-31 Thread Ole Holm Nielsen
Hi Nousheen, I again recommend that you follow the steps for installing Slurm on a CentOS 7 cluster: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation Maybe you will need to start the installation from scratch, but the steps are guaranteed to work if followed correctly. IHTH, Ole On 1/31/22

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Ole Holm Nielsen
Login nodes being down doesn't affect Slurm jobs at all (except if you run slurmctld/slurmdbd on the login node ;-) To stop new jobs from being scheduled for running, mark all partitions down. This is useful when recovering the cluster from a power or cooling downtime, for example. I wrote
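The advice to mark all partitions down can be sketched as a small dry-run helper. This is only an illustration, not the tool mentioned in the message: the function name `partition_state_cmds` and the sample partition names are hypothetical. It prints one `scontrol update PartitionName=... State=...` command per partition name read from standard input, so the commands can be reviewed before anything is changed.

```shell
#!/bin/sh
# Hypothetical helper: read partition names on stdin and PRINT (not run)
# one scontrol command per unique partition for the requested state.
partition_state_cmds() {
    state="$1"
    sort -u | while read -r part; do
        if [ -n "$part" ]; then
            echo "scontrol update PartitionName=$part State=$state"
        fi
    done
}

# Demo with made-up partition names (nothing is executed against Slurm):
printf 'xeon\ngpu\n' | partition_state_cmds DOWN

# On a real cluster (assumption: sinfo and scontrol are in PATH) one might run:
#   sinfo -h -o "%R" | partition_state_cmds DOWN | sh
# and later the same pipeline with UP to resume scheduling.
```

Printing the commands first keeps the sketch safe to try on a workstation without Slurm installed; piping the output to `sh` is a deliberate second step.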

Re: [slurm-users] Stopping new jobs but letting old ones end

2022-01-31 Thread Ole Holm Nielsen
One thing to be aware of when setting partition states to down: * Setting partition state=down will be reset if slurmctld is restarted. Read the slurmctld man-page under the -R parameter. So it's better not to restart slurmctld during the downtime. /Ole On 2/1/22 08:11, Ole Holm Ni

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-03 Thread Ole Holm Nielsen
On 03-02-2022 16:37, Nathan Smith wrote: Yes, we are running slurmdbd. We could arrange enough downtime to do an incremental upgrade of major versions as Brian Andrus suggested, at least on the slurmctld and slurmdbd systems. The slurmds I would just do a direct upgrade once the scheduler work

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Ole Holm Nielsen
On 04-02-2022 08:59, Bjørn-Helge Mevik wrote: Ole Holm Nielsen writes: As Brian Andrus said, you must upgrade Slurm by at most 2 major versions, and that includes slurmd's as well! Don't do a "direct upgrade" of slurmd by more than 2 versions! That should only be
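The rule of upgrading by at most 2 major versions per step can be illustrated with a small sketch. The function name `upgrade_path` is hypothetical, and the ordered list of major releases passed to it below is an assumption that should be checked against the official release notes before planning a real upgrade.

```shell
#!/bin/sh
# Hypothetical helper: upgrade_path FROM TO RELEASE...
# RELEASE... is an ordered list of major releases (supplied by the caller).
# Prints a path from FROM to TO that never skips more than 2 releases per step.
upgrade_path() {
    from="$1"; to="$2"; shift 2
    printf '%s\n' "$@" | awk -v from="$from" -v to="$to" '
        { rel[NR] = $1; if ($1 == from) i = NR; if ($1 == to) j = NR }
        END {
            # Fail if either release is unknown or the order is reversed.
            if (!i || !j || i > j) exit 1
            path = rel[i]
            while (i < j) {
                i = (i + 2 <= j) ? i + 2 : j   # advance by at most 2 releases
                path = path " -> " rel[i]
            }
            print path
        }'
}

# Assumed release list for illustration only:
upgrade_path 17.02 21.08 17.02 17.11 18.08 19.05 20.02 20.11 21.08
```

The same stepping applies to slurmd as well as slurmctld and slurmdbd, as the message stresses.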

Re: [slurm-users] Upgrade from 17.02.11 to 21.08.2 and state information

2022-02-04 Thread Ole Holm Nielsen
On 03-02-2022 21:59, Ryan Novosielski wrote: On Feb 3, 2022, at 2:55 PM, Ole Holm Nielsen wrote: On 03-02-2022 16:37, Nathan Smith wrote: Yes, we are running slurmdbd. We could arrange enough downtime to do an incremental upgrade of major versions as Brian Andrus suggested, at least on the

Re: [slurm-users] Make sacct show short job state codes?

2022-03-24 Thread Ole Holm Nielsen
Hi Chip, Use the sacct -p or --parsable option to get the complete output delimited by | /Ole On 3/24/22 14:12, Chip Seraphine wrote: I’m trying to shave a few columns off the output of some sacct output, and while it will happily accept the short codes (e.g. CA instead of CANCELLED) I ca

Re: [slurm-users] Make sacct show short job state codes?

2022-03-24 Thread Ole Holm Nielsen
Here is an example command for getting parseable output from sacct of all completed jobs during a specific period of time: $ sacct -p -X -a -S 032322 -E 032422 -o JobID,User,State -s ca,cd,f,to,pr,oom The fields are separated by | and can easily be parsed by awk. Example output: JobID|User|St
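As a rough sketch of the awk parsing mentioned above, the following counts jobs per state from `|`-delimited output with the columns JobID, User, State. The function name `count_states` and the sample rows are made up for illustration; real output from `sacct -p` also ends each line with a trailing `|`, which the sketch tolerates.

```shell
#!/bin/sh
# Hypothetical helper: count jobs per State in sacct -p style output
# (header line first, fields separated by "|", State in column 3).
count_states() {
    awk -F'|' 'NR > 1 && $3 != "" { n[$3]++ }
               END { for (s in n) print s, n[s] }' | sort
}

# Demo on made-up rows (no Slurm installation needed):
count_states <<'EOF'
JobID|User|State|
1001|alice|COMPLETED|
1002|bob|FAILED|
1003|alice|COMPLETED|
EOF

# On a real cluster the input would come from the sacct command above, e.g.:
#   sacct -p -X -a -S 032322 -E 032422 -o JobID,User,State | count_states
```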

Re: [slurm-users] sbatch doesn't output the jobid

2022-03-31 Thread Ole Holm Nielsen
On 3/31/22 11:25, GHui wrote: Sometimes when I run sbatch to submit a job, it doesn't output anything. But when I run squeue, the job is running. Because of this, if I submit many times, I lose track of the jobs I submitted. Which Slurm version do you run? Use "sinfo --version" to find out. Yo

Re: [slurm-users] Looking for examples of daily job reports

2022-04-18 Thread Ole Holm Nielsen
On 15-04-2022 19:13, Brian Andrus wrote: Not to steal his thunder, but Ole has done a great job with quite a few things. He has some job scripts at https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs I fully expect him to chime in and offer additional great advice. Thanks for the

Re: [slurm-users] CommunicationParameters=block_null_hash issue in 21.08.8

2022-05-05 Thread Ole Holm Nielsen
head You can add more -O options to get JobIDs etc., as long as you sort on the StartTime column (Slurm ISO 8601 timestamps[1] can simply be sorted in lexicographical order). I hope this helps. /Ole [1] https://en.wikipedia.org/wiki/ISO_8601 On 05.05.22 13:53, Ole Holm Nielsen wrote: J
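The remark that ISO 8601 timestamps sort correctly as plain text can be shown with a minimal sketch. The function name `earliest_start` and the sample lines are hypothetical; the only assumption about the input is that each line begins with a timestamp of the form YYYY-MM-DDTHH:MM:SS.

```shell
#!/bin/sh
# Hypothetical helper: return the line with the earliest leading timestamp.
# LC_ALL=C forces a byte-wise (lexicographic) sort, which matches
# chronological order for ISO 8601 timestamps of equal length.
earliest_start() {
    LC_ALL=C sort | head -n 1
}

# Demo on made-up "StartTime JobID" lines:
earliest_start <<'EOF'
2022-05-06T08:00:00 12
2022-05-05T23:59:59 11
2022-05-05T09:30:00 10
EOF

# On a real cluster (assumption: squeue is in PATH) the input might come from:
#   squeue -h -O StartTime,JobID | earliest_start
```

No date parsing is needed because the most significant field (the year) comes first and every field is zero-padded.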

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen
Hi Tina, On 5/5/22 14:54, Tina Friedrich wrote: Hi List, out of curiosity - I would assume that if running configless, one doesn't manually need to restart slurmd on the nodes if the config changes? That is correct. Just do "scontrol reconfig" on the slurmctld server. If all your slurmd's

Re: [slurm-users] SLURM: reconfig

2022-05-05 Thread Ole Holm Nielsen
On 5/5/22 15:53, Ward Poelmans wrote: Hi Steven, I think truly dynamic adding and removing of nodes is something that's on the roadmap for slurm 23.02? Yes, see slide 37 in https://slurm.schedmd.com/SLUG21/Roadmap.pdf from the Slurm publications site https://slurm.schedmd.com/publications.ht
