Re: [slurm-users] slurm sinfo format memory

2023-07-21 Thread Ole Holm Nielsen
Hi Arsene, On 7/20/23 18:24, Arsene Marian Alain wrote: > I would like to see the following information of my nodes "hostname, total > mem, free mem and cpus". So, I used  ‘sinfo -o "%8n %8m %8e %C"’ but in > the output it shows me the memory in MB like "190560" and I need it in GB > (without d
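
Note: sinfo reports %m and %e in megabytes, so one way to get gigabytes is to post-process the output. The following is only a minimal sketch in Python, assuming the four-column output of the `sinfo -o "%8n %8m %8e %C"` command quoted above; the function name `mem_mb_to_gb` is hypothetical, and rounding down to whole GB is an assumption about what is wanted.

```python
def mem_mb_to_gb(sinfo_output):
    """Convert the MB columns of 'sinfo -o "%8n %8m %8e %C"' output to whole GB.

    Returns a list of (hostname, total_gb, free_gb, cpus) tuples.
    free_gb is None when sinfo prints a non-numeric value such as N/A.
    """
    rows = []
    for line in sinfo_output.splitlines():
        fields = line.split()
        # Skip the header line and anything that is not a 4-column data row.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        host, total_mb, free_mb, cpus = fields
        total_gb = int(total_mb) // 1024  # assumption: round down to whole GB
        free_gb = int(free_mb) // 1024 if free_mb.isdigit() else None
        rows.append((host, total_gb, free_gb, cpus))
    return rows


if __name__ == "__main__":
    import subprocess

    # Requires a working Slurm client on the machine running this script.
    out = subprocess.run(
        ["sinfo", "-o", "%8n %8m %8e %C"],
        capture_output=True, text=True, check=True,
    ).stdout
    for host, total_gb, free_gb, cpus in mem_mb_to_gb(out):
        print(host, total_gb, free_gb, cpus)
```

With this, the "190560" from the example would be shown as 186.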

Re: [slurm-users] Is there any public scientific-workflow example that can be run through Slurm?

2023-08-18 Thread Ole Holm Nielsen
Hi Alper, On 18-08-2023 18:39, Alper Alimoglu wrote: In slurm we can build pipelines using [slurm dependencies][1], which allows us to run workflows. In my work, I am stuck at the point of finding a workflow that I can run using Slurm. As an example, I have to use a workflow benchmar

Re: [slurm-users] bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-29 Thread Ole Holm Nielsen
Hi Magnus, On 8/28/23 10:16, Hagdorn, Magnus Karl Moritz wrote: we recently enabled the energy gathering plugin using the IPMI gatherer with libfreeipmi. We are running the latest slurm 23.02.4 on rocky 8.5. We are getting sporadic buffer overflows in slurmd when it is trying to query the IPM

Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-29 Thread Ole Holm Nielsen
Hi Magnus, On 29-08-2023 13:56, Hagdorn, Magnus Karl Moritz wrote: I'm curious to learn about your energy gathering method: How do you extract node power via IPMI using FreeIPMI (or some other toolset), and how do you configure Slurm for this? We are using the SLURM plugin which is enabled

Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-30 Thread Ole Holm Nielsen
Hi Magnus, On 8/30/23 10:12, Hagdorn, Magnus Karl Moritz wrote: Yes, but can you share the details of which parameters you configure in this plugin so that you can extract node power?  This doesn't seem obvious to me. not much needs configuring. We have EnergyIPMIFrequency=10 EnergyIPMICalcAd

Re: [slurm-users] [ext] Re: bufferoverflow in slurmd with acct_gather_energy plugin

2023-08-30 Thread Ole Holm Nielsen
Hi Magnus, On 8/30/23 11:17, Hagdorn, Magnus Karl Moritz wrote: On Wed, 2023-08-30 at 10:38 +0200, Ole Holm Nielsen wrote: This is a very useful example!  I guess that you have also defined EnergyIPMIUsername and EnergyIPMIPassword in acct_gather.conf?  How is the EnergyIPMIPassword protected

Re: [slurm-users] help with canceling or deleteing a job

2023-09-19 Thread Ole Holm Nielsen
On 9/19/23 13:59, Felix wrote: Hello I have a job on my system which is running more than its time, more than 4 days. 1808851 debug  gridjob  atlas01 CG 4-00:00:19  1 awn-047 The job has state "CG" which means "Completing". The Completing status is explained in "man sinfo". T

Re: [slurm-users] help with canceling or deleteing a job

2023-09-20 Thread Ole Holm Nielsen
On 9/20/23 01:39, Feng Zhang wrote: Restarting the slurmd daemon of the compute node should work, if the node is still online and normal. Probably not. If the filesystem used by the job is hung, the node must probably be rebooted, and the filesystem must be checked. /Ole On Tue, Sep 19, 2

Re: [slurm-users] question about configuration in slurm.conf

2023-09-26 Thread Ole Holm Nielsen
On 9/26/23 14:50, Groner, Rob wrote: There's a builtin slurm command, I can't remember what it is and google is failing me, that will take a compacted list of nodenames and return their full names, and I'm PRETTY sure it will do the opposite as well (what you're asking for). It's probably sin
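
Note: in Slurm, `scontrol show hostnames <list>` expands a compacted node list and `scontrol show hostlist <names>` compresses one. To illustrate what the expansion does, here is a rough Python sketch; `expand_hostlist` is a hypothetical helper, not part of Slurm, and it assumes a single bracket group such as `node[01-03,07]` (no nested or multiple brackets, no comma-separated lists of expressions).

```python
import re


def expand_hostlist(expr):
    """Expand a simple compacted hostlist like 'node[01-03,07]'.

    Handles one bracket group only; a plain hostname is returned unchanged.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]([^\[\]]*)", expr)
    if not match:
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low            # a single number such as "07"
        width = len(low)              # keep zero padding, e.g. 01, 02, ...
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}{suffix}")
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("node[01-03,07]"))
```

On a real cluster the scontrol subcommands are the safer choice, since they understand the full hostlist syntax.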

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-28 Thread Ole Holm Nielsen
On 9/28/23 17:58, Groner, Rob wrote: There's 14 steps to upgrading slurm listed on their website, including shutting down and backing up the database.  So far we've only updated slurm during a downtime, and it's been a major version change, so we've taken all the steps indicated. We now want

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-29 Thread Ole Holm Nielsen
On 29-09-2023 17:33, Ryan Novosielski wrote: I’ll just say, we haven’t done an online/jobs running upgrade recently (in part because we know our database upgrade will take a long time, and we have some processes that rely on -M), but we have done it and it does work fine. So the paranoia isn’t

Re: [slurm-users] Slurm powersave

2023-10-05 Thread Ole Holm Nielsen
Hi Davide, On 10/4/23 23:03, Davide DelVento wrote: I'm experimenting with slurm powersave and I have several questions. I'm following the guidance from https://slurm.schedmd.com/power_save.html and the great presentation from our own https://slurm.s

Re: [slurm-users] Slurm powersave

2023-10-06 Thread Ole Holm Nielsen
Hi Davide, On 10/5/23 15:28, Davide DelVento wrote: IMHO, "pretending" to power down nodes defies the logic of the Slurm power_save plugin. And it is sure useless ;) But I was using the suggestion from https://slurm.schedmd.com/power_save.html

Re: [slurm-users] Slurm versions 23.02.6 and 22.05.10 are now available (CVE-2023-41914)

2023-10-13 Thread Ole Holm Nielsen
On 10/13/23 12:22, Taras Shapovalov wrote: Oh, does this mean that no one should use Slurm versions <= 21.08 any more? SchedMD recommends using the currently supported versions (currently 22.05 or 23.02). Next month 23.11 will be released and 22.05 will become unsupported. The question fo

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Ole Holm Nielsen
Hi Tim, I think the scontrol manual page explains the "scontrol reboot" function fairly well: reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>] {ALL|<NodeList>} Reboot the nodes in the system when they become idle using the RebootProgram as co

Re: [slurm-users] Change something in user's script using job_submit.lua plugin

2023-10-26 Thread Ole Holm Nielsen
Hi Paulo, Which Slurm version do you have, and did you set this in slurm.conf: JobSubmitPlugins=lua ? Perhaps you may find some useful information in this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#job-submit-plugins /Ole On 26-10-2023 19:07, Paulo Jose Braga

Re: [slurm-users] RES: Change something in user's script using job_submit.lua plugin

2023-10-27 Thread Ole Holm Nielsen
. Best regards, PUBLIC -Original Message- From: slurm-users On behalf of Ole Holm Nielsen Sent: Friday, 27 October 2023 03:31 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Change something in user's script using job_submit.lua plugin Hi Paulo, Which Slurm

[slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
fix or workaround? Of course we could remove the Infiniband check in Node Health Check (NHC), but that would not really be acceptable during operations. Thanks for sharing any insights, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
may need to do small adjustments, but it's pretty straight forward -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark, Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark E-mail: ole.h.niel...@fysik.dtu.dk Homepage: http://dcwww.fysik.dtu.d

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-30 Thread Ole Holm Nielsen
Hi Jens, Thanks for your feedback: On 30-10-2023 15:52, Jens Elkner wrote: Actually there is no need for such a script since /lib/systemd/systemd-networkd-wait-online should be able to handle it. It seems that systemd-networkd exists in Fedora FC38 Linux, but not in RHEL 8 and clones, AFAICT

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-10-31 Thread Ole Holm Nielsen
-users On Behalf Of Ole Holm Nielsen Sent: Monday, October 30, 2023 1:56 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

Re: [slurm-users] RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
cutes: ExecStart=/usr/bin/nm-online -s -q and this is causing our problems with Infiniband/OPA networks. This is the reason why we need Max's workaround wait-for-interfaces.service. /Ole -Original Message- From: slurm-users On behalf of Ole Holm Nielsen Sent: Tues

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
new jobs. We should configure NHC to make site-specific hardware and network checks, for example for Infiniband/OPA network or NVIDIA GPUs. Best regards, Ole On 11/1/23 09:44, Rémi Palancher wrote: Hi Ole, Le 30/10/2023 à 13:50, Ole Holm Nielsen a écrit : I'm fighting this strange scen

Re: [slurm-users] RES: RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
activating state visible with nmcli device and nmcli connection, startup is still pending. ***" PUBLIC -Original Message- From: slurm-users On behalf of Ole Holm Nielsen Sent: Wednesday, 1 November 2023 05:19 To: slurm-users@lists.schedmd.com Subject: Re:

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-02 Thread Ole Holm Nielsen
Hi Ward, Thanks a lot for the feedback! The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli command from the NetworkManager package. Can I ask you how you implement your script as a servic
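
Note: the sysfs probing mentioned above can be sketched in a few lines of Python. This is only an illustration and not the lbnl_hw.nhc code or Ward's script; the function name `infiniband_port_active` and the choice to accept a node as ready when at least one port is ACTIVE are assumptions. On Linux the state files contain text such as "4: ACTIVE" or "1: DOWN".

```python
import glob
import os


def infiniband_port_active(sysfs_root="/sys/class/infiniband"):
    """Return True if at least one Infiniband/OPA port reports ACTIVE.

    Reads every <sysfs_root>/*/ports/*/state file. Requiring only one
    active port is an assumption; a site may want all ports to be up.
    """
    pattern = os.path.join(sysfs_root, "*", "ports", "*", "state")
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                if "ACTIVE" in handle.read():
                    return True
        except OSError:
            continue  # port vanished or is unreadable; try the next one
    return False


if __name__ == "__main__":
    import sys

    # Exit code 0 when a port is up, so a service unit could poll this script.
    sys.exit(0 if infiniband_port_active() else 1)
```

Such a check does not depend on nmcli or NetworkManager, which is the advantage pointed out above.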

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-10 Thread Ole Holm Nielsen
think of these considerations? Best regards, Ole On 2/11/2023 09:28, Ole Holm Nielsen wrote: Hi Ward, Thanks a lot for the feedback!  The method of probing /sys/class/infiniband/*/ports/*/state is also used in the NHC script lbnl_hw.nhc and has the advantage of not depending on the nmcli comma

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-13 Thread Ole Holm Nielsen
ard Poelmans wrote: Hi Ole, On 10/11/2023 15:04, Ole Holm Nielsen wrote: On 11/5/23 21:32, Ward Poelmans wrote: Yes, it's very similar. I've put our systemd unit file also online on https://gist.github.com/wpoely86/cf88e8e41ee885677082a7b08e12ae11 This might disturb the logic in wa

Re: [slurm-users] SLURM new user query, does SLURM has GUI /Web based management version also

2023-11-19 Thread Ole Holm Nielsen
On 19-11-2023 09:11, Joseph John wrote: I am a new user, trying out SLURM. I would like to check whether SLURM also has a GUI/web-based management tool. Did you read the Quick Start Administrator Guide at https://slurm.schedmd.com/quickstart_admin.html ? I don't believe there are any Slurm management tool

Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Ole Holm Nielsen
On 11/23/23 11:50, Markus Kötter wrote: On 23.11.23 10:56, Schneider, Gerald wrote: I have a recurring problem with allocated TRES, which are not released after all jobs on that node are finished. The TRES are still marked as allocated and no new jobs can be scheduled on that node using those

Re: [slurm-users] slurm comunication between versions

2023-11-23 Thread Ole Holm Nielsen
Hi Felix, On 11/23/23 18:14, Felix wrote: Will slurm-20.02, which is installed on a management node, communicate with slurm-22.05 installed on the work nodes? They have the same configuration file slurm.conf. Or do the versions have to be the same? Slurm 20.02 was installed manually and slurm 22.05

Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Ole Holm Nielsen
On 11/24/23 09:31, Gestió Servidors wrote: Some days ago, I started to configure a new server with SLURM 23.02.5. Yesterday, I read in this mailing list that version 23.11.0 was released, so today I have compiled this latest version. However, after starting slurmdbd (with a database upgrade), I

Re: [slurm-users] Slurm version 23.11 is now available

2023-11-24 Thread Ole Holm Nielsen
On 11/24/23 12:15, Ole Holm Nielsen wrote: On 11/24/23 09:31, Gestió Servidors wrote: Some days ago, I started to configure a new server with SLURM 23.02.5. Yesterday, I read in this mailing list that version 23.11.0 was released, so today I have compiled this latest version. However, after

Re: [slurm-users] RPC rate limiting for different users

2023-11-28 Thread Ole Holm Nielsen
On 11/28/23 11:59, Cutts, Tim wrote: Is the new rate limiting feature always global for all users, or is there an option, which I’ve missed, to have different settings for different users?  For example, to allow a higher rate from web services which submit jobs on behalf of a large number of us

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
Hi Xaver, On 12/6/23 09:28, Xaver Stiensmeier wrote: using https://slurm.schedmd.com/power_save.html we had one case out of many (>242) node starts that resulted in |slurm_update error: Invalid node state specified| when we called: |scontrol update NodeName="$1" state=RESUME reason=FailedSt

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
something that necessarily occurs) with the command. But I will take a closer look. I really feel like it has to be something more conditional though as otherwise the error would've occurred more often (i.e. every time when handling a fail and the command is executed). >> IHTH, Ole

Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen
On 12/6/23 11:51, Xaver Stiensmeier wrote: Good idea. Here's our current version: ``` sinfo -V slurm 22.05.7 ``` Quick googling told me that the latest version is 23.11. Does the upgrade change anything in that regard? I will keep reading. There are nice bug fixes in 23.02 mentioned in my SLU

Re: [slurm-users] SlurmdSpoolDir full

2023-12-08 Thread Ole Holm Nielsen
Hi Xaver, On 12/8/23 16:00, Xaver Stiensmeier wrote: during a larger cluster run (the same I mentioned earlier 242 nodes), I got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a directory on the workers that is used for job state information (https://slurm.schedmd.com/slurm.co

Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ole Holm Nielsen
On 10-12-2023 17:29, Ryan Novosielski wrote: This is basically always somebody filling up /tmp and /tmp residing on the same filesystem as the actual SlurmdSpoolDirectory. /tmp, without modifications, it’s almost certainly the wrong place for temporary HPC files. Too large. Agreed! That's w

Re: [slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-13 Thread Ole Holm Nielsen
On 12/13/23 07:13, John Joseph wrote: We have a Slurm setup for an HPC cluster of 4 nodes. We want to do a stress test; guidance requested for getting code which can test the functionality and efficiency of the SLURM setup. If there is such a program, we would like to try it out. Guidance requested Then pl

Re: [slurm-users] How to check the bench mark capacity of the SLURM setup

2023-12-13 Thread Ole Holm Nielsen
On 12/13/23 10:44, John Joseph wrote: Thanks for the mail, and sorry for not properly explaining what info I was requesting. What I actually meant was: how could we check how the HPC system I set up is working? E.g. a program which can be run individually on a node, and comparing ho

Re: [slurm-users] install new slurm, no slurmctld found

2023-12-15 Thread Ole Holm Nielsen
Hi Farcas, On 12/15/23 11:00, Felix wrote: we are installing a new server with slurm on ALMA Linux 9.2 Slurm support on EL9 might perhaps be a little less mature than on EL8. we did the following: dnf install slurm The result is rpm -qa | grep slurm slurm-libs-22.05.9-1.el9.x86_64 slurm-2

Re: [slurm-users] Slurm compute node with Intel 12th gen CPU

2023-12-20 Thread Ole Holm Nielsen
On 20-12-2023 15:59, Michael Bernasconi wrote: I'm trying to get slurm working on an Intel 12th gen CPU. slurmd instantly fails with the error message "Thread count (24) not multiple of core count (16)". I have tried adding "SlurmdParameters=config_overrides" to slurm.conf, and I have experimen

Re: [slurm-users] How to run one maintenance job on each node in the cluster

2023-12-23 Thread Ole Holm Nielsen
On 23-12-2023 05:09, Jeffrey Tunison wrote: Is there a straightforward way to create a batch job that runs once on every node in the cluster? A technique simpler than generating a list from sinfo output and dispatching the job in a for loop for the N nodes. That’s not very hard, but I though

Re: [slurm-users] A fairshare policy that spans multiple clusters

2024-01-05 Thread Ole Holm Nielsen
On 05-01-2024 17:26, David Baker wrote: We are soon to install a new Slurm cluster at our site. That means that we will have a total of three clusters running Slurm. Only two, that is the new clusters, will share a common file system. The original cluster has its own file system and is independent of

Re: [slurm-users] error

2024-01-18 Thread Ole Holm Nielsen
On 1/18/24 17:42, Felix wrote: I started a new AMD node, and the error is as follows: "CPU frequency setting not configured for this node" extended looks like this: [2024-01-18T18:28:06.682] CPU frequency setting not configured for this node [2024-01-18T18:28:06.691] slurmd started on Thu, 18

Re: [slurm-users] after upgrade to 23.11.1 nodes stuck in completion state

2024-01-30 Thread Ole Holm Nielsen
On 1/30/24 09:36, Fokke Dijkstra wrote: We had similar issues with Slurm 23.11.1 (and 23.11.2). Jobs get stuck in a completing state and slurmd daemons can't be killed because they are left in a CLOSE-WAIT state. See my previous mail to the mailing list for the details. And also https://bugs.sc

Re: [slurm-users] Can't start slurmdbd

2017-11-21 Thread Ole Holm Nielsen
On 11/20/2017 10:50 AM, Juan A. Cordero Varelaq wrote: Slurm 17.02.3 was installed on my cluster some time ago but recently I decided to use SlurmDBD for the accounting. After installing several packages (slurm-devel, slurm-munge, slurm-perlapi, slurm-plugins, slurm-slurmdbd and slurm-sql) and

Re: [slurm-users] Missing systemd unit files in SLURM 17.11.0 RPMs

2017-11-30 Thread Ole Holm Nielsen
On 11/30/2017 01:40 PM, Alan Orth wrote: I just built SLURM 17.11.0 on a CentOS 7 machine and was surprised to see that several systemd unit files were missing from the RPMs. For some reason the slurmdbd.service file is present though: $ rpmbuild -ta slurm-17.11.0.tar.bz2 $ rpm -qlp slurm-17.1

Re: [slurm-users] Calculating total elapsed time of all jobs, per user, per month, for a given partition?

2017-12-17 Thread Ole Holm Nielsen
We generate monthly Slurm reports using this script: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmreportmonth If you miss something from the output, you're welcome to send me suggestions. /Ole On 12/18/2017 06:09 AM, Mark London wrote: Hi - For a given partition, how can I g

Re: [slurm-users] Calculating total elapsed time of all jobs, per user, per month, for a given partition?

2017-12-18 Thread Ole Holm Nielsen
job report can be printed by sreport selecting a partition list. /Ole On 12/18/2017 2:21 AM, Ole Holm Nielsen wrote: We generate monthly Slurm reports using this script: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmreportmonth If you miss something from the output, you'

Re: [slurm-users] Limit number of CPU in a partition

2018-01-02 Thread Ole Holm Nielsen
On 01/02/2018 11:29 AM, Nicolò Parmiggiani wrote: How can I limit the number of CPUs that a partition can use? For instance, when a partition reaches its maximum number of CPUs, you can still submit new jobs, but they are put in the queue. I would think that you can't use more CPUs than you have got! A resou

Re: [slurm-users] Limit number of CPU in a partition

2018-01-02 Thread Ole Holm Nielsen
On 01/02/2018 12:59 PM, Nicolò Parmiggiani wrote: My problem is that I have, for instance, 100 CPUs, and I want to create two partitions, each with a maximum usage of 50 CPUs. In this way I can submit jobs to both partitions independently. I wonder what you really want to achieve? Why do you want to divi

[slurm-users] slurmacct: An alternative Slurm accounting report tool

2018-01-02 Thread Ole Holm Nielsen
I'm announcing a "slurmacct" script/tool as an alternative to the Slurm accounting report tool "sreport". It's available on Github: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmacct This tool prints some job statistics which we used to get from our old Torque system (see th

Re: [slurm-users] which daemons should I restart when editing slurm.conf

2018-01-04 Thread Ole Holm Nielsen
On 01/04/2018 11:05 AM, Juan A. Cordero Varelaq wrote: Hi, I have the following configuration: * head node: hosts the slurmctld and the slurmdbd daemons. * compute nodes (4): host the slurmd daemons. I need to change a couple of lines of the slurm.conf corresponding to the slurmctld. If

Re: [slurm-users] Calculating total elapsed time of all jobs, per user, per month, for a given partition?

2018-01-08 Thread Ole Holm Nielsen
was going to create a monthly report. Your script is simpler than the other script I had found, which was written in the "R" language.   Much appreciated! - Mark On 1/3/2018 2:22 AM, Ole Holm Nielsen wrote: Hi Mark, Perhaps my new script slurmacct may fit your requirements?  See: h

Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Ole Holm Nielsen
I can highly recommend EasyBuild as an easy way to provide software packages as "modules" to your cluster. We have been very pleased with EasyBuild in our cluster. I made some notes about installing EasyBuild in a Wiki page: https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules We use CentOS

Re: [slurm-users] Slurm and available libraries

2018-01-17 Thread Ole Holm Nielsen
John: I would refrain from installing the old default package "environment-modules" from the Linux distribution, since it doesn't seem to be maintained any more. Lmod, on the other hand, is actively maintained and solves some problems with the old "environment-modules" software. There's an e

Re: [slurm-users] Free Gres resources

2018-02-13 Thread Ole Holm Nielsen
On 02/13/2018 08:13 AM, Nadav Toledo wrote: Does anyone know of a way to get the number of idle GPUs per node or for the whole cluster? sinfo -o %G gives the total amount of gres resources for each node. Is there a way to get the idle amount, the same as you can get for CPUs (%C)? Perhaps if one uses lock file li

Re: [slurm-users] MariaDB lock problems for sacctmgr delete query

2018-02-16 Thread Ole Holm Nielsen
We're planning to upgrade Slurm 17.02 to 17.11 soon, so it's important for us to test the slurmdbd and database upgrade before doing the actual upgrade. I've made a *successful* upgrade of the database migration from 17.02 to 17.11, making a dry run on an offlined compute node running CentOS 7

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-02-23 Thread Ole Holm Nielsen
On 22-02-2018 21:27, Christopher Benjamin Coffey wrote: Thanks Paul. I didn't realize we were tracking energy. Looks like the best way to stop tracking energy is to specify what you want to track with AccountingStorageTRES? I'll give that a try. Perhaps it's a good idea for a lot of sites

Re: [slurm-users] maxim number of pending jobs

2018-03-08 Thread Ole Holm Nielsen
On 03/08/2018 04:00 PM, Renat Yakupov wrote: is there a limit to a maximum number of jobs that can be queued in pending state? If so, how can I find it out? Maybe this answers your question? scontrol show config | grep MaxJobCount If this was your question, you may find my script warn_maxjobs

Re: [slurm-users] maxim number of pending jobs

2018-03-08 Thread Ole Holm Nielsen
On 03/08/2018 04:37 PM, Ryan Novosielski wrote: On 03/08/2018 10:29 AM, Ole Holm Nielsen wrote: On 03/08/2018 04:00 PM, Renat Yakupov wrote: is there a limit to a maximum number of jobs that can be queued in pending state? If so, how can I find it

Re: [slurm-users] maxim number of pending jobs

2018-03-09 Thread Ole Holm Nielsen
ay have to look into slurm.conf for additional parameters. /Ole On 8 March 2018 at 16:29, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: On 03/08/2018 04:00 PM, Renat Yakupov wrote: is there a limit to a maximum number of jobs that can be queued in p

Re: [slurm-users] fast way for a node to determine its own state?

2018-03-21 Thread Ole Holm Nielsen
On 03/21/2018 11:18 AM, Alexis Huxley wrote: I'm running a node health script that needs to know the state of the node on which it is running. Currently, I'm getting the state with this: sinfo -N ... | grep `uname -n` Depending on the load on the scheduler, this can be slow. Is there fa

[slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Ole Holm Nielsen
We experience problems with MPI jobs dumping lots (1 per MPI task) of multi-GB core dump files, causing problems for file servers and compute nodes. The user has "ulimit -c 0" in his .bashrc file, but that's ignored when slurmd starts the job, and the slurmd process limits are employed in stea

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Ole Holm Nielsen
On 03/21/2018 01:08 PM, Bill Barth wrote: You could set /etc/security/limits.conf on every node to contain something like (check my syntax): * soft core 0 * hard core 0 Nice suggestion, however, processes spawned by slurmd don't read the /etc/security/limits.conf file. And make sure th

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Ole Holm Nielsen
On 03/21/2018 01:57 PM, Chris Samuel wrote: On Wednesday, 21 March 2018 11:49:53 PM AEDT Ole Holm Nielsen wrote: However, there are no /etc/pam.d/slurm.* files on our system (running Slurm 17.02). Did TACC create a special Slurm PAM configuration file, and is this documented in the public

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Ole Holm Nielsen
On 03/21/2018 02:03 PM, Bill Barth wrote: I don’t think we had to do anything special since we have UsePAM = 1 in our slurm.conf. I didn’t do the install personally, but our pam.d/slurm* files are written by us and installed by our configuration management system. Not sure which one UsePAM loo

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-21 Thread Ole Holm Nielsen
12:08:00 (+0100), Ole Holm Nielsen wrote: One working solution is to modify the slurmd Systemd service file /usr/lib/systemd/system/slurmd.service to add a line: LimitCORE=0 This is a bit off-topic, but I see this a lot, so I thought I'd provide a friendly warning. The "right"

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-22 Thread Ole Holm Nielsen
On 03/22/2018 02:10 PM, Patrick Goetz wrote: > Or even better, don't think about it. If you type > >sudo systemctl edit slurmd > > this will open an editor. Type your changes into this and save it and > systemd will set up the snippet file for you automatically (in > etc/systemd/system/slurm

Re: [slurm-users] What's the best way to suppress core dump files from jobs?

2018-03-22 Thread Ole Holm Nielsen
On 03/21/2018 08:44 PM, Michael Jennings wrote: On Wednesday, 21 March 2018, at 20:14:22 (+0100), Ole Holm Nielsen wrote: Thanks for your friendly advice! I keep forgetting about Systemd details, and your suggestions are really detailed and useful for others! Do you mind if I add your advice

Re: [slurm-users] Restoring Slurm

2018-04-09 Thread Ole Holm Nielsen
On 04/09/2018 10:54 PM, Roberts, John E. wrote: The documentation is a little unclear to me, so I was wondering how to do a complete backup and restore of Slurm for testing and/or disaster recovery. I'm looking to upgrade Slurm from 16.05.10 to the latest and I'm not sure all of what should go. I

Re: [slurm-users] Slurm setup question

2018-04-11 Thread Ole Holm Nielsen
Hi Matt, You might want to take a look at my Slurm Wiki, which focuses on CentOS/RHEL 7: https://wiki.fysik.dtu.dk/niflheim/SLURM. Complete instructions for Slurm installation, configuration, etc. are in the Wiki. /Ole On 04/11/2018 02:26 PM, Matt Hohmeister wrote: I’m brand-new to Slurm, an

Re: [slurm-users] ulimit in sbatch script

2018-04-15 Thread Ole Holm Nielsen
Hi Mahmood, It seems your compute node is configured with this limit: virtual memory (kbytes, -v) 72089600 So when the batch job tries to set a higher limit (ulimit -v 82089600) than permitted by the system (72089600), this must surely get rejected, as you have discovered! You may

Re: [slurm-users] What version I should install?

2018-04-16 Thread Ole Holm Nielsen
On 04/16/2018 08:20 PM, David Rodríguez Galiano wrote: Dear Slurm community, I am a sysadmin who needs to make a fresh installation of Slurm. When visiting the download website, I can see two different versions. The first is 17.02.10 and the second one is 17.11.5. I have not found information on

Re: [slurm-users] What version I should install?

2018-04-17 Thread Ole Holm Nielsen
On 04/17/2018 09:14 AM, David Rodríguez wrote: Thanks Chris! Thanks Ole! In fact, I followed your wiki. But I had many doubts about whether to use version 17.11 or 17.02 because I don't know the differences between them. Finally, I installed the latest one. Always install the latest and greatest ver

Re: [slurm-users] Python code for munging hostfiles

2018-04-17 Thread Ole Holm Nielsen
On 04/17/2018 10:56 AM, John Hearns wrote: Please can some kind soul remind me what the Python code for mangling Slurm and PBS machinefiles is called please? We discussed it here about a year ago, in the context of running Ansys. I have a Cunning Plan (TM) to recode it in Julia, for no real re

Re: [slurm-users] ulimit in sbatch script

2018-04-17 Thread Ole Holm Nielsen
On 04/17/2018 04:38 PM, Mahmood Naderan wrote: That parameter is used in slurm.conf. Should I modify that only on the head node? Or all nodes? Then should I restart slurm processes? Yes, definitely! I collected the detailed instructions here: https://wiki.fysik.dtu.dk/niflheim/Slurm_configurat

Re: [slurm-users] sacct not shows user

2018-04-26 Thread Ole Holm Nielsen
Hi, Did you set up Slurm accounting? Some information is in my Wiki https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting /Ole On 04/26/2018 12:20 PM, sysadmin.caos wrote: Hello, when I run "sacct", output is this:    JobID    JobName  Partition    Account  AllocCPUS State ExitCode

Re: [slurm-users] sacctmgr - bug listing accounts?

2018-04-30 Thread Ole Holm Nielsen
Hi Loris, On 04/30/2018 10:12 AM, Loris Bennett wrote: Thanks, I should have spotted that, although I don't understand the difference between 'parent' and 'organisation' and in fact asked this question: https://groups.google.com/forum/#!topic/slurm-users/f1vftgIRcVk on the subject recently.

Re: [slurm-users] sacctmgr - bug listing accounts?

2018-04-30 Thread Ole Holm Nielsen
Hi Loris, On 04/30/2018 01:09 PM, Loris Bennett wrote: Your example of how to use 'Organisation' to setup separate groups within one department is illuminating. However, I am still unable to set up 'geochemie' as a sibling of 'geophysik' and a child of 'geowiss': $ sacctmgr list acc where a

Re: [slurm-users] slurmdbd: mysql/accounting errors on 17.11.6 upgrade

2018-05-07 Thread Ole Holm Nielsen
On 05/07/2018 10:19 PM, Tina Fora wrote: Hello, I upgraded from 17.02.10 to 17.11.6 on EL6.9 and getting the errors below. Database is on EL7 mariadb-5.5. Migrating to a new version of MySQL/MariaDB requires further steps on the database (unrelated to Slurm). You must run: mysql_upgrade a

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen
On 05/08/2018 08:44 AM, Bjørn-Helge Mevik wrote: Jonathon A Anderson writes: ## Queue stuffing There is the bf_max_job_user SchedulerParameter, which is sort of the "poor man's MAXIJOB"; it limits the number of jobs from each user the backfiller will try to start on each run. It doesn't do

Re: [slurm-users] Areas for improvement on our site's cluster scheduling

2018-05-08 Thread Ole Holm Nielsen
On 05/08/2018 09:49 AM, John Hearns wrote: Actually what IS bad is users not putting cluster resources to good use. You can often see jobs which are 'stalled'  - ie the nodes are reserved for the job, but the internal logic of the job has failed and the executables have not launched. Or maybe s

Re: [slurm-users] Slurm source installation

2018-05-10 Thread Ole Holm Nielsen
On 10-05-2018 13:39, Valeriana wrote: Good Morning, I'm new to SLURM. I just installed the slurm-17.11.5.tar.bz2 source on a Master server (CentOS 7 17.08) with the following plugins: DMTCP, Padb, Hostlist, Interactive Script, mpich, openmpi, Node Health Check, PEStat, HDF5, pam_slurm, PMIx and sqlog.

Re: [slurm-users] Slurm source installation

2018-05-10 Thread Ole Holm Nielsen
On 10-05-2018 16:56, Valeriana wrote: Hi Ole! Thanks for your help. I already checked this installation, but it didn't help me much. I am not using rpm, I am installing directly from the source code (configure, make and make install process). My question is: do I need these plugins on the computat

Re: [slurm-users] X11 debug

2018-05-17 Thread Ole Holm Nielsen
On 05/17/2018 08:45 AM, Nadav Toledo wrote: Hello everyone, After fighting with x11 forwarding for a couple of weeks, I think I've got a few tips that can help others. I am using slurm 17.11.6 with builtin x11 forwarding on an Ubuntu server distro; all servers in the cluster share /home via beegfs. slu

Re: [slurm-users] Getting nodes in a partition

2018-05-18 Thread Ole Holm Nielsen
enat. On 18 May 2018 at 09:11, Mahmood Naderan <mahmood...@gmail.com> wrote: Hi, Regards, Mahmood -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark, Building 307, DK-2800 Kongens Lyngby, Denmark E-mail:

Re: [slurm-users] Upgrade woes

2018-05-31 Thread Ole Holm Nielsen
Hi Lachlan, Slurm upgrades on CentOS 7.5 should run without problems. It seems to me that your problems are unrelated to the Slurm RPMs. FWIW, I documented the Munge and Slurm installation as well as upgrade process in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation Hope

Re: [slurm-users] slurm limits

2018-06-08 Thread Ole Holm Nielsen
On 08-06-2018 10:28, Angelines wrote: I am new to Slurm. Previously I administered a cluster with PBS/MOAB, but now I have a cluster with 23 nodes, each with 40 cores, on which we have installed SLURM. I need to set limits for a user: this user can't use more than 250 cores at ti

Re: [slurm-users] Job Resource Utilization Summary Email

2018-06-13 Thread Ole Holm Nielsen
On 06/12/2018 06:06 PM, Hanby, Mike wrote: Is anyone aware of any existing job completion email scripts that provide a summary of the job's resource utilization? For example, something like: Job ID: 123456 Cluster: HPC User/Group: jdoe/jdoe State: COMPLETED (exit code 0) Cores: 1 CPU Utili

Re: [slurm-users] Generating OPA topology.conf

2018-06-14 Thread Ole Holm Nielsen
Hi Jeffrey, On 06/13/2018 10:35 PM, Jeffrey Frey wrote: Intel's OPA doesn't include the old IB net discovery library/API; instead, they have their own library to enumerate nodes, links, etc.  I've started a rewrite of ye olde "ib2slurm" utility to make use of Intel's new enumeration library.

[slurm-users] Re: Why is Slurm 20 the latest RPM in RHEL 8/Fedora repo?

2024-01-31 Thread Ole Holm Nielsen via slurm-users
On 1/31/24 09:02, Bjørn-Helge Mevik via slurm-users wrote: This isn't answering your question, but I strongly suggest you build Slurm from source. You can use the provided slurm.spec file to make rpms (we do) or use "configure + make". Apart from being able to upgrade whenever a new version is

[slurm-users] Re: URL for how to do for SLURM accounting setup

2024-02-15 Thread Ole Holm Nielsen via slurm-users
On 2/16/24 07:01, John Joseph via slurm-users wrote: we were able to set up a test SLURM-based system with 4 nodes running Ubuntu 22.04 LTS, and we were able to run COMSOL using the "comsol batch" command. Now we plan to have accounting https://slurm.schedmd.com/accounting.html

[slurm-users] Slurm management of dual-node server trays?

2024-02-23 Thread Ole Holm Nielsen via slurm-users
any ideas and insights! Ole [1] https://lenovopress.lenovo.com/lp1612-lenovo-thinksystem-sd665-v3-server [2] https://lenovopress.lenovo.com/lp1693-thinksystem-nvidia-connectx-7-ndr200-infiniband-qsfp112-adapters [3] https://support.lenovo.com/us/en/solutions/ht510888-thinksystem-sd650-and-connectx-6-hdr-sharedio-

[slurm-users] Re: Slurm management of dual-node server trays?

2024-03-07 Thread Ole Holm Nielsen via slurm-users
ats a Very interesting design and, looking at the SD665 V3 documentation, am I correct that each node has dual 25 Gb/s SFP28 interfaces? If so, then despite dual nodes in a 1U configuration, you actually have 2 separate servers? Sid On Fri, 23 Feb 2024, 22:40 Ole Holm Nielsen via slurm-users, mailto:s

[slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource

2024-03-14 Thread Ole Holm Nielsen via slurm-users
Hi Simon, Maybe you could print the user's limits using this tool: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits Which version of Slurm do you run? /Ole On 3/14/24 17:47, Simon Andrews via slurm-users wrote: Our cluster has developed a strange intermittent behaviour

[slurm-users] Re: Lua script

2024-03-20 Thread Ole Holm Nielsen via slurm-users
What are the contents of your /etc/slurm/job_submit.lua file? Did you reconfigure slurmctld? Check the log file with: grep job_submit /var/log/slurm/slurmctld.log What is your Slurm version? You can read about job_submit plugins in this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_conf

[slurm-users] Munge log-file fills up the file system to 100%

2024-04-15 Thread Ole Holm Nielsen via slurm-users
s seen the present Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes? Thanks for sharing your insights, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark -- slurm-users mailing

[slurm-users] Re: Munge log-file fills up the file system to 100%

2024-04-16 Thread Ole Holm Nielsen via slurm-users
I can't figure out if 0.5.16 has a fix for the issue seen here? Questions: Have other sites seen the present Munge issue as well? Are there any good recommendations for setting the fs.file-max parameter on Slurm compute nodes? Thanks for sharing your insights, Ole -- Ole Holm
