Re: [slurm-users] Generating OPA topology.conf

2018-06-14 Thread Ole Holm Nielsen
d libopamgt-devel packages. So, since you seem to use the same version as me, I'm not sure why you have these linking problems :/ Best Marcus On 06/14/2018 09:17 AM, Ole Holm Nielsen wrote: Hi Jeffrey, On 06/13/2018 10:35 PM, Jeffrey Frey wrote: Intel's OPA doesn't include the

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2018-07-18 Thread Ole Holm Nielsen
On 07/18/2018 10:56 AM, Roshan Thomas Mathew wrote: We ran into this issue trying to move from 16.05.3 -> 17.11.7 with 1.5M records in job table. In our first attempt, MySQL reported "ERROR 1206 The total number of locks exceeds the lock table size" after about 7 hours. Increased InnoDB Buff

[slurm-users] Execute parallel commands on all nodes running jobs of a particular user

2018-07-19 Thread Ole Holm Nielsen
Hi Slurm users, We have found the need to execute a parallel command on all nodes running jobs belonging to a particular user. I have made a configuration to the excellent ClusterShell tool as documented in https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell If you add a "slurmuser" secti

Re: [slurm-users] Execute parallel commands on all nodes running jobs of a particular user

2018-08-07 Thread Ole Holm Nielsen
On 06-08-2018 12:53, Bjørn-Helge Mevik wrote: There is also a Slurm plugin for pdsh (unfortunately not enabled in the default redhat/centos RPMs) that lets you run a command on each node belonging to a specific job with "pdsh -j ". Not exactly the same, though. :) Bjørn, that is a different t

Re: [slurm-users] slurmdbd upgrade startup error

2018-08-14 Thread Ole Holm Nielsen
Hi Tina, Is it the same OS version for 17.02 and 17.11, or are you upgrading the OS (and possibly the MySQL/MariaDB) at the same time? I assume you're testing the Slurm upgrade on a test server and not the production cluster? Did you check the steps mentioned in the thread "slurmdbd: mysql/

[slurm-users] Any information about the Slurm User Group Meeting 2018?

2018-09-10 Thread Ole Holm Nielsen
Regarding the Slurm User Group Meeting 2018 coming up in Madrid, Spain in two weeks from now: Has anyone heard information about hotels and the schedule? The official page https://slurm.schedmd.com/slurm_ug_agenda.html was last updated on May 30... /Ole

Re: [slurm-users] Any information about the Slurm User Group Meeting 2018?

2018-09-10 Thread Ole Holm Nielsen
hotel? Thanks, Ole On 10-09-2018 17:33, Jacob Jenson wrote: Ole, You can find hotels close to CIEMAT here https://drive.google.com/open?id=1eEKgnlBXeYNO426QS7nPuDS4nm8aUpnH&usp=sharing Jacob On Mon, Sep 10, 2018 at 1:23 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:

Re: [slurm-users] Create users

2018-09-12 Thread Ole Holm Nielsen
On 12-09-2018 18:21, Andre Torres wrote: I’m new to slurm and I’m confused regarding user creation. I have an installation with 1 login node and 5 compute nodes. If I create a user across all the nodes with the same uid and gid I can execute jobs but I can’t understand the difference between us

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-27 Thread Ole Holm Nielsen
On 09/27/2018 10:33 AM, Bjørn-Helge Mevik wrote: Baker D.J. writes: I guess that the question that comes to mind is.. Is it a really big deal if the slurmctld process is down whilst the slurmdbd is being upgraded? I tend to always stop slurmctld before upgrading slurmdbd, and have never noti

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-28 Thread Ole Holm Nielsen
On 09/27/2018 05:12 PM, Christopher Benjamin Coffey wrote:> 2. Purge/archive unneeded jobs/steps before the upgrade, to make the upgrade as quick as possible: slurmdbd.conf: ArchiveDir=/common/adm/slurmdb_archive ArchiveEvents=yes ArchiveJobs=yes ArchiveSteps=no ArchiveResvs=no ArchiveSuspend=
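
The archive settings quoted above can be checked before an upgrade with a small script. This is only a minimal sketch: the `parse_conf` helper and the `SAMPLE` text are hypothetical illustrations, and the sample contains only the settings visible in the snippet (the truncated `ArchiveSuspend=` value is left out because it is not known).

```python
def parse_conf(text):
    """Parse simple Key=Value lines, ignoring blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical excerpt containing only the values visible in the message above.
SAMPLE = """\
ArchiveDir=/common/adm/slurmdb_archive
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveSteps=no
ArchiveResvs=no
"""

conf = parse_conf(SAMPLE)
# Record types that will be archived when they are purged.
archived = sorted(
    key for key, value in conf.items()
    if key.startswith("Archive") and value.lower() == "yes"
)
print("Archive directory:", conf.get("ArchiveDir"))
print("Archived record types:", archived)
```

In practice the text would be read from the site's own slurmdbd.conf rather than from an inline string.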

Re: [slurm-users] CPU allocation within a node is not cyclic

2018-10-06 Thread Ole Holm Nielsen
On 06-10-2018 04:15, 崔灏 (CUI Hao) wrote: $ scontrol reconfigure slurm_reconfigure error: SelectType change requires restart of the slurmctld daemon to take effect I'm afraid that restarting slurmctld will interrupt current tasks, so I'm still waiting for them to finish. There should be no prob

Re: [slurm-users] slurmdbd not showing job accounting

2018-10-14 Thread Ole Holm Nielsen
On 14-10-2018 06:30, Steven Dick wrote: I've found that when creating a new cluster, slurmdbd does not function correctly right away. It may be necessary to restart slurmdbd at several points during the slurm installation process to get everything working correctly. Also, slurmctld will buffer

Re: [slurm-users] slurmdbd not showing job accounting

2018-10-14 Thread Ole Holm Nielsen
, and what difference it makes when slurmdbd is restarted repeatedly. Are you up for this task? /Ole On Sun, Oct 14, 2018 at 4:12 AM Ole Holm Nielsen wrote: Correct, and this is documented in the Slurm accounting setup page: https://slurm.schedmd.com/accounting.html#database-configuration

Re: [slurm-users] Qlustar 10.1 adds CentOS/OpenHPC support

2018-10-16 Thread Ole Holm Nielsen
Hi Roland, That website is improperly configured. My Firefox browser says: qlustar.com uses an invalid security certificate. The certificate is only valid for the following names: docs.qlustar.com, www.qlustar.com Error code: SSL_ERROR_BAD_CERT_DOMAIN /Ole On 10/16/2018 02:27 PM, Roland F

Re: [slurm-users] User permissions on submitted jobs

2018-10-17 Thread Ole Holm Nielsen
On 17-10-2018 20:13, Aravindh Sampathkumar wrote: I built a SLURM cluster and am able to successfully run jobs as root. However, when I try to submit jobs as a regular user, I hit permission problems. username@console:[~] > srun -N1 /bin/hostname slurmstepd: error: couldn't chdir to `/usr/home

Re: [slurm-users] pam_slurm_adopt does not constrain memory?

2018-10-25 Thread Ole Holm Nielsen
On 10/25/2018 07:00 AM, Christopher Samuel wrote: On 25/10/18 2:29 pm, Christopher Samuel wrote: Could explain why this isn't something we see consistently, and why we're both seeing it currently. This seems to be a handy way to find any processes that are not properly constrained by Slurm c

[slurm-users] Updated Slurm tool "pestat" (Processor Element status)

2018-11-21 Thread Ole Holm Nielsen
nted after each jobid/user -C: Color output is forced ON -c: Color output is forced OFF -h: Print this help information -V: Version information My monitoring of jobs is usually done simply with "pestat -F", and also with "pestat -s mix". /Ole -

Re: [slurm-users] How to check the percent cpu of a job?

2018-11-21 Thread Ole Holm Nielsen
Hi Yalei, On 21-11-2018 18:51, 宋亚磊 wrote: How to check the percent cpu of a job in slurm? I tried sacct, sstat, squeue, but I can't find that how to check. Can someone help me? I would recommend my "pestat" tool, which was also announced on the list today. The CPUload is one of the many sta

Re: [slurm-users] How to check the percent cpu of a job?

2018-11-21 Thread Ole Holm Nielsen
On 21-11-2018 19:41, Ryan Novosielski wrote: Ole’s “pestat” script does allow you to get similar information, but I’m interested to see if indeed there’s a better answer. I’ve used his script for more or less the same reason, to see if the jobs are using the resources they’re allocated. They s

Re: [slurm-users] How to check the percent cpu of a job?

2018-11-22 Thread Ole Holm Nielsen
On 11/22/2018 12:10 AM, Christopher Samuel wrote: I've just had a quick play with pestat and it reveals that Slurm 18.08.3 seems to have some odd ideas about load on nodes, for instance one of our KNL nodes that is offline is reported with a CPUload of 2.70, but I can see nothing running on it an

Re: [slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

2018-11-30 Thread Ole Holm Nielsen
On 29-11-2018 19:27, Christopher Benjamin Coffey wrote: We've been noticing an issue with nodes from time to time that become "wedged", or unusable. This is a state where ps, and w hang. We've been looking into this for a while when we get time and finally put some more effort into it yesterday

Re: [slurm-users] Accounting: Default Associations for Unknown Accounts

2018-12-21 Thread Ole Holm Nielsen
FWIW, I have made some scripts to automate the creation of Slurm accounts from the passwd database (not LDAP), see https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmaccounts I hope this helps you getting started with Slurm. -- Ole Holm Nielsen PhD, Senior HPC Officer Department o

[slurm-users] New checktopology tool: Check consistency of /etc/slurm/topology.conf with nodelist in /etc/slurm/slurm.conf

2019-01-21 Thread Ole Holm Nielsen
d001 d002 d003 *** *** 595,600 --- 600,606 i048 i049 i050 + i051 x001 x002 x003 Comments and suggestions are most welcome! FYI: My Slurm Wiki contains available information about adding/removing nodes: https://wiki.fysik.dtu.dk/niflheim/SLURM#add-and-remove-nodes -- Ole Holm N

Re: [slurm-users] New checktopology tool: Check consistency of /etc/slurm/topology.conf with nodelist in /etc/slurm/slurm.conf

2019-01-21 Thread Ole Holm Nielsen
Hi Bjørn-Helge, Thanks: On 1/21/19 12:37 PM, Bjørn-Helge Mevik wrote: Two more details/enhancements: 1) Sites which use node names like c[1-20]-[1-36], would benefit from "sort -V" instead of just sort -- otherwise c10-12 will be listed before c2-12, for instance. (For sites that use names li
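
The ordering problem described above (c10-12 listed before c2-12) can be illustrated with a natural sort. This is a rough sketch of the idea behind "sort -V", not a reimplementation of it; the `natural_key` helper and the node names are made up for illustration.

```python
import re


def natural_key(name):
    """Split a node name into text and integer chunks so numbers compare numerically."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


# Hypothetical node names of the form c[1-20]-[1-36].
nodes = ["c10-12", "c2-12", "c1-3", "c2-2"]

print("plain sort:  ", sorted(nodes))
print("natural sort:", sorted(nodes, key=natural_key))
```

The plain sort compares character by character, which is why "c10-12" ends up ahead of "c2-12"; the natural sort compares the embedded numbers as integers.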

Re: [slurm-users] New checktopology tool: Check consistency of /etc/slurm/topology.conf with nodelist in /etc/slurm/slurm.conf

2019-01-21 Thread Ole Holm Nielsen
On 1/21/19 12:18 PM, Bjørn-Helge Mevik wrote: Ole Holm Nielsen writes: Comments and suggestions are most welcome! Splendid tool! I immediately found that I'd forgotten to take a few nodes out of the topology definition. :) Me too :-) One thing: The script doesn't work i

Re: [slurm-users] New Bright Cluster Slurm issue for AD users

2019-02-13 Thread Ole Holm Nielsen
ailable in this page: https://c4science.ch/source/slurm-accounts/ -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Slurmd not starting

2019-02-13 Thread Ole Holm Nielsen
Hi Nathalie, Which Slurm version and which OS version are you using? FYI: My Slurm Wiki contains all the details of setting up Slurm on CentOS 7: https://wiki.fysik.dtu.dk/niflheim/SLURM Best regards, Ole On 2/13/19 2:58 PM, Nathalie Gocht wrote: Hey, I am building up a one node cluster. M

Re: [slurm-users] maximum size of array jobs

2019-02-26 Thread Ole Holm Nielsen
On 2/26/19 9:07 AM, Marcus Wagner wrote: Does anyone know, why per default the number of array elements is limited to 1000? We have one user, who would like to have 100k array elements! What is more difficult for the scheduler, one array job with 100k elements or 100k non-array jobs? Where

[slurm-users] Migrate the slurmdbd service to another server

2019-03-01 Thread Ole Holm Nielsen
cluster. Upgrading slurmctld and slurmd is another topic, and this is discussed in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm. I'd appreciate comments and suggestions about my procedure. /Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Phys

Re: [slurm-users] Migrate the slurmdbd service to another server

2019-03-04 Thread Ole Holm Nielsen
ctld but with the reorg I'm taking a downtime for the dbd upgrade. That's not too bad though as we pause all our jobs out of paranoia for upgrades. My strategy is to avoid any downtime at all = lost productivity. /Ole On 3/1/19 8:10 AM, Ole Holm Nielsen wrote: We're one of t

Re: [slurm-users] Migrate the slurmdbd service to another server

2019-03-04 Thread Ole Holm Nielsen
On 3/4/19 2:26 PM, Loris Bennett wrote: Ole Holm Nielsen writes: We're one of the many Slurm sites which run the slurmdbd database daemon on the same server as the slurmctld daemon. This works without problems at our site given our modest load, however, SchedMD recommends to run the da

Re: [slurm-users] Migrate the slurmdbd service to another server

2019-03-04 Thread Ole Holm Nielsen
On 04-03-2019 16:30, Loris Bennett wrote: On 3/4/19 2:26 PM, Loris Bennett wrote: Ole Holm Nielsen writes: We're one of the many Slurm sites which run the slurmdbd database daemon on the same server as the slurmctld daemon. This works without problems at our site given our modest

Re: [slurm-users] How to list available CPUs/GPUs for jobs

2019-03-08 Thread Ole Holm Nielsen
On 3/8/19 1:59 PM, Frava wrote: I'm replying to the "[slurm-users] Available gpus ?" post. Some time ago I wrote a BASHv4 script for listing the available CPU/RAM/GPU on the nodes. It parses the output of the "scontrol -o -d show node" command and displays what I think is needed to launch GPU

Re: [slurm-users] Database Tuning w/SLURM

2019-03-22 Thread Ole Holm Nielsen
On 3/21/19 6:56 PM, Ryan Novosielski wrote: On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote: Our last cluster only hit around 2.5 million jobs after around 6 years, so database conversion was never an issue. For sites with a higher-throughput things may be different, but I would hope that

Re: [slurm-users] Changing node weights in partitions

2019-03-22 Thread Ole Holm Nielsen
On 3/22/19 4:15 PM, José A. wrote: Dear all, I would like to create two partitions, A and B, in which node1 had a certain weight in partition A and a different one in partition B. Does anyone know how to implement it? Some pointers to documentation of this and a practical example is in my W

Re: [slurm-users] Changing node weights in partitions

2019-03-22 Thread Ole Holm Nielsen
filling node2. Can I accomplish this behavior through weighting the nodes? With your example I’m afraid to say it’s not still clear to me how. Thanks a lot for your help. José On 22. Mar 2019, at 16:29, Ole Holm Nielsen wrote: On 3/22/19 4:15 PM, José A. wrote: Dear all, I would like to

Re: [slurm-users] Changing node weights in partitions

2019-03-24 Thread Ole Holm Nielsen
Hi José, On 23-03-2019 19:59, Jose A wrote: You got my point. I want a way in which a partition influences the priority with a node takes new jobs. Any tip will be really appreciated. Thanks a lot. Would PriorityWeightPartition as defined with the Multifactor Priority Plugin (https://slurm.

Re: [slurm-users] spart: A user-oriented partition info command for slurm

2019-03-27 Thread Ole Holm Nielsen
Hi Ahmet, On 3/27/19 10:51 AM, mercan wrote: Except for the sjstat script, Slurm does not contain a command to show user-oriented partition info. I wrote a command. I hope you will find it useful. https://github.com/mercanca/spart Thanks for a very useful new Slurm command! /Ole

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-02 Thread Ole Holm Nielsen
Hi Lech, IMHO, the Slurm user community would benefit the most from your interesting work on MySQL/MariaDB performance, if your patch could be made against the current 18.08 and the coming 19.05 releases. This would ensure that your work is carried forward. Would you be able to make patches

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-03 Thread Ole Holm Nielsen
t offset. Kind regards, Lech On 02.04.2019 at 15:18, Ole Holm Nielsen wrote: Hi Lech, IMHO, the Slurm user community would benefit the most from your interesting work on MySQL/MariaDB performance, if https://bugs.schedmd.com/show_bug.cgi?id=6796 your patch could be made against the curren

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-03 Thread Ole Holm Nielsen
, Lech Nieroda wrote: Hi Ole, On 03.04.2019 at 12:53, Ole Holm Nielsen wrote: SchedMD already decided that they won't fix the problem: Yes, I guess it’s a bit late in the release lifecycles. Nevertheless it’s a pity, as there are certainly a lot of users around who’d rather not upgrade

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-05 Thread Ole Holm Nielsen
Hi Lech, I've tried to summarize your work on the Slurm database upgrade patch in my Slurm Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#database-upgrade-from-slurm-17-02-and-older Could you kindly check if my notes are correct and complete? Hopefully this Wiki will also h

Re: [slurm-users] Extreme long db upgrade 16.05.6 -> 17.11.3

2019-04-05 Thread Ole Holm Nielsen
s the default for RHEL6 but it’s the default for RHEL7, isn’t it? Assuming that you use RHEL7/CentOS7 with mysql 5.5, have you checked how long your upgrade would take with the patch? Kind regards, Lech >> -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical Un

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Ole Holm Nielsen
Hi Julien, Did you optimize the MySQL database, in particular InnoDB? I have collected some documentation in my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_database#mysql-configuration and I also discuss database purging. Please note that we run Slurm 17.11 (and recently 18.08) on Cent

Re: [slurm-users] slurmdbd purge not working

2019-04-05 Thread Ole Holm Nielsen
On 4/5/19 4:28 PM, Julien Rey wrote: The failure occurs after a few minutes (~10). And we are running out of space on the slurm controller. The mysql daemon is at 100% CPU usage all the time. This issue is becoming critical. ... Our slurm accounting database is growing bigger and bigger (more

Re: [slurm-users] Having Issue in Slurm cluster setup

2019-04-08 Thread Ole Holm Nielsen
On 09-04-2019 07:37, sudhagar s wrote: Hi, Iam newbee in slurm. trying to setup a cluster for ML training purpose. i created controle node and compute node. both are up and running. when i enter "srun -N 1 hostname" it says " srun error memory specification can not be satisfied" "unable to allo

Re: [slurm-users] Having Issue in Slurm cluster setup

2019-04-08 Thread Ole Holm Nielsen
On 09-04-2019 08:25, sudhagar s wrote: Thanks For the response. here is my node  and partition information: Well, 1 MB of real memory in the node is not a lot :-) This reminds me of the very old days where PCs had 640 kB RAM... On Tue, Apr 9, 2019 at 11:53 AM Ole Holm Nielsen

Re: [slurm-users] Having Issue in Slurm cluster setup

2019-04-08 Thread Ole Holm Nielsen
onfigured slurm.conf incorrectly. On Tue, Apr 9, 2019 at 11:53 AM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: On 09-04-2019 07:37, sudhagar s wrote: > Hi, Iam newbee in slurm. trying to setup a cluster for ML training > purpose. i created controle node and c

Re: [slurm-users] Having Issue in Slurm cluster setup

2019-04-09 Thread Ole Holm Nielsen
"). The default value is 1. On 4/9/19 8:47 AM, sudhagar s wrote: Attaching my slurm.conf file. can you please help me to find the issue. On Tue, Apr 9, 2019 at 12:08 PM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: On 09-04-2019 08:33, sudhagar s wrote: >

Re: [slurm-users] How to get a summary of the use of compute nodes and/or partition of a cluster in real time ?

2019-04-30 Thread Ole Holm Nielsen
Hi Jean-Mathieu, On 4/30/19 2:47 PM, Jean-mathieu CHANTREIN wrote: Do you know a command to get a summary of the use of compute nodes and/or partition of a cluster in real time ? Something with a output like this: $ sutilization Partition/Node_Name CPU_Use CPU_Total %Use standard    236   

Re: [slurm-users] How to get a summary of the use of compute nodes and/or partition of a cluster in real time ?

2019-04-30 Thread Ole Holm Nielsen
On 30-04-2019 17:47, Jean-mathieu CHANTREIN wrote: Hello. That's exactly what I need. Thank you very much for your work. It surprises me that slurm does not provide an official solution for that ... Is there a page listing the tools (such as this one) that are being developed by the community?

Re: [slurm-users] Nodes not responding... how does slurm track it?

2019-05-15 Thread Ole Holm Nielsen
On 15-05-2019 09:34, Barbara Krašovec wrote: It could be a problem with ARP cache. If the number of devices approaches 512, there is a kernel limitation in dynamic ARP-cache size and it can result in the loss of connectivity between nodes. This is something every cluster owner should be awar
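
One way to see whether a node is close to the ARP-cache limit mentioned above is to compare the kernel's neighbour-table thresholds with the number of devices on the subnet. This is a sketch under assumptions: it takes the 512 figure to correspond to the Linux `gc_thresh2` sysctl (whose usual default is 512), and the helper names and the example device count are hypothetical.

```python
from pathlib import Path

# Location of the IPv4 neighbour (ARP) table sysctls on Linux.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")


def read_gc_thresholds(base=NEIGH_DIR):
    """Return the gc_thresh1..3 values as integers, skipping files that are absent."""
    result = {}
    for name in ("gc_thresh1", "gc_thresh2", "gc_thresh3"):
        path = base / name
        if path.is_file():
            result[name] = int(path.read_text().strip())
    return result


def arp_cache_too_small(num_devices, soft_limit):
    """True if the number of devices on the subnet reaches the soft limit."""
    return num_devices >= soft_limit


thresholds = read_gc_thresholds()
print("Current thresholds:", thresholds)

# Hypothetical subnet size; replace with the real number of devices.
num_devices = 600
soft_limit = thresholds.get("gc_thresh2", 512)
if arp_cache_too_small(num_devices, soft_limit):
    print(f"{num_devices} devices >= gc_thresh2={soft_limit}: consider raising the limits")
```

On systems without these files the script falls back to 512 for the comparison and prints an empty threshold dictionary.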

Re: [slurm-users] Submit job using srun fails but sbatch works

2019-05-29 Thread Ole Holm Nielsen
Hi Alexander, The error "can't find address for host cn7" would indicate a DNS problem. What is the output of "host cn7" from the srun host li1? How many network devices are in your subnet? It may be that the Linux kernel is doing "ARP cache thrashing" if the number of devices approaches 51

Re: [slurm-users] Tiny feature request: sacct.1 man page should list SACCT_FORMAT

2019-06-26 Thread Ole Holm Nielsen
On 6/26/19 12:23 PM, John Marshall wrote: I have had $SQUEUE_FORMAT set in my environment for a long time, but have only today learnt that sacct will also listen to an environment variable to set a default output format. Previously I had only looked for it in the Environment Variables section

Re: [slurm-users] Tiny feature request: sacct.1 man page should list SACCT_FORMAT

2019-06-26 Thread Ole Holm Nielsen
On 6/26/19 1:14 PM, John Marshall wrote: On 26 Jun 2019, at 11:51, Ole Holm Nielsen wrote: You should open a case with SchedMD containing your patch: https://bugs.schedmd.com/ Yes, I considered creating a Bugzilla account at SchedMD so that I could send them a three-line patch. To be

Re: [slurm-users] getting closer

2019-06-28 Thread Ole Holm Nielsen
On 6/28/19 9:18 AM, Valerio Bellizzomi wrote: On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: The nodes are now communicating however when I run the command srun -w compute02 /bin/ls it remains stuck and there is no out

Re: [slurm-users] getting closer

2019-06-28 Thread Ole Holm Nielsen
On 6/28/19 9:57 AM, Valerio Bellizzomi wrote: On Fri, 2019-06-28 at 09:39 +0200, Ole Holm Nielsen wrote: On 6/28/19 9:18 AM, Valerio Bellizzomi wrote: On Fri, 2019-06-28 at 08:51 +0200, Valerio Bellizzomi wrote: On Thu, 2019-06-27 at 18:35 +0200, Valerio Bellizzomi wrote: The nodes are now

Re: [slurm-users] Installation troubles

2019-07-01 Thread Ole Holm Nielsen
On 01-07-2019 21:47, HELLMERS Joe wrote: I’m having trouble installing Slurm 18.08.7 on Red Hat 7.3. I installed munge from source. It may be easier for you to install Slurm with RPMs. A complete guide is in my Slurm Wiki pages: https://wiki.fysik.dtu.dk/niflheim/SLURM https://wiki.fysik.d

Re: [slurm-users] dual slurmctld and slurmdbd

2019-07-03 Thread Ole Holm Nielsen
On 7/2/19 10:48 PM, Tina Fora wrote: We run mysql on a dedicated machine with slurmctld and slurmdbd running on another machine. Now I want to add another machine running slurmctld and slurmdbd and this machine will be on CentOS 7. The existing one is CentOS 6. Is this possible? Can I run two seperat

Re: [slurm-users] Hints, Cheatsheets, etc

2019-07-08 Thread Ole Holm Nielsen
Hi Edward, Besides my Slurm Wiki page https://wiki.fysik.dtu.dk/niflheim/SLURM, I have written a number of tools which we use for monitoring our cluster, see https://github.com/OleHolmNielsen/Slurm_tools. I recommend in particular these tools: * pestat Prints a Slurm cluster nodes status wi

Re: [slurm-users] Jobs waiting while plenty of cpu and memory available

2019-07-08 Thread Ole Holm Nielsen
Hi Edward, The squeue command tells you about job status. You can get extra information using format options (see the squeue man-page). I like to set this environment variable for squeue: export SQUEUE_FORMAT="%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R" Wh
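
To make the format string above easier to read, it can be split into its individual field specifications. This is a small illustrative sketch; the `parse_format` helper is hypothetical and relies only on the squeue convention that a field is written as `%[[.]size]type`, where the leading dot requests right justification.

```python
import re

# The format string quoted in the message above.
SQUEUE_FORMAT = (
    "%.18i %.9P %.6q %.8j %.8u %.8a %.10T %.9Q %.10M %.10V %.9l %.6D %.6C %m %R"
)

# %[[.]size]type : optional dot, optional width, single-letter field type.
FIELD_RE = re.compile(r"%(\.?)(\d*)([A-Za-z])")


def parse_format(fmt):
    """Return one dict per field with its type letter, width and justification."""
    fields = []
    for dot, width, letter in FIELD_RE.findall(fmt):
        fields.append({
            "type": letter,
            "width": int(width) if width else None,
            "right_justified": dot == ".",
        })
    return fields


for field in parse_format(SQUEUE_FORMAT):
    print(field)
```

The meaning of each type letter is listed in the squeue man-page; the sketch deliberately does not try to reproduce that table.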

Re: [slurm-users] Slurm topology.conf file

2019-07-09 Thread Ole Holm Nielsen
On 7/9/19 9:04 AM, Priya Mishra wrote: Hi, I am using the slurmibtopology tool to generate the topology.conf file from the cluster at my institute which gives me a file with around 400 nodes. I need a topology file with a larger no of nodes for further use. Is there anyway of generating a synt

Re: [slurm-users] Slurm topology.conf file

2019-07-09 Thread Ole Holm Nielsen
On 7/9/19 10:14 AM, Priya Mishra wrote: Hi Ole, I am using slurm emulator and would soon start working with the slurm simulator. I need these larger topology files for the purpose of a project and not actual job scheduling. If there are any suitable resources for me to use, please let me know.

[slurm-users] ANNOUNCE: A new showpartitions tool

2019-07-15 Thread Ole Holm Nielsen
s been really supportive in testing showpartitions during development and comparing the output to spart. Thorsten Deilmann from University of Wuppertal has offered a number of useful suggestions, including the colored output. Best regards, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] Fwd: Getting information about AssocGrpCPUMinutesLimit for a job

2019-08-11 Thread Ole Holm Nielsen
Andreas made a good suggestion of looking at the user's TRESRunMin from sshare in order to answer Jeff's question about AssocGrpCPUMinutesLimit for a job. However, getting at this information is in practice really complicated, and I don't think any ordinary user will bother to look it up. Due

[slurm-users] ANNOUNCE: A new showuserlimits tool for printing Slurm user resource limits and usage

2019-08-21 Thread Ole Holm Nielsen
il. Best regards, Ole -- Ole Holm Nielsen PhD, Senior HPC Officer Department of Physics, Technical University of Denmark

Re: [slurm-users] ticking time bomb? launching too many jobs in parallel

2019-08-27 Thread Ole Holm Nielsen
Hi Guillaume, The performance of the slurmctld server depends strongly on the server hardware on which it is running! This should be taken into account when considering your question. SchedMD recommends that the slurmctld server should have only a few, but very fast CPU cores, in order to e

[slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-02 Thread Ole Holm Nielsen
And how can users specify the minimum *Available* disk space required by their jobs submitted by "sbatch"? If this is not feasible, are there other techniques that achieve the same goal? We're currently still at Slurm 18.08. Thanks, Ole -- Ole Holm Nielsen PhD, Senior

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-03 Thread Ole Holm Nielsen
the idea is to make the prolog set up a "project" disk quota for the job on the localtmp file system, and the epilog to remove it again. I'm not 100% sure we will make it work, but I'm hopeful. Fingers crossed! :) On 9/2/19 8:02 PM, Ole Holm Nielsen wrote:> We have some u

Re: [slurm-users] slurm node weights

2019-09-08 Thread Ole Holm Nielsen
You should be able to assign node weights to accommodate your prioritization wishes. I've summarized this setting in my Slurm Wiki page: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-weight I hope this helps. /Ole On 9/5/19 5:48 PM, Douglas Duckworth wrote: Hello We added some

Re: [slurm-users] How can jobs request a minimum available (free) TmpFS disk space?

2019-09-10 Thread Ole Holm Nielsen
have an NHC check "check_fs_used /scratch 90%"). Best regards, Ole On 10-09-2019 20:41, Michael Jennings wrote: On Monday, 02 September 2019, at 20:02:57 (+0200), Ole Holm Nielsen wrote: We have some users requesting that a certain minimum size of the *Available* (i.e., free) T

Re: [slurm-users] Removing user from slurm configuration

2019-10-10 Thread Ole Holm Nielsen
sacctmgr delete user XXX I would also like to mention my Slurm account and user updating tools: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmaccounts /Ole On 10/10/19 1:41 PM, Mahmood Naderan wrote: Hi I had created multiple test users, and then removed them. However, I see t

Re: [slurm-users] How to find core count per job per node

2019-10-18 Thread Ole Holm Nielsen
On 18-10-2019 19:56, Tom Wurgler wrote: I need to know how many cores a given job is using per node. Say my nodes have 24 cores each and I run a 36 way job. It takes a node and a half. scontrol show job id shows me 36 cores, and the 2 nodes it is running on. But I want to know how it split the job

Re: [slurm-users] Running mix versions of slurm while upgrading

2019-10-20 Thread Ole Holm Nielsen
FWIW, you may be interested in my Wiki on upgrading Slurm: https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm You should also read the pages on Upgrading in the presentation Technical: Field Notes From A MadMan, Tim Wickberg, SchedMD from last month's Slurm User Group meeting

Re: [slurm-users] [EXT] Re: How to find core count per job per node

2019-10-21 Thread Ole Holm Nielsen
-- *From:* slurm-users on behalf of Ole Holm Nielsen *Sent:* Friday, October 18, 2019 2:15 PM *To:* slurm-users@lists.schedmd.com *Subject:* [EXT] Re: [slurm-users] How to find core count per job per node WARNING: This is an EXTERNAL email. Please think before RESPONDING or CLICKING

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
Hi, Maybe my Slurm Wiki can help you build Slurm on CentOS/RHEL 7? See https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#build-slurm-rpms Note in particular: Important: Install the MariaDB (a replacement for MySQL) packages before you build Slurm RPMs (otherwise some libraries will be mis

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
iaDB-shared on every server that will run slurmd, i.e. all compute nodes. I expect that if I looked harder at the build options there may be a way to do this, perhaps with linker flags. For now, I can progress. Thanks William -Original Message- From: slurm-users On Behalf Of Ole Ho

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
's required by any of the mariadb packages, it'll get pulled automatically. If not, you don't need it on the build system. On 11/11/19 10:56 PM, Ole Holm Nielsen wrote: Hi William, Interesting experiences with MariaDB 10.4!  I tried to collect the instructions from the MariaDB p

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-11 Thread Ole Holm Nielsen
On 11/12/19 8:10 AM, Nguyen Dai Quy wrote: I have the same issue by compiling RPM. Just add "--with mysql" at rpmbuild option and the error gone :-) HTH, That's an interesting observation! Do you know what the "--with mysql" actually does? IMHO, the Slurm .spec file should include all requ

Re: [slurm-users] RPM build error - accounting_storage_mysql.so

2019-11-12 Thread Ole Holm Nielsen
Hi Daniel, Thanks for sharing your insights! I have updated my Wiki page https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#install-mariadb-database now. /Ole On 11/12/19 8:52 AM, Daniel Letai wrote: On 11/12/19 9:34 AM, Ole Holm Nielsen wrote: On 11/11/19 10:14 PM, Daniel Letai

Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Ole Holm Nielsen
On 13-11-2019 18:04, Bas van der Vlies wrote: We currently have version 18.08.7 installed on our cluster and want to upgrade to 19.05.3. So I wanted to start small and installed it on one of our compute nodes. But if I start the 'slurmd' then our slurmctld will complain that: {{{ 2019-11-13T17:49

Re: [slurm-users] Slurm 19.05: can not submit job

2019-11-28 Thread Ole Holm Nielsen
On 11/28/19 10:35 AM, Nguyen Dai Quy wrote: Hi list, I can not submit my job: > sbatch submit.sh sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified After checking slurmdbd.log, I see: [2019-11-28T10:21:07.578] Accounting storage MYSQL plugi

Re: [slurm-users] Slurm 19.05: can not submit job

2019-11-28 Thread Ole Holm Nielsen
On 11/28/19 11:47 AM, Nguyen Dai Quy wrote: On Thu, Nov 28, 2019 at 11:20 AM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: On 11/28/19 10:35 AM, Nguyen Dai Quy wrote: > Hi list, > I can not submit my job: > > sbatch submit.sh > sba

Re: [slurm-users] (no subject)

2019-12-08 Thread Ole Holm Nielsen
Hi Dean, You may want to look at the links in my Slurm Wiki page. Both the official Slurm documentation and other resources are listed. I think most of your requirements and questions are described in these pages. My Wiki gives detailed deployment information for a CentOS 7 cluster, but mu

Re: [slurm-users] (no subject)

2019-12-08 Thread Ole Holm Nielsen
Forgot the link to the Wiki: https://wiki.fysik.dtu.dk/niflheim/SLURM On 12/8/19 9:18 PM, Ole Holm Nielsen wrote: Hi Dean, You may want to look at the links in my Slurm Wiki page.  Both the official Slurm documentation and other resources are listed.  I think most of your requirements and

Re: [slurm-users] Upgraded Slurm 17.02 to 19.05, now GRPTRESRunMin limits are applied incorrectly

2019-12-16 Thread Ole Holm Nielsen
Hi Mike, My showuserlimits tool nicely prints user limits from the Slurm database: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits Maybe this can give you further insights into the source of problems. /Ole On 16-12-2019 17:27, Renfro, Michael wrote: Hey, folks. I’ve j

Re: [slurm-users] [External] Re: Partition question

2019-12-19 Thread Ole Holm Nielsen
Some examples are here: https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting#quality-of-service-qos /Ole On 19-12-2019 19:30, Prentice Bisbal wrote: On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote: The simplest is probably to just have a separate partition that will only allow job times of 1

[slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-18 Thread Ole Holm Nielsen
When we have created a new Slurm user with "sacctmgr create user name=xxx", I would like to inquire at a later date about the timestamp for the user creation. As far as I can tell, the sacctmgr command cannot show such timestamps. I assume that the Slurm database contains the desired timestamp(?

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
Hi Jürgen, On 1/19/20 2:38 PM, Juergen Salk wrote: * Ole Holm Nielsen [200118 12:06]: When we have created a new Slurm user with "sacctmgr create user name=xxx", I would like to inquire at a later date about the timestamp for the user creation. As far as I can tell, the sacctmgr comm

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
and suggestions for improvement are welcome! /Ole On 1/18/20 12:06 PM, Ole Holm Nielsen wrote: When we have created a new Slurm user with "sacctmgr create user name=xxx", I would like to inquire at a later date about the timestamp for the user creation. As far as I can tell, the sacctm

Re: [slurm-users] How to print a user's creation timestamp from the Slurm database?

2020-01-20 Thread Ole Holm Nielsen
te -d "-45 days" +%m/%d/%y` I think this pretty nicely gives us the flexibility for listing transactions during some period into the past. /Ole On 1/20/20 11:29 AM, Ole Holm Nielsen wrote: Hi Jürgen, On 1/19/20 2:38 PM, Juergen Salk wrote: * Ole Holm Nielsen [200118 12:06]: When
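The date arithmetic in the command fragment above, a start date 45 days in the past formatted as MM/DD/YY, can be sketched in Python. This is only an illustration: the function name `start_date` and its parameters are hypothetical and not part of any Slurm tool, and how the resulting string is handed to Slurm is left to the thread itself.

```python
from datetime import date, timedelta


def start_date(days_back, today=None):
    """Return the date `days_back` days before `today` as MM/DD/YY.

    Hypothetical helper; it mirrors the output format that GNU date
    produces for: date -d "-45 days" +%m/%d/%y
    """
    if today is None:
        today = date.today()
    return (today - timedelta(days=days_back)).strftime("%m/%d/%y")


# Example call (the result depends on the current date):
#   start_date(45)
```

Passing `today` explicitly makes the helper reproducible, which is convenient when checking a reporting script against a known date.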

Re: [slurm-users] Question about slurm source code and libraries

2020-01-25 Thread Ole Holm Nielsen
On 24-01-2020 20:22, Dean Schulze wrote: Since there isn't a list for slurm development I'll ask here.  Does the slurm code include a library for making REST calls?  I'm writing a plugin that will make REST calls and if slurm already has one I'll use that, otherwise I'll find one with an approp

Re: [slurm-users] Print slurm cgroup parameters

2020-01-27 Thread Ole Holm Nielsen
On 27-01-2020 20:35, Mahmood Naderan wrote: Hi Is there any command to print current cgroup parameters or configurations that are used by Slurm? This works for me: # scontrol show config | tail -22 Cgroup Support Configuration: AllowedDevicesFile = /etc/slurm/cgroup_allowed_devices_file

Re: [slurm-users] Which ports does slurm use?

2020-02-07 Thread Ole Holm Nielsen
On 06-02-2020 22:40, Dean Schulze wrote: I've moved two nodes to a different controller.  The nodes are wired and the controller is networked via wifi.  I had to open up ports 6817 and 6818 between the wired and wireless sides of our network to get any connectivity. Now when I do srun -N2 ho
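Whether ports such as 6817 and 6818 are actually reachable across a firewall can be checked with a plain TCP connection test. The sketch below is a generic check in Python, not a Slurm utility; the function name `port_open` and the hostname in the usage comment are hypothetical. It only tests the ports it is given and says nothing about any other ports that Slurm commands might need.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


# Example usage ("controller.example.org" is a placeholder hostname):
#   for port in (6817, 6818):
#       print(port, port_open("controller.example.org", port))
```

A successful connection only shows that something is listening and that the path is open; it does not confirm that the listener is a Slurm daemon.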

Re: [slurm-users] Which ports does slurm use?

2020-02-10 Thread Ole Holm Nielsen
they work again. -Original Message- From: slurm-users On Behalf Of Ole Holm Nielsen Sent: Friday, February 7, 2020 2:34 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Which ports does slurm use? On 06-02-2020 22:40, Dean Schulze wrote: I've moved two nodes to a differ

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
On 2/17/20 11:16 AM, navin srivastava wrote: I have an issue with the Slurm job limit. I applied the MaxJobs limit on a user using sacctmgr modify user navin1 set maxjobs=3 but still I see this is not getting applied. I am still able to submit more jobs. Slurm version is 17.11.x Let me know

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
tava wrote: Hi, Thanks for your script. With this I am able to show the limit that I set, but this limit is not working. MaxJobs = 3, current value = 0 Regards Navin. On Mon, Feb 17, 2020 at 4:13 PM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: On 2/17/20

Re: [slurm-users] Job limit in slurm.

2020-02-17 Thread Ole Holm Nielsen
limit is set it should allow only 3 jobs at any point of time. Regards Navin. On Mon, Feb 17, 2020 at 4:48 PM Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote: Hi Navin, Why do you think the limit is not working? The MaxJobs limits the number of running jobs to

Re: [slurm-users] Cluster usage with Slurm

2020-02-17 Thread Ole Holm Nielsen
On 2/17/20 1:19 PM, Parag Khuraswar wrote: Hi Team, Does Slurm  provide cluster usage reports like mentioned below ? Detailed reports about cluster usage statistics. Reports of every user and jobs including their monthly usage, node usage, percentage of utilization, History tracking, number of
