[slurm-users] ubuntu 16.04 > 18.04
Thinking about upgrading to Ubuntu 18.04 on my workstation, where I am running a single-node slurm setup. Any issues anyone has run across in the upgrade? Thanks! ashton
[slurm-users] swap size
I have a single-node slurm config on my workstation (18 cores, 256 GB RAM, 40 TB disk space). I recently extended the array size to its current config and am reconfiguring my LVM logical volumes. I'm curious about people's thoughts on swap sizes for a node. Red Hat these days recommends up to 20% of RAM for swap, but no less than 4 GB. But according to the slurm FAQ: "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended." So I'm wondering if 20% is enough, or whether it should scale with the number of jobs I might be running at any one time. E.g. if I'm running 10 jobs that each use 20 GB of RAM and I suspend them, would I need 200 GB of swap? Any thoughts? -ashton
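A rough back-of-the-envelope check for that sizing question can be scripted; the job count and per-job memory below are placeholders for whatever is actually running, and free/swapon just report the current situation:

jobs=10; mem_per_job_gb=20
echo "worst-case suspended footprint: $((jobs * mem_per_job_gb)) GB"
free -h         # current RAM and swap usage
swapon --show   # active swap devices and their sizes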
Re: [slurm-users] swap size
Hi John! Thanks for the reply, lots to think about. In terms of suspending/resuming, my situation might be a bit different from other people's. As I mentioned, this is an install on a single-node workstation. This is my daily office machine. I run a lot of Python processing scripts that have low CPU needs but lots of iterations. I found it easier to manage these in slurm than to write MPI/parallel processing routines in Python directly. Given this, sometimes I might submit a slurm array with 10K jobs that might take a week to run, but I still sometimes need to do work during the day that requires more CPU power. In those cases I suspend the background array, crank through whatever I need to do and then resume in the evening when I go home. Sometimes I can wait for jobs to finish, sometimes I have to break in the middle of running jobs. On Fri, Sep 21, 2018, 10:07 PM John Hearns wrote: > Ashton, on a compute node with 256 Gbytes of RAM I would not > configure any swap at all. None. > I managed an SGI UV1 machine at an F1 team which had 1 Tbyte of RAM - > and no swap. > Also our ICE clusters were diskless - SGI very smartly configured swap > over iSCSI - but we disabled this, the reason being that if one node > in a job starts swapping the likelihood is that all the nodes are > swapping, and things turn to treacle from there. > Also, as another issue, if you have lots of RAM you need to look at > the vm tunings for dirty ratio, background ratio and centisecs. Linux > will aggressively cache data which is written to disk - you can get a > situation where your processes THINK data is written to disk but it is > cached, then what happens if there is a power loss? So get those > caches flushed often. > > https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ > > Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously > small on default Linux systems. I call this the 'wriggle room' when a > system is short on RAM. Think of it like those square sliding-letter > puzzles - min_free_kbytes is the empty square which permits the letter > tiles to move. > So look at your min_free_kbytes and increase it (if I'm not wrong, in > RH7 and CentOS 7 systems it is a reasonable value already). > https://bbs.archlinux.org/viewtopic.php?id=184655 > > Oh, and it is good to keep a terminal open with 'watch cat > /proc/meminfo'. I have spent many a happy hour staring at that when > looking at NFS performance etc. etc. > > Back to your specific case. My point is that for HPC work you should > never go into swap (with a normally running process, i.e. no job > pre-emption). I find that 20 percent rule is out of date. Yes, > probably you should have some swap on a workstation. And yes, disk > space is cheap these days. > > However, you do talk about job pre-emption and suspending/resuming > jobs. I have never actually seen that being used in production. > At this point I would be grateful for some education from the choir - > is this commonly used and am I just hopelessly out of date? > Honestly, anywhere I have managed systems, lower-priority jobs are > either allowed to finish, or in the case of F1 we checkpointed and > killed low-priority jobs manually if there was a super-high-priority > job to run.
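For reference, the dirty-cache and min_free_kbytes tunings mentioned above are plain sysctls; the values below are purely illustrative, not recommendations for any particular machine:

# inspect the current values
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs vm.min_free_kbytes
# example settings: start background writeback earlier and keep more free headroom
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_writeback_centisecs=100
sudo sysctl -w vm.min_free_kbytes=262144
# watch the effect while jobs run
watch cat /proc/meminfo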
Re: [slurm-users] swap size
Ray, I'm also on Ubuntu. I'll try the same test, but do it with and without swap on (e.g. by running the swapoff and swapon commands first). To complicate things, I also don't know if the swappiness level makes a difference. Thanks Ashton On Sun, Sep 23, 2018, 7:48 AM Raymond Wan wrote: > Hi Chris, > On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote: > > On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote: > >> SLURM's ability to suspend jobs must be storing the state in a > >> location outside of this 512 GB. So, you're not helping this by > >> allocating more swap. > > > > I don't believe that's the case. My understanding is that in this mode it's > > just sending processes SIGSTOP and then launching the incoming job, so you > > should really have enough swap for the previous job to get swapped out to in > > order to free up RAM for the incoming job. > > Hmm, I'm way out of my comfort zone but I am curious > about what happens. Unfortunately, I don't think I'm able > to read kernel code, but someone here > (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel) > seems to suggest that SIGSTOP and SIGCONT move a process > between the runnable and waiting queues. > > I'm not sure if I did the correct test, but I wrote a C > program that allocated a lot of memory: > > -
> #include <stdlib.h>
>
> #define memsize 16000
>
> int main () {
>   char *foo = NULL;
>
>   foo = (char *) malloc (sizeof (char) * memsize);
>
>   for (int i = 0; i < memsize; i++) {
>     foo[i] = 0;
>   }
>
>   do {
>   } while (1);
> }
> -
> > Then, I ran it and sent a SIGSTOP to it. According to htop > (I don't know if it's correct), it seems to still be > occupying memory, but just not any CPU cycles. > > Perhaps I've done something wrong? I did read elsewhere > that how SIGSTOP is treated can vary from system to > system... I happen to be on an Ubuntu system. > > Ray
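For anyone repeating Ray's test, the stop/continue cycle and the memory check can be driven from the shell; 'memtest' is just a placeholder name for the compiled program above:

gcc -o memtest memtest.c      # assuming the listing above is saved as memtest.c
./memtest &
pid=$!
kill -STOP "$pid"             # the same signal Slurm's suspend uses
grep -E 'State|VmRSS|VmSwap' /proc/"$pid"/status   # State becomes "T (stopped)"; VmRSS stays resident unless the kernel swaps it out
kill -CONT "$pid"
kill "$pid"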
Re: [slurm-users] Priority wait
I'm guessing you should have sent them to cluster Decepticon instead. In all seriousness though, provide the conf file. You might have accidentally set a maximum number of running jobs somewhere. On Nov 13, 2017 7:28 AM, "Benjamin Redling" wrote: > Hi Roy, > > On 11/13/17 2:37 PM, Roe Zohar wrote: > [...] > >> I sent 3000 jobs with feature Optimus and part are running while part are >> pending. Which is ok. >> But I have sent 1000 jobs to Megatron and they are all pending, stating >> they wait because of priority. Why is that? >> >> B.t.w. if I change their priority to a higher one, they start to run on >> Megatron. >> > > my guess is: if you can provide the slurm.conf of that cluster, the > probability anyone will sacrifice his spare time for you will increase > significantly. > > Regards, > Benjamin > -- > FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html > ☎ +49 3641 9 44323
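If it helps, the usual places where such a cap would hide can be checked like this (a sketch; the account and QOS names depend on the site):

scontrol show config | grep -i maxjobcount
sacctmgr show assoc format=cluster,account,user,partition,maxjobs,maxsubmit,grptres
sacctmgr show qos format=name,maxjobspu,maxsubmitpu,grpjobs,grptres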
[slurm-users] Changing node weights in partitions
Dear all, I would like to create two partitions, A and B, in which node1 has a certain weight in partition A and a different one in partition B. Does anyone know how to implement it? Thanks very much for the help! Cheers, José
Re: [slurm-users] Changing node weights in partitions
Dear Ole, Thanks for your fast reply. I really appreciate that. I had a look at your website and googled about “weight masks” but still have some questions. From your example I see that the mask definition is commented out. How do I define what the mask means? If it helps, I’ll put an easy example. Node1 has more RAM and clock frequency than node2. Partition A should start filling node1, while partition B should start filling node2. Can I accomplish this behavior by weighting the nodes? With your example I’m afraid to say it’s still not clear to me how. Thanks a lot for your help. José > On 22. Mar 2019, at 16:29, Ole Holm Nielsen > wrote: > >> On 3/22/19 4:15 PM, José A. wrote: >> Dear all, >> I would like to create two partitions, A and B, in which node1 had a certain >> weight in partition A and a different one in partition B. Does anyone know >> how to implement it? > > Some pointers to documentation of this and a practical example are in my Wiki > page: > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-weight > > /Ole >
Re: [slurm-users] Changing node weights in partitions
Hello Chris, You got my point. I want a way in which a partition influences the priority with which a node takes new jobs. Any tip will be really appreciated. Thanks a lot. Cheers, José > On 23. Mar 2019, at 03:38, Chris Samuel wrote: > >> On 22/3/19 12:51 pm, Ole Holm Nielsen wrote: >> >> The web page explains how the weight mask is defined: Each digit in the mask >> defines a node property. Please read the example given. > > I don't think that's what José is asking for, he wants the weights for a node > to be different when being considered in one partition to when it's being > considered in a different partition. > > I don't think you can do that though I'm afraid, José, I think the weight is > only attached to the node and the partition doesn't influence it. > > All the best, > Chris > -- > Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA >
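For completeness, this is what the node-level weight looks like in slurm.conf; the node names and values are made up, and, as noted above, the same weight applies in every partition the node belongs to (lower-weight nodes are filled first):

NodeName=node1 Weight=10 CPUs=32 RealMemory=256000
NodeName=node2 Weight=20 CPUs=32 RealMemory=128000
PartitionName=A Nodes=node1,node2 Default=YES State=UP
PartitionName=B Nodes=node1,node2 State=UP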
Re: [slurm-users] Changing node weights in partitions
Dear Ole, Thanks for the support. I think that could help me in the following way: 1. Setting different partitions with the node groups I want to prioritize. 2. Allowing users to submit to several partitions at the same time. 3. Through accounting, creating accounts with different priorities from one partition to another. That will allow each job type, associated with an account, to start differently in different partitions. 4. Once a job starts in one partition, the other submitted jobs are killed and removed from SLURM. It’s a bit more work but gets the effect I am looking for: that different nodes prioritize different types of jobs. Is that, especially step 4, possible? Thanks for the help. José > On 24. Mar 2019, at 21:52, Ole Holm Nielsen > wrote: > > Hi José, > >> On 23-03-2019 19:59, Jose A wrote: >> You got my point. I want a way in which a partition influences the priority >> with a node takes new jobs. >> Any tip will be really appreciated. Thanks a lot. > > Would PriorityWeightPartition as defined with the Multifactor Priority Plugin > (https://slurm.schedmd.com/priority_multifactor.html) help you? > > See also my summary in > https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler#multifactor-priority-plugin-scheduler > > /Ole >
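A related note on step 4: when a single job is submitted with a comma-separated list of partitions, Slurm runs it in whichever listed partition can start it first, so there are no leftover sibling jobs to kill. That may give much of the desired effect without the cleanup step; a sketch with placeholder partition names:

sbatch --partition=A,B job.sh
# or inside the job script:
#SBATCH --partition=A,B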
[slurm-users] SLURM in Virtual Machine
Dear all, In the expansion of our cluster we are considering installing SLURM within a virtual machine in order to simplify updates and reconfigurations. Do any of you have experience running SLURM in VMs? I would really appreciate it if you could share your ideas and experiences. Thanks a lot. Cheers José
Re: [slurm-users] SLURM in Virtual Machine
Dear all, thank you for your fast feedback. My initial idea was to run slurmctld and slurmdbd in their own KVM guests while keeping the worker nodes physical. From what I see, that is a setup that works without problems. However, I also find interesting some of the suggestions that you mentioned, like having worker nodes in VMs for testing and compilation purposes, or even the login node. I will give some thought to that. Thanks a lot for the support. You are great. -- José On 12. September 2019 at 19:46:36, Brian Andrus (toomuc...@gmail.com) wrote: Well, technically I have run several clusters all in VMs because it is all in the cloud. I think the main issue would be how resources are allocated and the need. Given the choice, I would not run nodes in VMs because the hypervisor inherently adds overhead that could be used for compute. However, there are definite use cases that make it worthwhile. So long as you allocate enough resources for the node (be it the controller or other) you will be fine. Brian Andrus
[slurm-users] srun seg faults immediately from within sbatch but not salloc
Dear all, I am trying to set up a small cluster running slurm on Ubuntu 16.04. I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine. Munge is taken from the system package. Something like this: ./configure --prefix=/software/slurm/slurm-17.11.5 --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr --sysconfdir=/software/slurm/etc One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is there also if this is not the case). I start daemons manually at the moment (slurmctld first). My configuration file looks like this (I removed the node-specific parts): SlurmdUser=root # AuthType=auth/munge # Epilog=/usr/local/slurm/etc/epilog FastSchedule=1 JobCompLoc=/var/log/slurm/slurm.job.log JobCompType=jobcomp/filetxt JobCredentialPrivateKey=/usr/local/etc/slurm.key JobCredentialPublicCertificate=/usr/local/etc/slurm.cert #PluginDir=/usr/local/slurm/lib/slurm # Prolog=/usr/local/slurm/etc/prolog SchedulerType=sched/backfill SelectType=select/linear SlurmUser=cadmin # this user exists everywhere SlurmctldPort=7002 SlurmctldTimeout=300 SlurmdPort=7003 SlurmdTimeout=300 SwitchType=switch/none TreeWidth=50 # # logging StateSaveLocation=/var/log/slurm/tmp SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h # # job settings MaxTasksPerNode=64 MpiDefault=pmix_v2 # plugins TaskPlugin=task/cgroup There are no prolog or epilog scripts. After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes). sinfo produces expected results. However, as soon as I try to submit through sbatch I get an instantaneous seg fault regardless of executable (even when there is none specified, i.e., the srun command is meaningless). 
When I try to monitor slurmd in the foreground (-D), I get something like this:

slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 1001
slurmd: debug: credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008

Here, job 100 would be a submission script with something like:

#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
#SBATCH
Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc
Dear all, I tried to debug this with some apparent success (for now). If anyone cares: with the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp. I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit: static void _setup_env_working_cluster(void). With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not language-compliant, I would think?). My current understanding is that this is a slurm bug. The issue is rectifiable by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way. Hope this helps, Andreas
Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc
Hi Benjamin, thanks for getting back to me! I somehow failed to ever arrive at this page. Andreas -"slurm-users" wrote: - To: slurm-users@lists.schedmd.com From: Benjamin Matthews Sent by: "slurm-users" Date: 05/09/2018 01:20AM Subject: Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc I think this should already be fixed in the upcoming release. See: https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72
[slurm-users] Seff error with Slurm-18.08.1
Hi all, I have upgraded my Slurm installation from version 17.11.0 to 18.08.1. With the previous version, 17.11.0, the seff tool was working fine, but with the 18.08.1 version, when I try to run seff I receive the following error message:

# ./seff
perl: error: plugin_load_from_file: dlopen(/usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so): /usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
perl: error: plugin_load_from_file: dlopen(/usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so): /usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
Job not found.
#

Both Slurm installations have been compiled from source on the same computer, but only the seff that was compiled with the 17.11.0 version works fine. To compile the seff tool, from the source Slurm tree:

cd contrib
make
make install

I think the problem is in the perlapi. Could it be a bug? Any idea how I can fix this problem? Thanks a lot. -- Miguel A. Sánchez Gómez System Administrator Research Programme on Biomedical Informatics - GRIB (IMIM-UPF) Barcelona Biomedical Research Park (office 4.80) Doctor Aiguader 88 | 08003 Barcelona (Spain) Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550 e-mail: miguelangel.sanc...@upf.edu
Re: [slurm-users] Seff error with Slurm-18.08.1
Hi, and thanks for all your answers, and sorry for the delay in my reply. Yesterday I installed Slurm 18.08.3 on the controller machine to check whether the seff command works fine with this latest release. The behavior has improved but I still receive an error message:

# /usr/local/slurm-18.08.3/bin/seff 1694112
*Use of uninitialized value $lmem in numeric lt (<) at /usr/local/slurm-18.08.3/bin/seff line 130, line 624.*
Job ID: 1694112
Cluster: X
User/Group: X
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:39:33
CPU Efficiency: 4266.43% of 00:02:20 core-walltime
Job Wall-clock time: 00:01:10
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
[root@hydra ~]#

And due to this problem, every job reports a memory utilization of 0.00 MB. With slurm-17.11.0 it works fine:

# /usr/local/slurm-17.11.0/bin/seff 1694112
Job ID: 1694112
Cluster: X
User/Group: X
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:39:33
CPU Efficiency: 4266.43% of 00:02:20 core-walltime
Job Wall-clock time: 00:01:10
Memory Utilized: 2.44 GB
Memory Efficiency: 62.57% of 3.91 GB
[root@hydra bin]#

Miguel A. Sánchez Gómez System Administrator Research Programme on Biomedical Informatics - GRIB (IMIM-UPF) Barcelona Biomedical Research Park (office 4.80) Doctor Aiguader 88 | 08003 Barcelona (Spain) Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550 e-mail: miguelangel.sanc...@upf.edu On 11/06/2018 06:30 PM, Mike Cammilleri wrote: > Thanks for this. We'll try the workaround script. It is not > mission-critical but our users have gotten accustomed to seeing these > metrics at the end of each run and it's nice to have. We are currently > doing this in a test VM environment, so by the time we actually do the > upgrade to the cluster perhaps the fix will be available then. > > Mike Cammilleri > Systems Administrator > Department of Statistics | UW-Madison > 1300 University Ave | Room 1280 > 608-263-6673 | mi...@stat.wisc.edu > > *From:* slurm-users on behalf > of Chris Samuel > *Sent:* Tuesday, November 6, 2018 5:03 AM > *To:* slurm-users@lists.schedmd.com > *Subject:* Re: [slurm-users] Seff error with Slurm-18.08.1 > > On 6/11/18 7:49 pm, Baker D.J. wrote: > > > The good news is that I am assured by SchedMD that the bug has been > fixed > > in v18.08.3. > > Looks like it's fixed in this commit. > > commit 3d85c8f9240542d9e6dfb727244e75e449430aac > Author: Danny Auble > Date: Wed Oct 24 14:10:12 2018 -0600 > > Handle symbol resolution errors in the 18.08 slurmdbd. > > Caused by b1ff43429f6426c when moving the slurmdbd agent internals. > > Bug 5882. > > > > Having said that we will probably live with this issue > > rather than disrupt users with another upgrade so soon. > > An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though, > should it? We just flip a symlink and the users see the new binaries, > libraries, etc. immediately, and we can then restart daemons as and when we > need to (in the right order of course: slurmdbd, slurmctld and then the > slurmd's). > > All the best, > Chris > -- > Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC >
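Until seff is fixed, the same numbers can be pulled straight from the accounting records with sacct (assuming job accounting gather is enabled); for example, for the job shown above:

sacct -j 1694112 --format=JobID,JobName,Elapsed,TotalCPU,NCPUS,MaxRSS,ReqMem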
Re: [slurm-users] Seff error with Slurm-18.08.1
Oh, thanks Paddy for your patch, it works very well!! Miguel A. Sánchez Gómez System Administrator Research Programme on Biomedical Informatics - GRIB (IMIM-UPF) Barcelona Biomedical Research Park (office 4.80) Doctor Aiguader 88 | 08003 Barcelona (Spain) Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550 e-mail: miguelangel.sanc...@upf.edu On 11/09/2018 07:59 AM, Marcus Wagner wrote: > Thanks Paddy, > > just something learned again ;) > > Best > Marcus > > On 11/08/2018 05:07 PM, Paddy Doyle wrote: >> Hi all, >> >> It looks like we can use the API to avoid having to manually parse >> the '2=' value from the stats{tres_usage_in_max} value. >> >> I've submitted a bug report and patch: >> >> https://bugs.schedmd.com/show_bug.cgi?id=6004 >> >> The minimal changes needed would be in the attached seff.patch. >> >> Hope that helps, >> >> Paddy >> >> On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote: >> >>> Hi Miguel, >>> >>> this is because SchedMD changed the stats field. There exists no more >>> rss_max, cmp. line 225 of seff. >>> You need to evaluate the field stats{tres_usage_in_max}, and there the value >>> after '2=', but this is the memory value in bytes instead of kbytes, so this >>> should be divided by 1024 additionally. >>> >>> Best >>> Marcus
[slurm-users] slurm password - what is the impact when changing it
Is there any issue if I set/change the slurm account password? I'm running 19.05.x. The current state is locked, but I have to reset it periodically:

# passwd --status slurm
slurm LK 2014-02-03 -1 -1 -1 -1 (Password locked.)

Best Regards, RB
[slurm-users] Memory per CPU
I am working on my first ever SLURM cluster build for use as a resource manager in a JupyterHub development environment. I have configured the cluster with a SelectType of 'select/cons_res' and DefMemPerCPU and MaxMemPerCPU of 16 GB. The idea is to essentially provide for jobs that run in 1 CPU/16 GB chunks. This is a starting point for us. What I am seeing is that when users submit jobs and ask for memory only - in this case, 16 GB - SLURM actually allocates 2 CPUs, not the 1 that I would expect. Is my understanding of how this particular configuration works incorrect?
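For reference, one way to make the intended one-CPU/16 GB shape explicit in the job script is to request the memory per CPU rather than per job; a sketch, not necessarily a fix for the behaviour described above:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=16G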
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
The following is the pertinent information for our cluster and the job run. Note: server names, IP addresses and user IDs are anonymized.

Slurm.conf
==
TaskPlugin=task/affinity
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# Memory Management
DefMemPerCPU=16384
MaxMemPerCPU=16384
NodeName=linuxnode1 NodeAddr=99.999.999.999 CPUs=4 RealMemory=49152 State=UNKNOWN
NodeName=linuxnode2 NodeAddr=99.999.999.999 CPUs=4 RealMemory=49152 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Job SBATCH file
#!/bin/bash
#SBATCH --job-name=HadoopTest # Job name
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=** # Where to send mail
#SBATCH --mem=16gb # Job memory request
#SBATCH --time=08:00:00 # Time limit hrs:min:sec
#SBATCH --output=logs/slurm_test_%j.log # Standard output and error log
pwd; hostname; date
echo "Running sbatch-HadoopTest script"
kinit
cd /projects
python HiveValidation.py
python ImpalaValidation.py
python SparkTest.py
date

scontrol output
===
JobId=334 JobName=HadoopTest
UserId=** GroupId=** MCS_label=N/A
Priority=4294901604 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=08:00:00 TimeMin=N/A
SubmitTime=2020-09-29T10:40:09 EligibleTime=2020-09-29T10:40:09
AccrueTime=2020-09-29T10:40:09
StartTime=2020-09-29T10:40:10 EndTime=2020-09-29T18:40:10 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-29T10:40:10
Partition=debug AllocNode:Sid=lpae138a:41279
ReqNodeList=(null) ExcNodeList=(null)
NodeList=linuxnode2 BatchHost=linuxnode2
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=16G,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=16G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/projects/sbatch-HadoopTest.sh
WorkDir=/projects
StdErr=/projects/logs/slurm_test_334.log
StdIn=/dev/null
StdOut=/projects/logs/slurm_test_334.log
Power=
MailUser=** MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Michael Di Domenico
Sent: Tuesday, September 29, 2020 10:20 AM
To: Slurm User Community List
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU

what leads you to believe that you're getting 2 CPU's instead of 1? 'scontrol show job ' would be a helpful first start.
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
There are a few pieces of information that may prove useful: 1 - these are VMs and not physical servers; 2 - the OS is RedHat 7.8; 3 - as far as I can tell, hyperthreading is not enabled, but I will check for sure; 4 - when we ask for 15 GB of memory, we only get 1 CPU.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Michael Di Domenico
Sent: Tuesday, September 29, 2020 10:20 AM
To: Slurm User Community List
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU

what leads you to believe that you're getting 2 CPU's instead of 1? 'scontrol show job ' would be a helpful first start.
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
First off, I want to thank everyone for their input and suggestions. They were very helpful and ultimately pointed me in the right direction. I spent several hours playing around with various settings. Some additional background: when the srun command is used to execute this job, we do not see this issue. We only see it with sbatch. What I ultimately did was the following: 1 - changed the NodeName definitions to add the specific Sockets, Cores and Threads parameters; 2 - changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 respectively. I tested jobs after the above changes and used the 'scontrol --defaults job ' command. The CPU allocation now works as expected. I do have one question though - what is the benefit/recommendation of using srun to execute a process within an sbatch script? We are running primarily Python jobs, but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List ; Michael Di Domenico
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU

On 29/09/20 16:19, Michael Di Domenico wrote: > what leads you to believe that you're getting 2 CPU's instead of 1? I think I saw that too, once, but thought it was related to hyperthreading. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786
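For anyone hitting the same thing, the node definition change in step 1 looks roughly like this; the topology values are illustrative rather than the poster's actual hardware, the point being that with ThreadsPerCore=1 an allocated CPU corresponds to one core:

NodeName=linuxnode1 NodeAddr=99.999.999.999 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=49152 State=UNKNOWN
NodeName=linuxnode2 NodeAddr=99.999.999.999 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=49152 State=UNKNOWN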
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
So just to confirm, there is no inherent issue with using srun within an SBATCH file?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Ryan Novosielski
Sent: Wednesday, September 30, 2020 10:01 AM
To: Slurm User Community List
Subject: Re: [slurm-users] EXTERNAL: Re: Memory per CPU

Primary one I'm aware of is that resource use is better reported (or at all, in some cases) via srun, and srun can take care of MPI for an MPI job. I'm sure there are others as well (I guess avoiding another place where you have to describe the resources to be used and making sure they match, in the case of mpirun, etc.). -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'
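To make that concrete, a minimal version of the earlier batch script with each program launched through srun, so every step is tracked and reported separately in the accounting (paths and script names as in the earlier post):

#!/bin/bash
#SBATCH --job-name=HadoopTest
#SBATCH --ntasks=1 --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=08:00:00
#SBATCH --output=logs/slurm_test_%j.log
cd /projects
srun python HiveValidation.py    # each srun launches an accounted job step
srun python ImpalaValidation.py
srun python SparkTest.py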
Re: [slurm-users] Quick hold on all partitions, all jobs
In your situation, where you're blocking user access to the login node, it probably doesn't matter. We use DOWN in most events, as INACTIVE would prevent new jobs from being queued against the partition at all. DOWN allows the jobs to be queued, and just doesn't permit them to run. (In either case, HOLDing PENDING jobs is redundant.) ~jonathon From: slurm-users on behalf of Lachlan Musicman Sent: Wednesday, November 8, 2017 5:00:12 PM To: Slurm User Community List Subject: [slurm-users] Quick hold on all partitions, all jobs The IT team sent an email saying "complete network wide network outage tomorrow night from 10pm across the whole institute". Our plan is to put all queued jobs on hold, suspend all running jobs, and turn off the login node. I've just discovered that the partitions have a state, and it can be set to UP, DOWN, DRAIN or INACTIVE. In this situation - most likely a 4-hour outage with nothing else affected - would you mark your partitions DOWN or INACTIVE? Ostensibly all users should be off the systems (because no network), but there's always one that sets an at or cron job or finds that corner case. Cheers L. -- "The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consciousness that our institutions have failed and our ecosystem is collapsing, yet we are still here — and we are creative agents who can shape our destinies. Apocalyptic civics is the conviction that the only way out is through, and the only way through is together." Greg Bloom @greggish https://twitter.com/greggish/status/873177525903609857
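For the record, the mechanics are just scontrol updates; 'prod' is a placeholder partition name, and the loops assume every running job really should be suspended:

scontrol update PartitionName=prod State=DOWN        # queued jobs stay queued, nothing new starts
squeue -h -t RUNNING -o %A | xargs -r -n1 scontrol suspend
# ...after the outage...
squeue -h -t SUSPENDED -o %A | xargs -r -n1 scontrol resume
scontrol update PartitionName=prod State=UP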
[slurm-users] slurm.spec-legacy - how to invoke
Can someone provide an example of using the rpmbuild command while specifying the slurm.spec-legacy file? I need to build the new version of slurm for RHEL6 and need to invoke the slurm.spec-legacy file (if possible) on this command line: # rpmbuild -tb slurm-17.11.1.tar.bz2 Regards, Ruth R. Braun Sr IT Analyst HPC
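As far as I know, rpmbuild -tb always uses the slurm.spec bundled in the tarball, so one workaround is to build from the legacy spec file directly; a sketch, with the paths and the in-tarball location of the spec assumed rather than verified:

# put the tarball where rpmbuild expects sources, pull out just the legacy spec, build from it
cp slurm-17.11.1.tar.bz2 ~/rpmbuild/SOURCES/
tar -xjf slurm-17.11.1.tar.bz2 slurm-17.11.1/slurm.spec-legacy
rpmbuild -bb slurm-17.11.1/slurm.spec-legacy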
Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
We put SSSD caches on a RAMDISK which helped a little bit with performance. - On 22 Jan, 2018, at 02:38, Alessandro Federico a.feder...@cineca.it wrote: | Hi John, | | just an update... | we not have a solution for the SSSD issue yet, but we changed the ACL | on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the | slowdown has gone. | | Thanks for the help | ale | | - Original Message - |> From: "Alessandro Federico" |> To: "John DeSantis" |> Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List" |> , "Isabella Baccarelli" |> |> Sent: Wednesday, January 17, 2018 5:41:54 PM |> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv |> operation |> |> Hi John |> |> thanks for the infos. |> We are investigating the slowdown of sssd and I found some bug |> reports regarding slow sssd query |> with almost the same backtrace. Hopefully an update of sssd could |> solve this issue. |> |> We'll let you know if we found a solution. |> |> thanks |> ale |> |> - Original Message - |> > From: "John DeSantis" |> > To: "Alessandro Federico" |> > Cc: "Slurm User Community List" , |> > "Isabella Baccarelli" , |> > hpc-sysmgt-i...@cineca.it |> > Sent: Wednesday, January 17, 2018 3:30:43 PM |> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on |> > send/recv operation |> > |> > Ale, |> > |> > > As Matthieu said it seems something related to SSS daemon. |> > |> > That was a great catch by Matthieu. |> > |> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL |> > |> > Correct, which may seem negligent, but after each `scontrol |> > reconfigure`, slurmctld restart, and/or AllowGroups= partition |> > update, |> > the mapping of UID's for each group will be updated. |> > |> > > So why does the UID-GID mapping take so long? |> > |> > We attempted to use "AllowGroups" previously, but we found (even |> > with |> > sssd.conf tuning regarding enumeration) that unless the group was |> > local |> > (/etc/group), we were experiencing delays before the AllowGroups |> > parameter was respected. This is why we opted to use SLURM's |> > AllowQOS/AllowAccounts instead. |> > |> > You would have to enable debugging on your remote authentication |> > software to see where the hang-up is occurring (if it is that at |> > all, |> > and not just a delay with the slurmctld). |> > |> > Given the direction that this is going - why not replace the |> > "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="? |> > |> > > @John: we defined many partitions on the same nodes but in the |> > > production cluster they will be more or less split across the 6K |> > > nodes. |> > |> > Ok, that makes sense. Looking initially at your partition |> > definitions, |> > I immediately thought of being DRY, especially since the "finer" |> > tuning |> > between the partitions could easily be controlled via the QOS' |> > allowed |> > to access the resources. 
|> > |> > John DeSantis |> > |> > On Wed, 17 Jan 2018 13:20:49 +0100 |> > Alessandro Federico wrote: |> > |> > > Hi Matthieu & John |> > > |> > > this is the backtrace of slurmctld during the slowdown |> > > |> > > (gdb) bt |> > > #0 0x7fb0e8b1e69d in poll () from /lib64/libc.so.6 |> > > #1 0x7fb0e8617bfa in sss_cli_make_request_nochecks () |> > > from /lib64/libnss_sss.so.2 #2 0x7fb0e86185a3 in |> > > sss_nss_make_request () from /lib64/libnss_sss.so.2 #3 |> > > 0x7fb0e8619104 in _nss_sss_getpwnam_r () |> > > from /lib64/libnss_sss.so.2 #4 0x7fb0e8aef07d in |> > > getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6 #5 |> > > 0x7fb0e9360986 in _getpwnam_r (result=, |> > > bufsiz=, buf=, pwd=, |> > > name=) at uid.c:73 #6 uid_from_string |> > > (name=0x1820e41 |> > > "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111 #7 |> > > 0x0043587d in get_group_members (group_name=0x10ac500 |> > > "g2") |> > > at groups.c:139 #8 0x0047525a in _get_groups_members |> > > (group_names=) at partition_mgr.c:2006 #9 |> > > 0x00475505 in _update_part_uid_access_list |> > > (x=0x7fb03401e650, |> > > arg=0x7fff07f13bf4) at partition_mgr.c:1930 #10 |>
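For reference, the change ale describes is a one-line edit per partition in slurm.conf followed by a reconfigure; the partition, node, and account names here are illustrative:

PartitionName=p2 Nodes=node[001-100] AllowGroups=g2      # before
PartitionName=p2 Nodes=node[001-100] AllowAccounts=g2    # after
scontrol reconfigure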
[slurm-users] Areas for improvement on our site's cluster scheduling
We have two main issues with our scheduling policy right now. The first is an issue that we call "queue stuffing." The second is an issue with interactive job availability. We aren't confused about why these issues exist, but we aren't sure the best way to address them. I'd love to hear any suggestions on how other sites address these issues. Thanks for any advice! ## Queue stuffing We use multifactor scheduling to provide account-based fairshare scheduling as well as standard fifo-style job aging. In general, this works pretty well, and accounts meet their scheduling targets; however, every now and again, we have a user who has a relatively high-throughput (not HPC) workload that they're willing to wait a significant period of time for. They're low-priority work, but they put a few thousand jobs into the queue, and just sit and wait. Eventually the job aging makes the jobs so high-priority, compared to the fairshare, that they all _as a set_ become higher-priority than the rest of the work on the cluster. Since they continue to age as the other jobs continue to age, these jobs end up monopolizing the cluster for days at a time, as their high volume of relatively small jobs use up a greater and greater percentage of the machine. In Moab I'd address this by limiting the number of jobs the user could have *eligible* at any given time; but it appears that the only option for slurm is limiting the number of jobs a user can *submit*, which isn't as nice a user experience and can lead to some pathological user behaviors (like users running cron jobs that wake repeatedly and submit more jobs automatically). ## Interactive job availability I'm becoming increasingly convinced that holding some portion of our resource aside as dedicated for relatively short, small, interactive jobs is a unique good; but I'm not sure how best to implement it. My immediate thought was to use a reservation with the DAILY and REPLACE flags. I particularly like the idea of using the REPLACE flag here as we could keep a flexible amount of resources available irrespective of how much was actually being used for the purpose at any given time; but it doesn't appear that there's any way to limit the per-user use of resources *within* a reservation; so if we created such a reservation and granted all users access to it, any individual user would be capable of consuming all resources in the reservation anyway. I'd have a dedicated "interactive" qos or similar to put such restrictions on; but there doesn't appear to be a way to then limit the use of the reservation to only jobs with that qos. (Aside from job_submit scripts or similar. Please correct me if I'm wrong.) In lieu of that, I'm leaning towards having a dedicated interactive partition that we'd manually move some resources to; but that's a bit less flexible.
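Two sketches that map onto these two problems, assuming accounting limits are enforced; all names and numbers are illustrative. MaxJobsAccruePerUser (available in recent Slurm releases) caps how many of a user's pending jobs accrue age priority, which is close to Moab's notion of an eligible-job limit, and a floating reservation with the REPLACE flag keeps a pool of nodes free for interactive work:

# queue stuffing: only this many pending jobs per user accumulate age priority
sacctmgr modify qos normal set MaxJobsAccruePerUser=8
# interactive carve-out: a reservation that Slurm keeps topped up with idle nodes
scontrol create reservation ReservationName=interactive StartTime=now Duration=UNLIMITED Flags=REPLACE NodeCnt=4 Users=alice,bob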
Re: [slurm-users] Getting nodes in a partition
Does sinfo -h -O nodehost -p partition | sort help? Also scontrol show hostnames <nodelist>, where nodelist is compute-0-[4-6], would work. Regards Henk -Original Message- From: slurm-users On Behalf Of Mahmood Naderan Sent: 18 May 2018 08:12 To: Slurm User Community List Subject: [slurm-users] Getting nodes in a partition Hi, Is there any slurm variable to read the node names of a partition? There is an MPI option, --hostfile, to which we can write the node names. I want to use something like this in the sbatch script: #SBATCH --partition=MYPART ... --hostfile $SLURM_NODES_IN_PARTITION I can manually manipulate the output of scontrol to extract node names. Like this: [mahmood@rocks7 ~]$ scontrol show partition MYPART | grep -w Nodes | cut -d '=' -f 2 compute-0-[4-6] But that output is the compressed form, not compute-0-4 compute-0-5 compute-0-6, so I have to post-process it further. Any better idea? Regards, Mahmood
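Putting Henk's two suggestions into a script context (the partition name comes from the post, the file name is illustrative):

# one hostname per line for every node in the partition
sinfo -h -N -o '%N' -p MYPART | sort -u > hostfile.txt
# or expand a compressed hostlist expression
scontrol show hostnames "compute-0-[4-6]"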
[slurm-users] allocate more resources for a current interactive job
Dear Slurm users, Is it possible to allocate more resources for a current job on an interactive shell? I just allocate (by default) 1 core and 2Gb RAM: srun -I -p main --pty /bin/bash The node and queue where the job is located has 120 Gb and 4 cores available. I just want to use more cores and more RAM for such shell session without having to terminate it. Thanks in advance
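Growing an allocation that is already running is not straightforward; if starting a fresh shell is acceptable, the simplest route is to request the resources up front (values illustrative):

srun -I -p main --cpus-per-task=4 --mem=32G --pty /bin/bash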
[slurm-users] Slurm Environment Variable for Memory
Dear Community, does anyone know whether there is an environment variable, such as $SLURM_CPUS_ON_NODE, but for the requested RAM (by using --mem argument)? Thanks
Re: [slurm-users] Slurm Environment Variable for Memory
Somehow that variable does not exist in my environment. Is it possible my Slurm version (17.02.3) does not include it? Thanks On 17/08/18 11:04, Bjørn-Helge Mevik wrote: Yes. It is documented in sbatch(1): SLURM_MEM_PER_CPU (same as --mem-per-cpu) and SLURM_MEM_PER_NODE (same as --mem).
Re: [slurm-users] Slurm Environment Variable for Memory
I am just running an interactive job with "srun -I --pty /bin/bash" and then running "echo $SLURM_MEM_PER_NODE", but it shows nothing. Does it have to be defined in any conf file? On 20/08/18 09:59, Chris Samuel wrote: On Monday, 20 August 2018 4:43:57 PM AEST Juan A. Cordero Varelaq wrote: Somehow that variable does not exist in my environment. Is it possible my Slurm version (17.02.3) does not include it? They should be there, from the NEWS file they were introduced in 2.3.0.rc1. Is something else nuking your shell's environment perhaps? 17.02.11 is the last released version of 17.02.x and all previous versions have been pulled from the SchedMD website due to CVE-2018-10995. cheers, Chris
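One thing worth ruling out (an assumption on my part, not something confirmed in this thread): the variable mirrors the job's memory request, so it may only be set when memory is requested explicitly, e.g.:

srun --mem=4G --pty /bin/bash
echo $SLURM_MEM_PER_NODE   # if the variable is set, this should print the per-node memory in MB (e.g. 4096)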
[slurm-users] Installing GPU Features of Slurm 20
Hi! I'm trying to install Slurm 20.02 on my cluster with the GPU features. However, only my compute nodes have GPUs attached and so when I try to install the slurm-slurmctld RPM on my head node it fails saying it requires the NVIDIA control software. How do other folks work around this? Do you have one RPM built for the control node and one for the compute nodes? Do you just force the install on the control node? Thanks!
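One common pattern, sketched here rather than prescribed: build two RPM sets, one with NVML for the GPU nodes and one without for the controller, or simply skip the dependency check on the head node. The --with nvml conditional is an assumption about your slurm.spec, so check the spec you are actually building from:

# on a build host that has the NVIDIA libraries (RPMs for the compute nodes)
rpmbuild -ta slurm-20.02.*.tar.bz2 --with nvml
# a second set without NVML for the slurmctld/login hosts
rpmbuild -ta slurm-20.02.*.tar.bz2
# or, quick and dirty, install the controller package ignoring the NVIDIA dependency
rpm -ivh --nodeps slurm-slurmctld-20.02.*.rpm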
[slurm-users] slurmdbd does not work
Hi everyone, I am having trouble getting slurmdbd to work. This is the error I get:

error: Couldn't find the specified plugin name for accounting_storage/mysql looking at all files
error: cannot find accounting_storage plugin for accounting_storage/mysql
error: cannot create accounting_storage context for accounting_storage/mysql
fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

I have installed mysql (apt install mysql) on Ubuntu 20.04.03 and followed the instructions on the slurm website (https://slurm.schedmd.com/accounting.html); mysql is running (port 3306) and these are the relevant parts in my .conf files:

slurm.conf
# LOGGING AND ACCOUNTING
AccountingStorageHost=localhost
AccountingStoragePort=3306
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log

slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=3306
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
PluginDir=/usr/lib/slurm
SlurmUser=slurm
StoragePass=password
StorageType=accounting_storage/mysql
StorageUser=slurm
StorageLoc=slurm_acct_db

I changed the port to 3306 because otherwise slurmdbd could not communicate with mysql. If I run sacct, for example, I get:

sacct: error: _slurm_persist_recv_msg: read of fd 3 failed: No error
sacct: error: _slurm_persist_recv_msg: only read 126 of 2616 bytes
sacct: error: slurm_persist_conn_open: No response to persist_init
sacct: error: Sending PersistInit msg: No error
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: _slurm_persist_recv_msg: read of fd 3 failed: No error
sacct: error: _slurm_persist_recv_msg: only read 126 of 2616 bytes
sacct: error: Sending PersistInit msg: No error
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error

Does anyone have a suggestion to solve this problem? Thank you very much. Best, Giuseppe
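One detail worth double-checking, offered as an observation rather than a confirmed fix for this thread: DbdPort is the port slurmdbd itself listens on (default 6819), while StoragePort is where MySQL/MariaDB listens, and slurm.conf's AccountingStoragePort should point at slurmdbd, not at the database. A sketch of the usual split:

# slurmdbd.conf
DbdHost=localhost
DbdPort=6819                 # slurmdbd's own listening port
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306             # MySQL/MariaDB port
StorageUser=slurm
StoragePass=password
StorageLoc=slurm_acct_db
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePort=6819   # must match DbdPort, not the MySQL port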
Re: [slurm-users] slurmdbd does not work
Thanks for the answer, Brian. I now added --with-mysql_config=/etc/mysql/my.cnf, but the problem is still there and now also slurmctld does not work, with the error: [2021-12-03T15:36:41.018] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd [2021-12-03T15:36:41.019] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.019] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.019] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.020] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.020] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.020] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.020] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: DBD_GET_TRES failure: No error [2021-12-03T15:36:41.021] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.021] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.021] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.021] error: DBD_GET_QOS failure: No error [2021-12-03T15:36:41.021] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.021] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.021] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.021] error: DBD_GET_USERS failure: No error [2021-12-03T15:36:41.022] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.022] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.022] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.022] error: DBD_GET_ASSOCS failure: No error [2021-12-03T15:36:41.022] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.022] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.022] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.022] error: DBD_GET_RES failure: No error [2021-12-03T15:36:41.022] fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files. On Thu, Dec 2, 2021 at 10:36 PM Brian Andrus wrote: > > Your slurm needs built with the support. If you have mysql-devel installed > it should pick it up, otherwise you can specify the location with > --with-mysql when you configure/build slurm > > Brian Andrus > On 12/2/2021 12:40 PM, Giuseppe G. A. Celano wrote: > > Hi everyone, > > I am having trouble getting *slurmdbd* to work. 
Re: [slurm-users] slurmdbd does not work
The problem is the lack of /usr/lib/slurm/accounting_storage_mysql.so I have installed many mariadb-related packages, but that file is not created by slurm after installation: is there a point in the documentation where the installation procedure for the database is made explicit? On Fri, Dec 3, 2021 at 5:15 PM Brian Andrus wrote: > You will need to also reinstall/restart slurmdbd with the updated binary. > > Look in the slurmdbd logs to see what is happening there. I suspect it had > errors updating/creating the database and tables. If you have no data in it > yet, you can just DROP the database and restart slurmdbd. > > Brian Andrus
Re: [slurm-users] [EXT] Re: slurmdbd does not work
After installation of libmariadb-dev, I have reinstalled the entire slurm with ./configure + options, make, and make install. Still, accounting_storage_mysql.so is missing. On Sat, Dec 4, 2021 at 12:24 AM Sean Crosby wrote: > Did you run > > ./configure (with any other options you normally use) > make > make install > > on your DBD server after you installed the mariadb-devel package?
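A quick way to confirm whether ./configure actually picked up the MySQL/MariaDB client library before building, sketched with paths relative to the source tree (the exact configure output wording may differ):

grep -i mysql config.log | head
# if detection worked, the plugin is built here before "make install" copies it to <prefix>/lib/slurm/
ls src/plugins/accounting_storage/mysql/.libs/accounting_storage_mysql.so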
Re: [slurm-users] [EXT] Re: slurmdbd does not work
10.4.22 On Sat, Dec 4, 2021 at 1:35 AM Brian Andrus wrote: > Which version of Mariadb are you using? > > Brian Andrus
Re: [slurm-users] [EXT] Re: slurmdbd does not work
I have installed almost all of the possible packages, but that file doesn't show up: libdbd-mariadb-perl/focal,now 1.11-3ubuntu2 amd64 [installed] libmariadb-dev-compat/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadb-dev/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadb3-compat/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadb3/unknown,now 1:10.4.22+maria~focal amd64 [installed,automatic] libmariadbclient18/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadbd-dev/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadbd19/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-client-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed,automatic] mariadb-client-core-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-client/unknown,unknown,unknown,now 1:10.4.22+maria~focal all [installed] mariadb-common/unknown,unknown,unknown,now 1:10.4.22+maria~focal all [installed] mariadb-plugin-connect/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-server-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-server-core-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-server/unknown,unknown,unknown,now 1:10.4.22+maria~focal all [installed] odbc-mariadb/focal,now 3.1.4-1 amd64 [installed] On Sat, Dec 4, 2021 at 2:06 AM Sean Crosby wrote: > Try installing the libmariadb-dev-compat package and trying the > configure/make again. It provides "libmysqlclient.so", whereas > libmariadb-dev provides "libmariadb.so"
Re: [slurm-users] [EXT] Re: slurmdbd does not work
Hi Gennaro, That helped: slurm-wlm has accounting_storage_mysql.so, and I moved it to the location requested by the first slurm installation. Everything seems to work, even if I had to change the location of the .conf files, probably because this is required by the new slurm-wlm installation. I am not sure whether I should try to uninstall my previous installation and reinstall slurm-wlm... On Sat, Dec 4, 2021 at 12:38 PM Gennaro Oliva wrote: > Ciao Giuseppe, > > On Sat, Dec 04, 2021 at 02:30:40AM +0100, Giuseppe G. A. Celano wrote: > > I have installed almost all of the possible packages, but that file > doesn't > > show up: > > can you please specify what options are you using with ./configure? > > If you don't specify any prefix (--prefix option), the default location > for your installation is /usr/local, so you should find the plugins under > /usr/local/lib/slurm > > Did you tried the slurm-wlm package shipped with ubuntu? > It comes with the mysql plugin. > Best regards > -- > Gennaro Oliva > >
Re: [slurm-users] [EXT] Re: slurmdbd does not work
Hi, I have reinstalled slurm using the ubuntu package slurm-wlm (and some related ones). After solving some problems with the directories where the pid files are stored (I kept getting the message "Can't open PID file /run/slurm/slurmd.pid (yet?) after start: Operation not permitted", even though the directory has slurm as owner and group), the services slurmdbd, slurmctld, and slurmd now work, but I cannot use the commands sinfo, srun, etc., because I get the errors: sinfo: symbol lookup error: sinfo: undefined symbol: slurm_conf srun: symbol lookup error: srun: undefined symbol: xfree_ptr sacct: symbol lookup error: sacct: undefined symbol: slurm_destroy_selected_step Does anyone know the reason for that? Thanks. Best, Giuseppe
Re: [slurm-users] [EXT] Re: slurmdbd does not work
Grazie Gennaro, It's working! On Mon, Dec 6, 2021 at 9:41 AM Gennaro Oliva wrote: > Ciao Giuseppe, > > On Mon, Dec 06, 2021 at 03:46:02AM +0100, Giuseppe G. A. Celano wrote: > > sinfo: symbol lookup error: sinfo: undefined symbol: slurm_conf > > srun: symbol lookup error: srun: undefined symbol: xfree_ptr > > sacct: symbol lookup error: sacct: undefined symbol: > > slurm_destroy_selected_step > > > > Does anyone know the reason for that? Thanks. > > please check that you are using the client tools from the slurm package > and not those coming from the source installation. The command: > > which srun > > should return /usr/bin/srun and not /usr/local/bin/srun > > In the latter case remove everyting related to slurm under /usr/local > > /usr/local/share/doc/slurm* > /usr/local/sbin/slurm* > /usr/local/lib/libslurm* > /usr/local/lib/slurm > /usr/local/include/slurm > > /usr/local/bin/scancel > /usr/local/bin/sprio > /usr/local/bin/sdiag > /usr/local/bin/srun > /usr/local/bin/squeue > /usr/local/bin/sbcast > /usr/local/bin/sview > /usr/local/bin/salloc > /usr/local/bin/scontrol > /usr/local/bin/sreport > /usr/local/bin/sbatch > /usr/local/bin/strigger > /usr/local/bin/sacctmgr > /usr/local/bin/sacct > /usr/local/bin/sattach > /usr/local/bin/scrontab > /usr/local/bin/sh5util > /usr/local/bin/sstat > /usr/local/bin/sinfo > /usr/local/bin/sshare > > Look also for files under: > > /usr/local/share/man/ > > Best regards, > -- > Gennaro Oliva > >
[slurm-users] GPU configuration
Hi, My cluster has 2 nodes, with the first having 2 gpus and the second 1 gpu. The states of both nodes are "drained" because "gres/gpu count reported lower than configured": any idea why this happens? Thanks. My .conf files are:

slurm.conf
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=technician Gres=gpu:2 CPUs=28 RealMemory=128503 Boards=1 SocketsPerBoard=1 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN
NodeName=worker0 Gres=gpu:1 CPUs=12 RealMemory=15922 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

gres.conf
NodeName=technician Name=gpu File=/dev/nvidia[0-1]
NodeName=worker0 Name=gpu File=/dev/nvidia0

Best, Giuseppe
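That drain reason normally means slurmd on the node detected fewer GPUs than slurm.conf declares; a few checks, as a sketch to be run on the nodes as root:

# do the devices gres.conf points at actually exist, and do they match the Gres= counts?
ls -l /dev/nvidia[0-9]*
# run slurmd in the foreground briefly and watch what it reports for gres (Ctrl-C to stop)
slurmd -D -vvv 2>&1 | grep -i gres
# compare configured vs. reported gres and the drain reason, then clear the drain once counts agree
scontrol show node technician | grep -i -e gres -e reason
scontrol update NodeName=technician,worker0 State=RESUME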
[slurm-users] Can't start slurmdbd
Hi, Slurm 17.02.3 was installed on my cluster some time ago but recently I decided to use SlurmDBD for the accounting. After installing several packages (slurm-devel, slurm-munge, slurm-perlapi, slurm-plugins, slurm-slurmdbd and slurm-sql) and MariaDB in CentOS 7, I created an SQL database: mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost' -> identified by 'some_pass' with grant option; mysql> create database slurm_acct_db; and configured the slurmdbd.conf file: AuthType=auth/munge DbdAddr=localhost DbdHost=localhost SlurmUser=slurm DebugLevel=4 LogFile=/var/log/slurm/slurmdbd.log PidFile=/var/run/slurmdbd.pid StorageType=accounting_storage/mysql StorageHost=localhost StoragePass=some_pass StorageUser=slurm StorageLoc=slurm_acct_db Then, I stopped the slurmctl daemon on the head node of my cluster and tried to start `slurmdbd`, but I got the following: $ systemctl start slurmdbd Job for slurmdbd.service failed because the control process exited with error code. See "systemctl status slurmdbd.service" and "journalctl -xe" for details. $ systemctl status slurmdbd.service ● slurmdbd.service - Slurm DBD accounting daemon Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since lun 2017-11-20 10:39:26 CET; 53s ago Process: 27592 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE) nov 20 10:39:26 login_node systemd[1]: Starting Slurm DBD accounting daemon... nov 20 10:39:26 login_node systemd[1]: slurmdbd.service: control process exited, code=exited status=1 nov 20 10:39:26 login_node systemd[1]: Failed to start Slurm DBD accounting daemon. nov 20 10:39:26 login_node systemd[1]: Unit slurmdbd.service entered failed state. nov 20 10:39:26 login_node systemd[1]: slurmdbd.service failed. $ journalctl -xe nov 20 10:39:26 login_node polkitd[1078]: Registered Authentication Agent for unix-process:27586:119889015 (system bus name :1.871 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /or nov 20 10:39:26 login_node systemd[1]: Starting Slurm DBD accounting daemon... -- Subject: Unit slurmdbd.service has begun start-up -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit slurmdbd.service has begun starting up. nov 20 10:39:26 login_node systemd[1]: slurmdbd.service: control process exited, code=exited status=1 nov 20 10:39:26 login_node systemd[1]: Failed to start Slurm DBD accounting daemon. -- Subject: Unit slurmdbd.service has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit slurmdbd.service has failed. -- -- The result is failed. nov 20 10:39:26 login_node systemd[1]: Unit slurmdbd.service entered failed state. nov 20 10:39:26 login_node systemd[1]: slurmdbd.service failed. 
nov 20 10:39:26 login_node polkitd[1078]: Unregistered Authentication Agent for unix-process:27586:119889015 (system bus name :1.871, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, nov 20 10:40:06 login_node gmetad[1519]: data_thread() for [HPCSIE] failed to contact node 192.168.2.10 nov 20 10:40:06 login_node gmetad[1519]: data_thread() got no answer from any [HPCSIE] datasource nov 20 10:40:13 login_node dhcpd[2320]: DHCPREQUEST for 192.168.2.19 from XX:XX:XX:XX:XX:XX via enp6s0f1 nov 20 10:40:13 login_node dhcpd[2320]: DHCPACK on 192.168.2.19 to XX:XX:XX:XX:XX:XX via enp6s0f1 nov 20 10:40:39 login_node dhcpd[2320]: DHCPREQUEST for 192.168.2.13 from XX:XX:XX:XX:XX:XX via enp6s0f1 nov 20 10:40:39 login_node dhcpd[2320]: DHCPACK on 192.168.2.13 to XX:XX:XX:XX:XX:XX via enp6s0f1 I've just found out the file `/var/run/slurmdbd.pid` does not even exist. I'd appreciate any hint on this issue. Thanks
Re: [slurm-users] Can't start slurmdbd
I did that but got the same errors. slurmdbd.log contains by the way the following: [2017-11-20T12:39:04.178] error: Couldn't find the specified plugin name for accounting_storage/mysql looking at all files [2017-11-20T12:39:04.179] error: cannot find accounting_storage plugin for accounting_storage/mysql [2017-11-20T12:39:04.179] error: cannot create accounting_storage context for accounting_storage/mysql [2017-11-20T12:39:04.179] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin It seems it lacks the accounting_storage_mysql.so: $ ls /usr/lib64/slurm/accounting_storage_* /usr/lib64/slurm/accounting_storage_filetxt.so /usr/lib64/slurm/accounting_storage_none.so /usr/lib64/slurm/accounting_storage_slurmdbd.so However, I did install the slurm-sql rpm package. Any idea about what's failing? Thanks On 20/11/17 12:11, Lachlan Musicman wrote: On 20 November 2017 at 20:50, Juan A. Cordero Varelaq wrote: I've just found out the file `/var/run/slurmdbd.pid` does not even exist. The pid file is the "process id" - it's only there if the process is running. So when slurmdbd is not running, it wont be there. Supposedly. Sometimes I do "touch /var/run/slurmdbd.pid" and try again? I've also found that using the host's short name is preferable to localhost. Make sure the host's short name is in /etc/hosts too. hostname -s will give you the short name Cheers L.
Re: [slurm-users] Can't start slurmdbd
I guess mariadb-devel was not installed by the time another person installed slurm. I have a bunch of slurm-* rpms I installed using "yum localinstall ...". Should I install them in another way or remove slurm? The file accounting_storage_mysql.so is by the way absent on the machine. Thanks On 20/11/17 21:52, Lachlan Musicman wrote: Also - make sure you have MariaDB-devel when you make the RPMs - that's the first bit. The second bit is you might have to find the accounting_storage_mysql.so and place it in /usr/lib64/slurm. I think it might end up in /path/to/rpmbuild/BUILD/sec/plugins/accounting/.libs/ or something like that Cheers L. On 21 November 2017 at 06:35, Philip Kovacs wrote: Try adding this to your conf: PluginDir=/usr/lib64/slurm On Monday, November 20, 2017 6:48 AM, Juan A. Cordero Varelaq wrote: I did that but got the same errors.
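For completeness, a hedged sketch of the rebuild route Lachlan describes; the version and package file names are illustrative and vary with the spec in use:

# on the build host (CentOS 7 here), make sure the MariaDB headers are present before building
yum install mariadb-devel
rpmbuild -ta slurm-17.02.11.tar.bz2
# the mysql accounting plugin should now be packaged; verify before installing
rpm -qlp ~/rpmbuild/RPMS/x86_64/slurm-*.rpm | grep accounting_storage_mysql.so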
[slurm-users] which daemons should I restart when editing slurm.conf
Hi, I have the following configuration: * head node: hosts the slurmctld and the slurmdbd daemons. * compute nodes (4): host the slurmd daemons. I need to change a couple of lines of the slurm.conf corresponding to the slurmctld. If I restart its service, should I also have to restart the slurmdbd on the head node and the slurmd daemons on compute nodes? Thanks
[slurm-users] Changing resource limits while running jobs
Hi, A couple of jobs have been running for almost one month and I would like to change resource limits to prevent users from running for so long. Besides, I'd like to set AccountingStorageEnforce to qos,safe. If I make such changes, would the running jobs be stopped (the user running the jobs still has no account and therefore should not be allowed to run anything if AccountingStorageEnforce is set)? Thanks
Re: [slurm-users] Changing resource limits while running jobs
And could I restart the slurmctld daemon without affecting such running jobs? On 04/01/18 15:56, Paul Edmon wrote: Typically changes like this only impact pending or newly submitted jobs. Running jobs usually are not impacted, though they will count against any new restrictions that you put in place. -Paul Edmon- On 1/4/2018 6:44 AM, Juan A. Cordero Varelaq wrote: Hi, A couple of jobs have been running for almost one month and I would like to change resource limits to prevent users from running so much time. Besides, I'd like to set AccountingStorageEnforce to qos,safe. If I make such changes would the running jobs be stopped (the user running the jobs has still no account and therefore, should not be allowed to run anything if AccountingStorageEnforce is set)? Thanks* *
[slurm-users] restrict application to a given partition
Dear Community, I have a node (20 cores) on my HPC with two different partitions: big (16 cores) and small (4 cores). I have installed software X on this node, but I want only one partition to have rights to run it. Is it then possible to restrict the execution of a specific application to a given partition on a given node? Thanks
Re: [slurm-users] restrict application to a given partition
But what if the user knows the path to such application (let's say python command) and executes it on the partition he/she should not be allowed to? Is it possible through lua scripts to set constrains on software usage such as a limited shell, for instance? In fact, what I'd like to implement is something like a limited shell, on a particular node for a particular partition and a particular program. On 12/01/18 17:39, Paul Edmon wrote: You could do this using a job_submit.lua script that inspects for that application and routes them properly. -Paul Edmon- On 01/12/2018 11:31 AM, Juan A. Cordero Varelaq wrote: Dear Community, I have a node (20 Cores) on my HPC with two different partitions: big (16 cores) and small (4 cores). I have installed software X on this node, but I want only one partition to have rights to run it. Is it then possible to restrict the execution of an specific application to a given partition on a given node? Thanks
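A minimal job_submit.lua sketch along the lines Paul suggests; "softwareX" and the partition name are placeholders, and only batch scripts can be inspected this way, so treat it as a routing aid rather than a watertight control:

-- job_submit.lua, placed next to slurm.conf (requires JobSubmitPlugins=lua)
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.script ~= nil and string.find(job_desc.script, "softwareX", 1, true) then
        -- route anything that calls softwareX to the partition where it is allowed
        job_desc.partition = "small"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end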
Re: [slurm-users] restrict application to a given partition
I ended up with a simpler solution: I tweaked the program executable (a bash script) so that it inspects which partition it is running on, and if it is the wrong one, it exits. I just added the following lines:

if [ "$SLURM_JOB_PARTITION" == 'big' ]; then
    exit_code=126
    /bin/echo "PROGRAM failed with exit code $exit_code. PROGRAM was executed on a wrong SLURM partition."
    exit $exit_code
fi

On 15/01/18 16:03, Paul Edmon wrote: This sounds like a use case for Singularity. http://singularity.lbl.gov/ You could use the Lua script to restrict what is permitted to run by barring anything that isn't a specific Singularity script. Otherwise you could use prolog scripts as an emergency fallback in case the lua script doesn't catch it. -Paul Edmon- On 1/15/2018 8:31 AM, John Hearns wrote: Juan, my knee-jerk reaction is to say 'containerisation' here. However, I guess that means that Slurm would have to be able to inspect the contents of a container, and I do not think that is possible. I may be very wrong here. Anyone? However, have a look at the XALT stuff from TACC https://www.tacc.utexas.edu/research-development/tacc-projects/xalt https://github.com/Fahey-McLay/xalt XALT is intended to instrument your cluster and collect information on what software is being run and exactly which libraries are being used. I do not think it has any option for "Nope! You may not run this executable on this partition", but it might be worth contacting the authors and discussing this. On 15 January 2018 at 14:20, Juan A. Cordero Varelaq <bioinformatica-i...@us.es> wrote: But what if the user knows the path to that application (let's say the python command) and executes it on the partition he/she should not be allowed to use? Is it possible through lua scripts to set constraints on software usage, such as a limited shell, for instance? In fact, what I'd like to implement is something like a limited shell, on a particular node, for a particular partition and a particular program. On 12/01/18 17:39, Paul Edmon wrote: You could do this using a job_submit.lua script that inspects for that application and routes it properly. -Paul Edmon- On 01/12/2018 11:31 AM, Juan A. Cordero Varelaq wrote: Dear Community, I have a node (20 cores) on my HPC system with two different partitions: big (16 cores) and small (4 cores). I have installed software X on this node, but I want only one partition to have the right to run it. Is it then possible to restrict the execution of a specific application to a given partition on a given node? Thanks
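A variation on the same idea, sketched with placeholder paths and partition names: instead of patching the program itself, install a small wrapper in its place that only execs the real binary when the job sits in an allowed partition. This is only an illustration of the approach above, not a tested production script.

#!/bin/bash
# Hypothetical wrapper installed in place of software X; the path below is a placeholder.
REAL_PROG=/opt/X/bin/X.real

case "$SLURM_JOB_PARTITION" in
    small)
        # allowed partition: hand over to the real program with the original arguments
        exec "$REAL_PROG" "$@"
        ;;
    *)
        echo "X may only be run in the 'small' partition (got: ${SLURM_JOB_PARTITION:-none})" >&2
        exit 126
        ;;
esac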
[slurm-users] constrain partition to a unique shell
Dear users, I would like to force the use of only one type of shell, let's say, bash, on a partition that shares a node with another one. Do you know if it's possible to do it? What I actually want to do is to install a limited shell (lshell) on one node and force a given partition to be able to use ONLY that shell, so that the users can run a small set of commands. Thanks
[slurm-users] Tracking costs - variable costs per partition
Hello - We're in a similar situation to the one described here: https://groups.google.com/g/slurm-users/c/eBDslkwoFio where we want to track (and control) costs on a fairly heterogeneous system with different billing weights per partition. The solution proposed seems like it would work rather well, except that our use of fairshare seems to interfere with the billing values we would want to use to limit usage based on the credits granted. We have PriorityDecayHalfLife set on our system, so that billing value (GrpTRESRaw) decays over time. Is there a way to implement something similar on an otherwise fairshare-based system? Thanks, Jeff
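One direction that might reconcile the two, sketched with placeholder node names, weights and limits, and not verified on such a setup: keep fairshare decay for scheduling priority, but hang the credit cap on a QOS created with the NoDecay flag, so the GrpTRESMins accounting it enforces (including the billing TRES produced by per-partition TRESBillingWeights) is not decayed by PriorityDecayHalfLife.

# slurm.conf: per-partition billing weights (node lists and weights are placeholders)
PartitionName=big   Nodes=node[01-04] TRESBillingWeights="CPU=2.0,Mem=0.5G"
PartitionName=small Nodes=node[05-08] TRESBillingWeights="CPU=1.0,Mem=0.25G"

# a credit QOS whose usage is not decayed; 100000 billing-minutes is a placeholder
sacctmgr add qos credits Flags=NoDecay GrpTRESMins=billing=100000
sacctmgr modify account lab set QOS+=credits DefaultQOS=credits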