[slurm-users] ubuntu 16.04 > 18.04
Thinking about upgrading to Ubuntu 18.04 on my workstation, where I am running a single-node slurm setup. Any issues anyone has run across in the upgrade? Thanks! ashton
[slurm-users] swap size
I have a single-node slurm config on my workstation (18 cores, 256 GB RAM, 40 TB disk space). I recently extended the array size to its current config and am reconfiguring my LVM logical volumes. I'm curious about people's thoughts on swap sizes for a node. Red Hat these days recommends up to 20% of RAM for swap, but no less than 4 GB. But according to the slurm FAQ: "Suspending and resuming a job makes use of the SIGSTOP and SIGCONT signals respectively, so swap and disk space should be sufficient to accommodate all jobs allocated to a node, either running or suspended." So I'm wondering if 20% is enough, or whether it should scale with the number of jobs I might be running at any one time. E.g. if I'm running 10 jobs that each use 20 GB of RAM and I suspend them, would I need 200 GB of swap? Any thoughts? -ashton
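A rough back-of-the-envelope check for that sizing question can be scripted; the job count and per-job memory below are placeholders for whatever is actually running, and free/swapon just report the current situation:

jobs=10; mem_per_job_gb=20
echo "worst-case suspended footprint: $((jobs * mem_per_job_gb)) GB"
free -h         # current RAM and swap usage
swapon --show   # active swap devices and their sizes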
Re: [slurm-users] swap size
Hi John! Thanks for the reply, lots to think about. In terms of suspending/resuming, my situation might be a bit different from other people's. As I mentioned, this is an install on a single-node workstation. This is my daily office machine. I run a lot of Python processing scripts that have low CPU needs but lots of iterations. I found it easier to manage these in slurm than to write MPI/parallel processing routines in Python directly. Given this, sometimes I might submit a slurm array with 10K jobs that might take a week to run, but I still sometimes need to do work during the day that requires more CPU power. In those cases I suspend the background array, crank through whatever I need to do and then resume in the evening when I go home. Sometimes I can wait for jobs to finish, sometimes I have to break in the middle of running jobs. On Fri, Sep 21, 2018, 10:07 PM John Hearns wrote: > Ashton, on a compute node with 256 Gbytes of RAM I would not > configure any swap at all. None. > I managed an SGI UV1 machine at an F1 team which had 1 Tbyte of RAM - > and no swap. > Also our ICE clusters were diskless - SGI very smartly configured swap > over iSCSI - but we disabled this, the reason being that if one node > in a job starts swapping the likelihood is that all the nodes are > swapping, and things turn to treacle from there. > Also, as another issue, if you have lots of RAM you need to look at > the vm tunings for dirty ratio, background ratio and centisecs. Linux > will aggressively cache data which is written to disk - you can get a > situation where your processes THINK data is written to disk but it is > cached, then what happens if there is a power loss? So get those > caches flushed often. > > https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/ > > Oh, and my other tip. In the past vm.min_free_kbytes was ridiculously > small on default Linux systems. I call this the 'wriggle room' when a > system is short on RAM. Think of it like those square sliding-letter > puzzles - min_free_kbytes is the empty square which permits the letter > tiles to move. > So look at your min_free_kbytes and increase it (if I'm not wrong, in > RH7 and CentOS 7 systems it is a reasonable value already). > https://bbs.archlinux.org/viewtopic.php?id=184655 > > Oh, and it is good to keep a terminal open with 'watch cat > /proc/meminfo'. I have spent many a happy hour staring at that when > looking at NFS performance etc. etc. > > Back to your specific case. My point is that for HPC work you should > never go into swap (with a normally running process, i.e. no job > pre-emption). I find that 20 percent rule is out of date. Yes, > probably you should have some swap on a workstation. And yes, disk > space is cheap these days. > > However, you do talk about job pre-emption and suspending/resuming > jobs. I have never actually seen that being used in production. > At this point I would be grateful for some education from the choir - > is this commonly used and am I just hopelessly out of date? > Honestly, anywhere I have managed systems, lower-priority jobs are > either allowed to finish, or in the case of F1 we checkpointed and > killed low-priority jobs manually if there was a super-high-priority > job to run.
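For reference, the dirty-cache and min_free_kbytes tunings mentioned above are plain sysctls; the values below are purely illustrative, not recommendations for any particular machine:

# inspect the current values
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_writeback_centisecs vm.min_free_kbytes
# example settings: start background writeback earlier and keep more free headroom
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -w vm.dirty_ratio=10
sudo sysctl -w vm.dirty_writeback_centisecs=100
sudo sysctl -w vm.min_free_kbytes=262144
# watch the effect while jobs run
watch cat /proc/meminfo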
Re: [slurm-users] swap size
Ray, I'm also on Ubuntu. I'll try the same test, but do it with and without swap on (e.g. by running the swapoff and swapon commands first). To complicate things, I also don't know if the swappiness level makes a difference. Thanks Ashton On Sun, Sep 23, 2018, 7:48 AM Raymond Wan wrote: > Hi Chris, > On Sunday, September 23, 2018 09:34 AM, Chris Samuel wrote: > > On Saturday, 22 September 2018 4:19:09 PM AEST Raymond Wan wrote: > >> SLURM's ability to suspend jobs must be storing the state in a > >> location outside of this 512 GB. So, you're not helping this by > >> allocating more swap. > > > > I don't believe that's the case. My understanding is that in this mode it's > > just sending processes SIGSTOP and then launching the incoming job, so you > > should really have enough swap for the previous job to get swapped out to in > > order to free up RAM for the incoming job. > > Hmm, I'm way out of my comfort zone but I am curious > about what happens. Unfortunately, I don't think I'm able > to read kernel code, but someone here > (https://stackoverflow.com/questions/31946854/how-does-sigstop-work-in-linux-kernel) > seems to suggest that SIGSTOP and SIGCONT move a process > between the runnable and waiting queues. > > I'm not sure if I did the correct test, but I wrote a C > program that allocated a lot of memory: > > -
> #include <stdlib.h>
>
> #define memsize 16000
>
> int main () {
>   char *foo = NULL;
>
>   foo = (char *) malloc (sizeof (char) * memsize);
>
>   for (int i = 0; i < memsize; i++) {
>     foo[i] = 0;
>   }
>
>   do {
>   } while (1);
> }
> -
> > Then, I ran it and sent a SIGSTOP to it. According to htop > (I don't know if it's correct), it seems to still be > occupying memory, but just not any CPU cycles. > > Perhaps I've done something wrong? I did read elsewhere > that how SIGSTOP is treated can vary from system to > system... I happen to be on an Ubuntu system. > > Ray
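For anyone repeating Ray's test, the stop/continue cycle and the memory check can be driven from the shell; 'memtest' is just a placeholder name for the compiled program above:

gcc -o memtest memtest.c      # assuming the listing above is saved as memtest.c
./memtest &
pid=$!
kill -STOP "$pid"             # the same signal Slurm's suspend uses
grep -E 'State|VmRSS|VmSwap' /proc/"$pid"/status   # State becomes "T (stopped)"; VmRSS stays resident unless the kernel swaps it out
kill -CONT "$pid"
kill "$pid"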
Re: [slurm-users] Priority wait
I'm guessing you should have sent them to cluster Decepticon instead. In all seriousness though, provide the conf file. You might have accidentally set a maximum number of running jobs somewhere. On Nov 13, 2017 7:28 AM, "Benjamin Redling" wrote: > Hi Roy, > > On 11/13/17 2:37 PM, Roe Zohar wrote: > [...] > >> I sent 3000 jobs with feature Optimus and part are running while part are >> pending. Which is ok. >> But I have sent 1000 jobs to Megatron and they are all pending, stating >> they wait because of priority. Why is that? >> >> B.t.w. if I change their priority to a higher one, they start to run on >> Megatron. >> > > my guess is: if you can provide the slurm.conf of that cluster, the > probability anyone will sacrifice his spare time for you will increase > significantly. > > Regards, > Benjamin > -- > FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html > ☎ +49 3641 9 44323
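If it helps, the usual places where such a cap would hide can be checked like this (a sketch; the account and QOS names depend on the site):

scontrol show config | grep -i maxjobcount
sacctmgr show assoc format=cluster,account,user,partition,maxjobs,maxsubmit,grptres
sacctmgr show qos format=name,maxjobspu,maxsubmitpu,grpjobs,grptres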
[slurm-users] Changing node weights in partitions
Dear all, I would like to create two partitions, A and B, in which node1 has a certain weight in partition A and a different one in partition B. Does anyone know how to implement it? Thanks very much for the help! Cheers, José
Re: [slurm-users] Changing node weights in partitions
Dear Ole, Thanks for your fast reply. I really appreciate that. I had a look at your website and googled about “weight masks” but still have some questions. From your example I see that the mask definition is commented out. How do I define what the mask means? If it helps, I’ll put an easy example. Node1 has more RAM and clock frequency than node2. Partition A should start filling node1, while partition B should start filling node2. Can I accomplish this behavior by weighting the nodes? With your example I’m afraid to say it’s still not clear to me how. Thanks a lot for your help. José > On 22. Mar 2019, at 16:29, Ole Holm Nielsen > wrote: > >> On 3/22/19 4:15 PM, José A. wrote: >> Dear all, >> I would like to create two partitions, A and B, in which node1 had a certain >> weight in partition A and a different one in partition B. Does anyone know >> how to implement it? > > Some pointers to documentation of this and a practical example are in my Wiki > page: > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-weight > > /Ole >
Re: [slurm-users] Changing node weights in partitions
Hello Chris, You got my point. I want a way in which a partition influences the priority with which a node takes new jobs. Any tip will be really appreciated. Thanks a lot. Cheers, José > On 23. Mar 2019, at 03:38, Chris Samuel wrote: > >> On 22/3/19 12:51 pm, Ole Holm Nielsen wrote: >> >> The web page explains how the weight mask is defined: Each digit in the mask >> defines a node property. Please read the example given. > > I don't think that's what José is asking for, he wants the weights for a node > to be different when being considered in one partition to when it's being > considered in a different partition. > > I don't think you can do that though I'm afraid, José, I think the weight is > only attached to the node and the partition doesn't influence it. > > All the best, > Chris > -- > Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA >
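For completeness, this is what the node-level weight looks like in slurm.conf; the node names and values are made up, and, as noted above, the same weight applies in every partition the node belongs to (lower-weight nodes are filled first):

NodeName=node1 Weight=10 CPUs=32 RealMemory=256000
NodeName=node2 Weight=20 CPUs=32 RealMemory=128000
PartitionName=A Nodes=node1,node2 Default=YES State=UP
PartitionName=B Nodes=node1,node2 State=UP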
Re: [slurm-users] Changing node weights in partitions
Dear Ole, Thanks for the support. I think that could help me in the following way: 1. Setting different partitions with the node groups I want to prioritize. 2. Allowing users to submit to several partitions at the same time. 3. Through accounting, creating accounts with different priorities from one partition to another. That will allow each job type, associated with an account, to start differently in different partitions. 4. Once a job starts in one partition, the other submitted jobs are killed and removed from SLURM. It’s a bit more work but gets the effect I am looking for: that different nodes prioritize different types of jobs. Is that, especially step 4, possible? Thanks for the help. José > On 24. Mar 2019, at 21:52, Ole Holm Nielsen > wrote: > > Hi José, > >> On 23-03-2019 19:59, Jose A wrote: >> You got my point. I want a way in which a partition influences the priority >> with a node takes new jobs. >> Any tip will be really appreciated. Thanks a lot. > > Would PriorityWeightPartition as defined with the Multifactor Priority Plugin > (https://slurm.schedmd.com/priority_multifactor.html) help you? > > See also my summary in > https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler#multifactor-priority-plugin-scheduler > > /Ole >
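A related note on step 4: when a single job is submitted with a comma-separated list of partitions, Slurm runs it in whichever listed partition can start it first, so there are no leftover sibling jobs to kill. That may give much of the desired effect without the cleanup step; a sketch with placeholder partition names:

sbatch --partition=A,B job.sh
# or inside the job script:
#SBATCH --partition=A,B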
[slurm-users] SLURM in Virtual Machine
Dear all, In the expansion of our cluster we are considering installing SLURM within a virtual machine in order to simplify updates and reconfigurations. Do any of you have experience running SLURM in VMs? I would really appreciate it if you could share your ideas and experiences. Thanks a lot. Cheers José
Re: [slurm-users] SLURM in Virtual Machine
Dear all, thank you for your fast feedback. My initial idea was to run slurmctld and slurmdbd in their own KVM guests while keeping the worker nodes physical. From what I see, that is a setup that works without problems. However, I also find interesting some of the suggestions that you mentioned, like having worker nodes in VMs for testing and compilation purposes, or even the login node. I will give some thought to that. Thanks a lot for the support. You are great. -- José On 12. September 2019 at 19:46:36, Brian Andrus (toomuc...@gmail.com) wrote: Well, technically I have run several clusters all in VMs because it is all in the cloud. I think the main issue would be how resources are allocated and the need. Given the choice, I would not run nodes in VMs because the hypervisor inherently adds overhead that could be used for compute. However, there are definite use cases that make it worthwhile. So long as you allocate enough resources for the node (be it the controller or other) you will be fine. Brian Andrus
[slurm-users] srun seg faults immediately from within sbatch but not salloc
Dear all, I am trying to set up a small cluster running slurm on Ubuntu 16.04. I installed slurm-17.11.5 along with pmix-2.1.1 on an NFS-shared partition. Installation seems fine. Munge is taken from the system package. Something like this: ./configure --prefix=/software/slurm/slurm-17.11.5 --exec-prefix=/software/slurm/Gnu --with-pmix=/software/pmix --with-munge=/usr --sysconfdir=/software/slurm/etc One of the nodes is also the control host and runs both slurmctld and slurmd (but the issue is there also if this is not the case). I start daemons manually at the moment (slurmctld first). My configuration file looks like this (I removed the node-specific parts): SlurmdUser=root # AuthType=auth/munge # Epilog=/usr/local/slurm/etc/epilog FastSchedule=1 JobCompLoc=/var/log/slurm/slurm.job.log JobCompType=jobcomp/filetxt JobCredentialPrivateKey=/usr/local/etc/slurm.key JobCredentialPublicCertificate=/usr/local/etc/slurm.cert #PluginDir=/usr/local/slurm/lib/slurm # Prolog=/usr/local/slurm/etc/prolog SchedulerType=sched/backfill SelectType=select/linear SlurmUser=cadmin # this user exists everywhere SlurmctldPort=7002 SlurmctldTimeout=300 SlurmdPort=7003 SlurmdTimeout=300 SwitchType=switch/none TreeWidth=50 # # logging StateSaveLocation=/var/log/slurm/tmp SlurmdSpoolDir=/var/log/slurm/tmp/slurmd.%n.spool SlurmctldPidFile=/var/log/slurm/var/run/slurmctld.pid SlurmdPidFile=/var/log/slurm/var/run/slurmd.%n.pid SlurmctldLogFile=/var/log/slurm/slurmctld.log SlurmdLogFile=/var/log/slurm/slurmd.%n.log.%h # # job settings MaxTasksPerNode=64 MpiDefault=pmix_v2 # plugins TaskPlugin=task/cgroup There are no prolog or epilog scripts. After some fiddling with MPI, I got the system to work with interactive jobs through salloc (MPI behaves correctly for jobs occupying one or all of the nodes). sinfo produces expected results. However, as soon as I try to submit through sbatch I get an instantaneous seg fault regardless of executable (even when there is none specified, i.e., the srun command is meaningless). 
When I try to monitor slurmd in the foreground (-D), I get something like this:

slurmd: debug: Log file re-opened
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: topology NONE plugin loaded
slurmd: route default plugin loaded
slurmd: CPU frequency setting not configured for this node
slurmd: debug: Resource spec: No specialized cores configured by default on this node
slurmd: debug: Resource spec: Reserved system memory limit not configured for this node
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: debug: CPUs:64 Boards:1 Sockets:2 CoresPerSocket:16 ThreadsPerCore:2
slurmd: debug: Reading cgroup.conf file /software/slurm/etc/cgroup.conf
slurmd: debug: task/cgroup: loaded
slurmd: debug: Munge authentication plugin loaded
slurmd: debug: spank: opening plugin stack /software/slurm/etc/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: slurmd version 17.11.5 started
slurmd: debug: Job accounting gather NOT_INVOKED plugin loaded
slurmd: debug: job_container none plugin loaded
slurmd: debug: switch NONE plugin loaded
slurmd: slurmd started on Mon, 07 May 2018 23:54:31 +0200
slurmd: CPUs=64 Boards=1 Sockets=2 Cores=16 Threads=2 Memory=64062 TmpDisk=187611 Uptime=1827335 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
slurmd: debug: AcctGatherEnergy NONE plugin loaded
slurmd: debug: AcctGatherProfile NONE plugin loaded
slurmd: debug: AcctGatherInterconnect NONE plugin loaded
slurmd: debug: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/software/slurm/etc/acct_gather.conf)
slurmd: debug2: got this type of message 4005
slurmd: debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
slurmd: debug2: _group_cache_lookup_internal: no entry found for andreas
slurmd: _run_prolog: run job script took usec=5
slurmd: _run_prolog: prolog with lock for job 100 ran for 0 seconds
slurmd: Launching batch job 100 for UID 1003
slurmd: debug2: got this type of message 6011
slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
slurmd: debug: _rpc_terminate_job, uid = 1001
slurmd: debug: credential for job 100 revoked
slurmd: debug2: No steps in jobid 100 to send signal 999
slurmd: debug2: No steps in jobid 100 to send signal 18
slurmd: debug2: No steps in jobid 100 to send signal 15
slurmd: debug2: set revoke expiration for jobid 100 to 1525730207 UTS
slurmd: debug2: got this type of message 1008

Here, job 100 would be a submission script with something like:

#!/bin/bash -l
#SBATCH --job-name=FSPMXX
#SBATCH --output=/storage/andreas/camp3.out
#SBATCH --error=/storage/andreas/camp3.err
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1 --tasks-per-node=32 --ntasks-per-core=1
#SBATCH
Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc
Dear all, I tried to debug this with some apparent success (for now). If anyone cares: with the help of gdb inside sbatch, I tracked down the immediate seg fault to strcmp. I then hacked src/srun/srun.c with some info statements and isolated this function as the culprit: static void _setup_env_working_cluster(void). With my configuration, this routine ended up performing a strcmp of two NULL pointers, which seg-faults on our system (and is not language-compliant, I would think?). My current understanding is that this is a slurm bug. The issue is rectifiable by simply giving the cluster a name in slurm.conf (e.g., ClusterName=bla). I am not using slurmdbd, by the way. Hope this helps, Andreas
Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc
Hi Benjamin, thanks for getting back to me! I somehow failed to ever arrive at this page. Andreas -"slurm-users" wrote: - To: slurm-users@lists.schedmd.com From: Benjamin Matthews Sent by: "slurm-users" Date: 05/09/2018 01:20AM Subject: Re: [slurm-users] srun seg faults immediately from within sbatch but not salloc I think this should already be fixed in the upcoming release. See: https://github.com/SchedMD/slurm/commit/947bccd2c5c7344e6d09dab565e2cc6663eb9e72
[slurm-users] Seff error with Slurm-18.08.1
Hi all, I have upgraded my Slurm installation from version 17.11.0 to 18.08.1. With the previous version, 17.11.0, the seff tool was working fine, but with the 18.08.1 version, when I try to run seff I receive the following error message:

# ./seff
perl: error: plugin_load_from_file: dlopen(/usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so): /usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
perl: error: plugin_load_from_file: dlopen(/usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so): /usr/local/slurm-18.08.2/lib/slurm/accounting_storage_slurmdbd.so: undefined symbol: node_record_count
perl: error: Couldn't load specified plugin name for accounting_storage/slurmdbd: Dlopen of plugin file failed
perl: error: cannot create accounting_storage context for accounting_storage/slurmdbd
Job not found.
#

Both Slurm installations have been compiled from source on the same computer, but only the seff that was compiled with the 17.11.0 version works fine. To compile the seff tool, from the source Slurm tree:

cd contrib
make
make install

I think the problem is in the perlapi. Could it be a bug? Any idea how I can fix this problem? Thanks a lot. -- Miguel A. Sánchez Gómez System Administrator Research Programme on Biomedical Informatics - GRIB (IMIM-UPF) Barcelona Biomedical Research Park (office 4.80) Doctor Aiguader 88 | 08003 Barcelona (Spain) Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550 e-mail: miguelangel.sanc...@upf.edu
Re: [slurm-users] Seff error with Slurm-18.08.1
Hi, and thanks for all your answers, and sorry for the delay in my reply. Yesterday I installed Slurm 18.08.3 on the controller machine to check whether the seff command works fine with this latest release. The behavior has improved but I still receive an error message:

# /usr/local/slurm-18.08.3/bin/seff 1694112
*Use of uninitialized value $lmem in numeric lt (<) at /usr/local/slurm-18.08.3/bin/seff line 130, line 624.*
Job ID: 1694112
Cluster: X
User/Group: X
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:39:33
CPU Efficiency: 4266.43% of 00:02:20 core-walltime
Job Wall-clock time: 00:01:10
Memory Utilized: 0.00 MB (estimated maximum)
Memory Efficiency: 0.00% of 3.91 GB (3.91 GB/node)
[root@hydra ~]#

And due to this problem, every job reports a memory utilization of 0.00 MB. With slurm-17.11.0 it works fine:

# /usr/local/slurm-17.11.0/bin/seff 1694112
Job ID: 1694112
Cluster: X
User/Group: X
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 01:39:33
CPU Efficiency: 4266.43% of 00:02:20 core-walltime
Job Wall-clock time: 00:01:10
Memory Utilized: 2.44 GB
Memory Efficiency: 62.57% of 3.91 GB
[root@hydra bin]#

Miguel A. Sánchez Gómez System Administrator Research Programme on Biomedical Informatics - GRIB (IMIM-UPF) Barcelona Biomedical Research Park (office 4.80) Doctor Aiguader 88 | 08003 Barcelona (Spain) Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550 e-mail: miguelangel.sanc...@upf.edu On 11/06/2018 06:30 PM, Mike Cammilleri wrote: > Thanks for this. We'll try the workaround script. It is not > mission-critical but our users have gotten accustomed to seeing these > metrics at the end of each run and it's nice to have. We are currently > doing this in a test VM environment, so by the time we actually do the > upgrade to the cluster perhaps the fix will be available then. > > Mike Cammilleri > Systems Administrator > Department of Statistics | UW-Madison > 1300 University Ave | Room 1280 > 608-263-6673 | mi...@stat.wisc.edu > > *From:* slurm-users on behalf > of Chris Samuel > *Sent:* Tuesday, November 6, 2018 5:03 AM > *To:* slurm-users@lists.schedmd.com > *Subject:* Re: [slurm-users] Seff error with Slurm-18.08.1 > > On 6/11/18 7:49 pm, Baker D.J. wrote: > > > The good news is that I am assured by SchedMD that the bug has been > fixed > > in v18.08.3. > > Looks like it's fixed in this commit. > > commit 3d85c8f9240542d9e6dfb727244e75e449430aac > Author: Danny Auble > Date: Wed Oct 24 14:10:12 2018 -0600 > > Handle symbol resolution errors in the 18.08 slurmdbd. > > Caused by b1ff43429f6426c when moving the slurmdbd agent internals. > > Bug 5882. > > > > Having said that we will probably live with this issue > > rather than disrupt users with another upgrade so soon. > > An upgrade to 18.08.3 from 18.08.1 shouldn't be disruptive though, > should it? We just flip a symlink and the users see the new binaries, > libraries, etc. immediately, and we can then restart daemons as and when we > need to (in the right order of course: slurmdbd, slurmctld and then the > slurmd's). > > All the best, > Chris > -- > Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC >
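Until seff is fixed, the same numbers can be pulled straight from the accounting records with sacct (assuming job accounting gather is enabled); for example, for the job shown above:

sacct -j 1694112 --format=JobID,JobName,Elapsed,TotalCPU,NCPUS,MaxRSS,ReqMem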
Re: [slurm-users] Seff error with Slurm-18.08.1
Oh, thanks Paddy for your patch, it works very well!! Miguel A. Sánchez Gómez System Administrator Research Programme on Biomedical Informatics - GRIB (IMIM-UPF) Barcelona Biomedical Research Park (office 4.80) Doctor Aiguader 88 | 08003 Barcelona (Spain) Phone: +34/ 93 316 0522 | Fax: +34/ 93 3160 550 e-mail: miguelangel.sanc...@upf.edu On 11/09/2018 07:59 AM, Marcus Wagner wrote: > Thanks Paddy, > > just something learned again ;) > > Best > Marcus > > On 11/08/2018 05:07 PM, Paddy Doyle wrote: >> Hi all, >> >> It looks like we can use the API to avoid having to manually parse >> the '2=' value from the stats{tres_usage_in_max} value. >> >> I've submitted a bug report and patch: >> >> https://bugs.schedmd.com/show_bug.cgi?id=6004 >> >> The minimal changes needed would be in the attached seff.patch. >> >> Hope that helps, >> >> Paddy >> >> On Thu, Nov 08, 2018 at 11:54:59AM +0100, Marcus Wagner wrote: >> >>> Hi Miguel, >>> >>> this is because SchedMD changed the stats field. There exists no more >>> rss_max, cmp. line 225 of seff. >>> You need to evaluate the field stats{tres_usage_in_max}, and there the value >>> after '2=', but this is the memory value in bytes instead of kbytes, so this >>> should be divided by 1024 additionally. >>> >>> Best >>> Marcus
[slurm-users] slurm password - what is the impact when changing it
Is there any issue if I set/change the slurm account password? I'm running 19.05.x. The current state is locked, but I have to reset it periodically:

# passwd --status slurm
slurm LK 2014-02-03 -1 -1 -1 -1 (Password locked.)

Best Regards, RB
[slurm-users] Memory per CPU
I am working on my first ever SLURM cluster build for use as a resource manager in a JupyterHub development environment. I have configured the cluster with a SelectType of 'select/cons_res' and DefMemPerCPU and MaxMemPerCPU of 16 GB. The idea is to essentially provide for jobs that run in 1 CPU/16 GB chunks. This is a starting point for us. What I am seeing is that when users submit jobs and ask for memory only - in this case, 16 GB - SLURM actually allocates 2 CPUs, not the 1 that I would expect. Is my understanding of how this particular configuration works incorrect?
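For reference, one way to make the intended one-CPU/16 GB shape explicit in the job script is to request the memory per CPU rather than per job; a sketch, not necessarily a fix for the behaviour described above:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=16G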
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
The following is the pertinent information for our cluster and the job run. Note: server names, IP addresses and user IDs are anonymized.

Slurm.conf
==
TaskPlugin=task/affinity
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
# Memory Management
DefMemPerCPU=16384
MaxMemPerCPU=16384
NodeName=linuxnode1 NodeAddr=99.999.999.999 CPUs=4 RealMemory=49152 State=UNKNOWN
NodeName=linuxnode2 NodeAddr=99.999.999.999 CPUs=4 RealMemory=49152 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Job SBATCH file
#!/bin/bash
#SBATCH --job-name=HadoopTest # Job name
#SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=** # Where to send mail
#SBATCH --mem=16gb # Job memory request
#SBATCH --time=08:00:00 # Time limit hrs:min:sec
#SBATCH --output=logs/slurm_test_%j.log # Standard output and error log
pwd; hostname; date
echo "Running sbatch-HadoopTest script"
kinit
cd /projects
python HiveValidation.py
python ImpalaValidation.py
python SparkTest.py
date

scontrol output
===
JobId=334 JobName=HadoopTest
UserId=** GroupId=** MCS_label=N/A
Priority=4294901604 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:06 TimeLimit=08:00:00 TimeMin=N/A
SubmitTime=2020-09-29T10:40:09 EligibleTime=2020-09-29T10:40:09
AccrueTime=2020-09-29T10:40:09
StartTime=2020-09-29T10:40:10 EndTime=2020-09-29T18:40:10 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-09-29T10:40:10
Partition=debug AllocNode:Sid=lpae138a:41279
ReqNodeList=(null) ExcNodeList=(null)
NodeList=linuxnode2 BatchHost=linuxnode2
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=16G,node=1,billing=2
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=2 MinMemoryNode=16G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/projects/sbatch-HadoopTest.sh
WorkDir=/projects
StdErr=/projects/logs/slurm_test_334.log
StdIn=/dev/null
StdOut=/projects/logs/slurm_test_334.log
Power=
MailUser=** MailType=BEGIN,END,FAIL,REQUEUE,STAGE_OUT

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Michael Di Domenico
Sent: Tuesday, September 29, 2020 10:20 AM
To: Slurm User Community List
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU

what leads you to believe that you're getting 2 CPU's instead of 1? 'scontrol show job ' would be a helpful first start.
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
There are a few pieces of information that may prove useful: 1 - these are VMs and not physical servers; 2 - the OS is RedHat 7.8; 3 - as far as I can tell, hyperthreading is not enabled, but I will check for sure; 4 - when we ask for 15 GB of memory, we only get 1 CPU.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Michael Di Domenico
Sent: Tuesday, September 29, 2020 10:20 AM
To: Slurm User Community List
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU

what leads you to believe that you're getting 2 CPU's instead of 1? 'scontrol show job ' would be a helpful first start.
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
First off, I want to thank everyone for their input and suggestions. They were very helpful and ultimately pointed me in the right direction. I spent several hours playing around with various settings. Some additional background: when the srun command is used to execute this job, we do not see this issue. We only see it with sbatch. What I ultimately did was the following: 1 - changed the NodeName definitions to add the specific Sockets, Cores and Threads parameters; 2 - changed the DefMemPerCPU/MaxMemPerCPU to 16144/12228 instead of 6000/12000 respectively. I tested jobs after the above changes and used the 'scontrol --defaults job ' command. The CPU allocation now works as expected. I do have one question though - what is the benefit/recommendation of using srun to execute a process within an sbatch script? We are running primarily Python jobs, but need to also support R jobs.

-Original Message-
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Diego Zuccato
Sent: Wednesday, September 30, 2020 2:18 AM
To: Slurm User Community List ; Michael Di Domenico
Subject: EXTERNAL: Re: [slurm-users] Memory per CPU

On 29/09/20 16:19, Michael Di Domenico wrote: > what leads you to believe that you're getting 2 CPU's instead of 1? I think I saw that too, once, but thought it was related to hyperthreading. -- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786
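For anyone hitting the same thing, the node definition change in step 1 looks roughly like this; the topology values are illustrative rather than the poster's actual hardware, the point being that with ThreadsPerCore=1 an allocated CPU corresponds to one core:

NodeName=linuxnode1 NodeAddr=99.999.999.999 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=49152 State=UNKNOWN
NodeName=linuxnode2 NodeAddr=99.999.999.999 CPUs=4 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=49152 State=UNKNOWN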
Re: [slurm-users] EXTERNAL: Re: Memory per CPU
So just to confirm, there is no inherent issue with using srun within an SBATCH file?

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Ryan Novosielski
Sent: Wednesday, September 30, 2020 10:01 AM
To: Slurm User Community List
Subject: Re: [slurm-users] EXTERNAL: Re: Memory per CPU

Primary one I'm aware of is that resource use is better reported (or at all, in some cases) via srun, and srun can take care of MPI for an MPI job. I'm sure there are others as well (I guess avoiding another place where you have to describe the resources to be used and making sure they match, in the case of mpirun, etc.). -- || \\UTGERS, |---*O*--- ||_// the State | Ryan Novosielski - novos...@rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\of NJ | Office of Advanced Research Computing - MSB C630, Newark `'
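To make that concrete, a minimal version of the earlier batch script with each program launched through srun, so every step is tracked and reported separately in the accounting (paths and script names as in the earlier post):

#!/bin/bash
#SBATCH --job-name=HadoopTest
#SBATCH --ntasks=1 --cpus-per-task=1
#SBATCH --mem=16gb
#SBATCH --time=08:00:00
#SBATCH --output=logs/slurm_test_%j.log
cd /projects
srun python HiveValidation.py    # each srun launches an accounted job step
srun python ImpalaValidation.py
srun python SparkTest.py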
Re: [slurm-users] Quick hold on all partitions, all jobs
In your situation, where you're blocking user access to the login node, it probably doesn't matter. We use DOWN in most events, as INACTIVE would prevent new jobs from being queued against the partition at all. DOWN allows the jobs to be queued, and just doesn't permit them to run. (In either case, HOLDing PENDING jobs is redundant.) ~jonathon From: slurm-users on behalf of Lachlan Musicman Sent: Wednesday, November 8, 2017 5:00:12 PM To: Slurm User Community List Subject: [slurm-users] Quick hold on all partitions, all jobs The IT team sent an email saying "complete network wide network outage tomorrow night from 10pm across the whole institute". Our plan is to put all queued jobs on hold, suspend all running jobs, and turn off the login node. I've just discovered that the partitions have a state, and it can be set to UP, DOWN, DRAIN or INACTIVE. In this situation - most likely a 4-hour outage with nothing else affected - would you mark your partitions DOWN or INACTIVE? Ostensibly all users should be off the systems (because no network), but there's always one that sets an at or cron job or finds that corner case. Cheers L. -- "The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consciousness that our institutions have failed and our ecosystem is collapsing, yet we are still here — and we are creative agents who can shape our destinies. Apocalyptic civics is the conviction that the only way out is through, and the only way through is together." Greg Bloom @greggish https://twitter.com/greggish/status/873177525903609857
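For the record, the mechanics are just scontrol updates; 'prod' is a placeholder partition name, and the loops assume every running job really should be suspended:

scontrol update PartitionName=prod State=DOWN        # queued jobs stay queued, nothing new starts
squeue -h -t RUNNING -o %A | xargs -r -n1 scontrol suspend
# ...after the outage...
squeue -h -t SUSPENDED -o %A | xargs -r -n1 scontrol resume
scontrol update PartitionName=prod State=UP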
[slurm-users] slurm.spec-legacy - how to invoke
Can someone provide an example of using the rpmbuild command while specifying the slurm.spec-legacy file? I need to build the new version of slurm for RHEL6 and need to invoke the slurm.spec-legacy file (if possible) on this command line: # rpmbuild -tb slurm-17.11.1.tar.bz2 Regards, Ruth R. Braun Sr IT Analyst HPC
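As far as I know, rpmbuild -tb always uses the slurm.spec bundled in the tarball, so one workaround is to build from the legacy spec file directly; a sketch, with the paths and the in-tarball location of the spec assumed rather than verified:

# put the tarball where rpmbuild expects sources, pull out just the legacy spec, build from it
cp slurm-17.11.1.tar.bz2 ~/rpmbuild/SOURCES/
tar -xjf slurm-17.11.1.tar.bz2 slurm-17.11.1/slurm.spec-legacy
rpmbuild -bb slurm-17.11.1/slurm.spec-legacy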
Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation
We put SSSD caches on a RAMDISK which helped a little bit with performance. - On 22 Jan, 2018, at 02:38, Alessandro Federico a.feder...@cineca.it wrote: | Hi John, | | just an update... | we not have a solution for the SSSD issue yet, but we changed the ACL | on the 2 partitions from AllowGroups=g2 to AllowAccounts=g2 and the | slowdown has gone. | | Thanks for the help | ale | | - Original Message - |> From: "Alessandro Federico" |> To: "John DeSantis" |> Cc: hpc-sysmgt-i...@cineca.it, "Slurm User Community List" |> , "Isabella Baccarelli" |> |> Sent: Wednesday, January 17, 2018 5:41:54 PM |> Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv |> operation |> |> Hi John |> |> thanks for the infos. |> We are investigating the slowdown of sssd and I found some bug |> reports regarding slow sssd query |> with almost the same backtrace. Hopefully an update of sssd could |> solve this issue. |> |> We'll let you know if we found a solution. |> |> thanks |> ale |> |> - Original Message - |> > From: "John DeSantis" |> > To: "Alessandro Federico" |> > Cc: "Slurm User Community List" , |> > "Isabella Baccarelli" , |> > hpc-sysmgt-i...@cineca.it |> > Sent: Wednesday, January 17, 2018 3:30:43 PM |> > Subject: Re: [slurm-users] slurm 17.11.2: Socket timed out on |> > send/recv operation |> > |> > Ale, |> > |> > > As Matthieu said it seems something related to SSS daemon. |> > |> > That was a great catch by Matthieu. |> > |> > > Moreover, only 3 SLURM partitions have the AllowGroups ACL |> > |> > Correct, which may seem negligent, but after each `scontrol |> > reconfigure`, slurmctld restart, and/or AllowGroups= partition |> > update, |> > the mapping of UID's for each group will be updated. |> > |> > > So why does the UID-GID mapping take so long? |> > |> > We attempted to use "AllowGroups" previously, but we found (even |> > with |> > sssd.conf tuning regarding enumeration) that unless the group was |> > local |> > (/etc/group), we were experiencing delays before the AllowGroups |> > parameter was respected. This is why we opted to use SLURM's |> > AllowQOS/AllowAccounts instead. |> > |> > You would have to enable debugging on your remote authentication |> > software to see where the hang-up is occurring (if it is that at |> > all, |> > and not just a delay with the slurmctld). |> > |> > Given the direction that this is going - why not replace the |> > "AllowGroups" with either a simple "AllowAccounts=" or "AllowQOS="? |> > |> > > @John: we defined many partitions on the same nodes but in the |> > > production cluster they will be more or less split across the 6K |> > > nodes. |> > |> > Ok, that makes sense. Looking initially at your partition |> > definitions, |> > I immediately thought of being DRY, especially since the "finer" |> > tuning |> > between the partitions could easily be controlled via the QOS' |> > allowed |> > to access the resources. 
|> > |> > John DeSantis |> > |> > On Wed, 17 Jan 2018 13:20:49 +0100 |> > Alessandro Federico wrote: |> > |> > > Hi Matthieu & John |> > > |> > > this is the backtrace of slurmctld during the slowdown |> > > |> > > (gdb) bt |> > > #0 0x7fb0e8b1e69d in poll () from /lib64/libc.so.6 |> > > #1 0x7fb0e8617bfa in sss_cli_make_request_nochecks () |> > > from /lib64/libnss_sss.so.2 #2 0x7fb0e86185a3 in |> > > sss_nss_make_request () from /lib64/libnss_sss.so.2 #3 |> > > 0x7fb0e8619104 in _nss_sss_getpwnam_r () |> > > from /lib64/libnss_sss.so.2 #4 0x7fb0e8aef07d in |> > > getpwnam_r@@GLIBC_2.2.5 () from /lib64/libc.so.6 #5 |> > > 0x7fb0e9360986 in _getpwnam_r (result=, |> > > bufsiz=, buf=, pwd=, |> > > name=) at uid.c:73 #6 uid_from_string |> > > (name=0x1820e41 |> > > "g2bottin", uidp=uidp@entry=0x7fff07f03a6c) at uid.c:111 #7 |> > > 0x0043587d in get_group_members (group_name=0x10ac500 |> > > "g2") |> > > at groups.c:139 #8 0x0047525a in _get_groups_members |> > > (group_names=) at partition_mgr.c:2006 #9 |> > > 0x00475505 in _update_part_uid_access_list |> > > (x=0x7fb03401e650, |> > > arg=0x7fff07f13bf4) at partition_mgr.c:1930 #10 |>
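For reference, the change ale describes is a one-line edit per partition in slurm.conf followed by a reconfigure; the partition, node, and account names here are illustrative:

PartitionName=p2 Nodes=node[001-100] AllowGroups=g2      # before
PartitionName=p2 Nodes=node[001-100] AllowAccounts=g2    # after
scontrol reconfigure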
[slurm-users] Areas for improvement on our site's cluster scheduling
We have two main issues with our scheduling policy right now. The first is an issue that we call "queue stuffing." The second is an issue with interactive job availability. We aren't confused about why these issues exist, but we aren't sure the best way to address them. I'd love to hear any suggestions on how other sites address these issues. Thanks for any advice! ## Queue stuffing We use multifactor scheduling to provide account-based fairshare scheduling as well as standard fifo-style job aging. In general, this works pretty well, and accounts meet their scheduling targets; however, every now and again, we have a user who has a relatively high-throughput (not HPC) workload that they're willing to wait a significant period of time for. They're low-priority work, but they put a few thousand jobs into the queue, and just sit and wait. Eventually the job aging makes the jobs so high-priority, compared to the fairshare, that they all _as a set_ become higher-priority than the rest of the work on the cluster. Since they continue to age as the other jobs continue to age, these jobs end up monopolizing the cluster for days at a time, as their high volume of relatively small jobs use up a greater and greater percentage of the machine. In Moab I'd address this by limiting the number of jobs the user could have *eligible* at any given time; but it appears that the only option for slurm is limiting the number of jobs a user can *submit*, which isn't as nice a user experience and can lead to some pathological user behaviors (like users running cron jobs that wake repeatedly and submit more jobs automatically). ## Interactive job availability I'm becoming increasingly convinced that holding some portion of our resource aside as dedicated for relatively short, small, interactive jobs is a unique good; but I'm not sure how best to implement it. My immediate thought was to use a reservation with the DAILY and REPLACE flags. I particularly like the idea of using the REPLACE flag here as we could keep a flexible amount of resources available irrespective of how much was actually being used for the purpose at any given time; but it doesn't appear that there's any way to limit the per-user use of resources *within* a reservation; so if we created such a reservation and granted all users access to it, any individual user would be capable of consuming all resources in the reservation anyway. I'd have a dedicated "interactive" qos or similar to put such restrictions on; but there doesn't appear to be a way to then limit the use of the reservation to only jobs with that qos. (Aside from job_submit scripts or similar. Please correct me if I'm wrong.) In lieu of that, I'm leaning towards having a dedicated interactive partition that we'd manually move some resources to; but that's a bit less flexible.
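Two sketches that map onto these two problems, assuming accounting limits are enforced; all names and numbers are illustrative. MaxJobsAccruePerUser (available in recent Slurm releases) caps how many of a user's pending jobs accrue age priority, which is close to Moab's notion of an eligible-job limit, and a floating reservation with the REPLACE flag keeps a pool of nodes free for interactive work:

# queue stuffing: only this many pending jobs per user accumulate age priority
sacctmgr modify qos normal set MaxJobsAccruePerUser=8
# interactive carve-out: a reservation that Slurm keeps topped up with idle nodes
scontrol create reservation ReservationName=interactive StartTime=now Duration=UNLIMITED Flags=REPLACE NodeCnt=4 Users=alice,bob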
Re: [slurm-users] Getting nodes in a partition
Does sinfo -h -O nodehost -p partition | sort help? Also scontrol show hostnames <nodelist>, where nodelist is compute-0-[4-6], would work. Regards Henk -Original Message- From: slurm-users On Behalf Of Mahmood Naderan Sent: 18 May 2018 08:12 To: Slurm User Community List Subject: [slurm-users] Getting nodes in a partition Hi, Is there any slurm variable to read the node names of a partition? There is an MPI option, --hostfile, to which we can write the node names. I want to use something like this in the sbatch script: #SBATCH --partition=MYPART ... --hostfile $SLURM_NODES_IN_PARTITION I can manually manipulate the output of scontrol to extract node names. Like this: [mahmood@rocks7 ~]$ scontrol show partition MYPART | grep -w Nodes | cut -d '=' -f 2 compute-0-[4-6] But that output is the compressed form, not compute-0-4 compute-0-5 compute-0-6, so I have to post-process it further. Any better idea? Regards, Mahmood
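Putting Henk's two suggestions into a script context (the partition name comes from the post, the file name is illustrative):

# one hostname per line for every node in the partition
sinfo -h -N -o '%N' -p MYPART | sort -u > hostfile.txt
# or expand a compressed hostlist expression
scontrol show hostnames "compute-0-[4-6]"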
[slurm-users] allocate more resources for a current interactive job
Dear Slurm users, Is it possible to allocate more resources for a current job on an interactive shell? I just allocate (by default) 1 core and 2Gb RAM: srun -I -p main --pty /bin/bash The node and queue where the job is located has 120 Gb and 4 cores available. I just want to use more cores and more RAM for such shell session without having to terminate it. Thanks in advance
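Growing an allocation that is already running is not straightforward; if starting a fresh shell is acceptable, the simplest route is to request the resources up front (values illustrative):

srun -I -p main --cpus-per-task=4 --mem=32G --pty /bin/bash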
[slurm-users] Slurm Environment Variable for Memory
Dear Community, does anyone know whether there is an environment variable, such as $SLURM_CPUS_ON_NODE, but for the requested RAM (by using --mem argument)? Thanks
Re: [slurm-users] Slurm Environment Variable for Memory
Somehow that variable does not exist in my environment. Is it possible my Slurm version (17.02.3) does not include it? Thanks On 17/08/18 11:04, Bjørn-Helge Mevik wrote: Yes. It is documented in sbatch(1): SLURM_MEM_PER_CPU (same as --mem-per-cpu) and SLURM_MEM_PER_NODE (same as --mem).
Re: [slurm-users] Slurm Environment Variable for Memory
I am just running an interactive job with "srun -I --pty /bin/bash" and then running "echo $SLURM_MEM_PER_NODE", but it shows nothing. Does it have to be defined in any conf file? On 20/08/18 09:59, Chris Samuel wrote: On Monday, 20 August 2018 4:43:57 PM AEST Juan A. Cordero Varelaq wrote: Somehow that variable does not exist in my environment. Is it possible my Slurm version (17.02.3) does not include it? They should be there, from the NEWS file they were introduced in 2.3.0.rc1. Is something else nuking your shell's environment perhaps? 17.02.11 is the last released version of 17.02.x and all previous versions have been pulled from the SchedMD website due to CVE-2018-10995. cheers, Chris
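One thing worth ruling out (an assumption on my part, not something confirmed in this thread): the variable mirrors the job's memory request, so it may only be set when memory is requested explicitly, e.g.:

srun --mem=4G --pty /bin/bash
echo $SLURM_MEM_PER_NODE   # if the variable is set, this should print the per-node memory in MB (e.g. 4096)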
[slurm-users] Installing GPU Features of Slurm 20
Hi! I'm trying to install Slurm 20.02 on my cluster with the GPU features. However, only my compute nodes have GPUs attached and so when I try to install the slurm-slurmctld RPM on my head node it fails saying it requires the NVIDIA control software. How do other folks work around this? Do you have one RPM built for the control node and one for the compute nodes? Do you just force the install on the control node? Thanks!
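One common pattern, sketched here rather than prescribed: build two RPM sets, one with NVML for the GPU nodes and one without for the controller, or simply skip the dependency check on the head node. The --with nvml conditional is an assumption about your slurm.spec, so check the spec you are actually building from:

# on a build host that has the NVIDIA libraries (RPMs for the compute nodes)
rpmbuild -ta slurm-20.02.*.tar.bz2 --with nvml
# a second set without NVML for the slurmctld/login hosts
rpmbuild -ta slurm-20.02.*.tar.bz2
# or, quick and dirty, install the controller package ignoring the NVIDIA dependency
rpm -ivh --nodeps slurm-slurmctld-20.02.*.rpm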
[slurm-users] slurmdbd does not work
Hi everyone, I am having trouble getting slurmdbd to work. This is the error I get:

error: Couldn't find the specified plugin name for accounting_storage/mysql looking at all files
error: cannot find accounting_storage plugin for accounting_storage/mysql
error: cannot create accounting_storage context for accounting_storage/mysql
fatal: Unable to initialize accounting_storage/mysql accounting storage plugin

I have installed mysql (apt install mysql) on Ubuntu 20.04.03 and followed the instructions on the slurm website (https://slurm.schedmd.com/accounting.html); mysql is running (port 3306) and these are the relevant parts in my .conf files:

slurm.conf
# LOGGING AND ACCOUNTING
AccountingStorageHost=localhost
AccountingStoragePort=3306
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log

slurmdbd.conf
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
DbdPort=3306
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
PluginDir=/usr/lib/slurm
SlurmUser=slurm
StoragePass=password
StorageType=accounting_storage/mysql
StorageUser=slurm
StorageLoc=slurm_acct_db

I changed the port to 3306 because otherwise slurmdbd could not communicate with mysql. If I run sacct, for example, I get:

sacct: error: _slurm_persist_recv_msg: read of fd 3 failed: No error
sacct: error: _slurm_persist_recv_msg: only read 126 of 2616 bytes
sacct: error: slurm_persist_conn_open: No response to persist_init
sacct: error: Sending PersistInit msg: No error
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
sacct: error: _slurm_persist_recv_msg: read of fd 3 failed: No error
sacct: error: _slurm_persist_recv_msg: only read 126 of 2616 bytes
sacct: error: Sending PersistInit msg: No error
sacct: error: DBD_GET_JOBS_COND failure: Unspecified error

Does anyone have a suggestion to solve this problem? Thank you very much. Best, Giuseppe
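One detail worth double-checking, offered as an observation rather than a confirmed fix for this thread: DbdPort is the port slurmdbd itself listens on (default 6819), while StoragePort is where MySQL/MariaDB listens, and slurm.conf's AccountingStoragePort should point at slurmdbd, not at the database. A sketch of the usual split:

# slurmdbd.conf
DbdHost=localhost
DbdPort=6819                 # slurmdbd's own listening port
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306             # MySQL/MariaDB port
StorageUser=slurm
StoragePass=password
StorageLoc=slurm_acct_db
# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePort=6819   # must match DbdPort, not the MySQL port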
Re: [slurm-users] slurmdbd does not work
Thanks for the answer, Brian. I now added --with-mysql_config=/etc/mysql/my.cnf, but the problem is still there and now also slurmctld does not work, with the error: [2021-12-03T15:36:41.018] accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd [2021-12-03T15:36:41.019] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.019] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.019] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.020] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.020] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.020] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.020] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.020] error: DBD_GET_TRES failure: No error [2021-12-03T15:36:41.021] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.021] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.021] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.021] error: DBD_GET_QOS failure: No error [2021-12-03T15:36:41.021] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.021] error: _slurm_persist_recv_msg: only read 150 of 2613 bytes [2021-12-03T15:36:41.021] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.021] error: DBD_GET_USERS failure: No error [2021-12-03T15:36:41.022] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.022] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.022] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.022] error: DBD_GET_ASSOCS failure: No error [2021-12-03T15:36:41.022] error: _conn_readable: persistent connection for fd 9 experienced error[104]: Connection reset by peer [2021-12-03T15:36:41.022] error: _slurm_persist_recv_msg: only read 0 of 2613 bytes [2021-12-03T15:36:41.022] error: Sending PersistInit msg: No error [2021-12-03T15:36:41.022] error: DBD_GET_RES failure: No error [2021-12-03T15:36:41.022] fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files. On Thu, Dec 2, 2021 at 10:36 PM Brian Andrus wrote: > > Your slurm needs built with the support. If you have mysql-devel installed > it should pick it up, otherwise you can specify the location with > --with-mysql when you configure/build slurm > > Brian Andrus > On 12/2/2021 12:40 PM, Giuseppe G. A. Celano wrote: > > Hi everyone, > > I am having trouble getting *slurmdbd* to work. 
Re: [slurm-users] slurmdbd does not work
The problem is the lack of /usr/lib/slurm/accounting_storage_mysql.so I have installed many mariadb-related packages, but that file is not created by slurm after installation: is there a point in the documentation where the installation procedure for the database is made explicit? On Fri, Dec 3, 2021 at 5:15 PM Brian Andrus wrote: > You will need to also reinstall/restart slurmdbd with the updated binary. > > Look in the slurmdbd logs to see what is happening there. I suspect it had > errors updating/creating the database and tables. If you have no data in it > yet, you can just DROP the database and restart slurmdbd. > > Brian Andrus
Re: [slurm-users] [EXT] Re: slurmdbd does not work
After installation of libmariadb-dev, I have reinstalled the entire slurm with ./configure + options, make, and make install. Still, accounting_storage_mysql.so is missing. On Sat, Dec 4, 2021 at 12:24 AM Sean Crosby wrote: > Did you run > > ./configure (with any other options you normally use) > make > make install > > on your DBD server after you installed the mariadb-devel package?
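A quick way to confirm whether ./configure actually picked up the MySQL/MariaDB client library before building, sketched with paths relative to the source tree (the exact configure output wording may differ):

grep -i mysql config.log | head
# if detection worked, the plugin is built here before "make install" copies it to <prefix>/lib/slurm/
ls src/plugins/accounting_storage/mysql/.libs/accounting_storage_mysql.so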
Re: [slurm-users] [EXT] Re: slurmdbd does not work
10.4.22 On Sat, Dec 4, 2021 at 1:35 AM Brian Andrus wrote: > Which version of Mariadb are you using? > > Brian Andrus
Re: [slurm-users] [EXT] Re: slurmdbd does not work
I have installed almost all of the possible packages, but that file doesn't show up: libdbd-mariadb-perl/focal,now 1.11-3ubuntu2 amd64 [installed] libmariadb-dev-compat/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadb-dev/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadb3-compat/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadb3/unknown,now 1:10.4.22+maria~focal amd64 [installed,automatic] libmariadbclient18/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadbd-dev/unknown,now 1:10.4.22+maria~focal amd64 [installed] libmariadbd19/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-client-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed,automatic] mariadb-client-core-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-client/unknown,unknown,unknown,now 1:10.4.22+maria~focal all [installed] mariadb-common/unknown,unknown,unknown,now 1:10.4.22+maria~focal all [installed] mariadb-plugin-connect/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-server-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-server-core-10.4/unknown,now 1:10.4.22+maria~focal amd64 [installed] mariadb-server/unknown,unknown,unknown,now 1:10.4.22+maria~focal all [installed] odbc-mariadb/focal,now 3.1.4-1 amd64 [installed] On Sat, Dec 4, 2021 at 2:06 AM Sean Crosby wrote: > Try installing the libmariadb-dev-compat package and trying the > configure/make again. It provides "libmysqlclient.so", whereas > libmariadb-dev provides "libmariadb.so"
Re: [slurm-users] [EXT] Re: slurmdbd does not work
Hi Gennaro, That helped: slurm-wlm has accounting_storage_mysql.so, and I moved it to the location requested by the first slurm installation. Everything seems to work, even if I had to change the location of the .conf files, probably because this is required by the new slurm-wlm installation. I am not sure whether I should try to uninstall my previous installation and reinstall slurm-wlm... On Sat, Dec 4, 2021 at 12:38 PM Gennaro Oliva wrote: > Ciao Giuseppe, > > On Sat, Dec 04, 2021 at 02:30:40AM +0100, Giuseppe G. A. Celano wrote: > > I have installed almost all of the possible packages, but that file > doesn't > > show up: > > can you please specify what options are you using with ./configure? > > If you don't specify any prefix (--prefix option), the default location > for your installation is /usr/local, so you should find the plugins under > /usr/local/lib/slurm > > Did you tried the slurm-wlm package shipped with ubuntu? > It comes with the mysql plugin. > Best regards > -- > Gennaro Oliva > >
Re: [slurm-users] [EXT] Re: slurmdbd does not work
Hi, I have reinstalled slurm using the ubuntu package slurm-wlm (and some related ones). After solving some problems with the directories where the pid files are stored (I kept getting the message "Can't open PID file /run/slurm/slurmd.pid (yet?) after start: Operation not permitted", even though the directory has slurm as owner and group), the services slurmdbd, slurmctld, and slurmd now work, but I cannot use the commands sinfo, srun, etc., because I get the errors: sinfo: symbol lookup error: sinfo: undefined symbol: slurm_conf srun: symbol lookup error: srun: undefined symbol: xfree_ptr sacct: symbol lookup error: sacct: undefined symbol: slurm_destroy_selected_step Does anyone know the reason for that? Thanks. Best, Giuseppe
Re: [slurm-users] [EXT] Re: slurmdbd does not work
Grazie Gennaro, It's working! On Mon, Dec 6, 2021 at 9:41 AM Gennaro Oliva wrote: > Ciao Giuseppe, > > On Mon, Dec 06, 2021 at 03:46:02AM +0100, Giuseppe G. A. Celano wrote: > > sinfo: symbol lookup error: sinfo: undefined symbol: slurm_conf > > srun: symbol lookup error: srun: undefined symbol: xfree_ptr > > sacct: symbol lookup error: sacct: undefined symbol: > > slurm_destroy_selected_step > > > > Does anyone know the reason for that? Thanks. > > please check that you are using the client tools from the slurm package > and not those coming from the source installation. The command: > > which srun > > should return /usr/bin/srun and not /usr/local/bin/srun > > In the latter case remove everyting related to slurm under /usr/local > > /usr/local/share/doc/slurm* > /usr/local/sbin/slurm* > /usr/local/lib/libslurm* > /usr/local/lib/slurm > /usr/local/include/slurm > > /usr/local/bin/scancel > /usr/local/bin/sprio > /usr/local/bin/sdiag > /usr/local/bin/srun > /usr/local/bin/squeue > /usr/local/bin/sbcast > /usr/local/bin/sview > /usr/local/bin/salloc > /usr/local/bin/scontrol > /usr/local/bin/sreport > /usr/local/bin/sbatch > /usr/local/bin/strigger > /usr/local/bin/sacctmgr > /usr/local/bin/sacct > /usr/local/bin/sattach > /usr/local/bin/scrontab > /usr/local/bin/sh5util > /usr/local/bin/sstat > /usr/local/bin/sinfo > /usr/local/bin/sshare > > Look also for files under: > > /usr/local/share/man/ > > Best regards, > -- > Gennaro Oliva > >
[slurm-users] GPU configuration
Hi, My cluster has 2 nodes, with the first having 2 gpus and the second 1 gpu. The states of both nodes are "drained" because "gres/gpu count reported lower than configured": any idea why this happens? Thanks. My .conf files are:

slurm.conf
AccountingStorageTRES=gres/gpu
GresTypes=gpu
NodeName=technician Gres=gpu:2 CPUs=28 RealMemory=128503 Boards=1 SocketsPerBoard=1 CoresPerSocket=14 ThreadsPerCore=2 State=UNKNOWN
NodeName=worker0 Gres=gpu:1 CPUs=12 RealMemory=15922 Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2 State=UNKNOWN
PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP

gres.conf
NodeName=technician Name=gpu File=/dev/nvidia[0-1]
NodeName=worker0 Name=gpu File=/dev/nvidia0

Best, Giuseppe
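That drain reason normally means slurmd on the node detected fewer GPUs than slurm.conf declares; a few checks, as a sketch to be run on the nodes as root:

# do the devices gres.conf points at actually exist, and do they match the Gres= counts?
ls -l /dev/nvidia[0-9]*
# run slurmd in the foreground briefly and watch what it reports for gres (Ctrl-C to stop)
slurmd -D -vvv 2>&1 | grep -i gres
# compare configured vs. reported gres and the drain reason, then clear the drain once counts agree
scontrol show node technician | grep -i -e gres -e reason
scontrol update NodeName=technician,worker0 State=RESUME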
[slurm-users] Can't start slurmdbd
Hi, Slurm 17.02.3 was installed on my cluster some time ago but recently I decided to use SlurmDBD for the accounting. After installing several packages (slurm-devel, slurm-munge, slurm-perlapi, slurm-plugins, slurm-slurmdbd and slurm-sql) and MariaDB in CentOS 7, I created an SQL database: mysql> grant all on slurm_acct_db.* TO 'slurm'@'localhost' -> identified by 'some_pass' with grant option; mysql> create database slurm_acct_db; and configured the slurmdbd.conf file: AuthType=auth/munge DbdAddr=localhost DbdHost=localhost SlurmUser=slurm DebugLevel=4 LogFile=/var/log/slurm/slurmdbd.log PidFile=/var/run/slurmdbd.pid StorageType=accounting_storage/mysql StorageHost=localhost StoragePass=some_pass StorageUser=slurm StorageLoc=slurm_acct_db Then, I stopped the slurmctl daemon on the head node of my cluster and tried to start `slurmdbd`, but I got the following: $ systemctl start slurmdbd Job for slurmdbd.service failed because the control process exited with error code. See "systemctl status slurmdbd.service" and "journalctl -xe" for details. $ systemctl status slurmdbd.service ● slurmdbd.service - Slurm DBD accounting daemon Loaded: loaded (/etc/systemd/system/slurmdbd.service; enabled; vendor preset: disabled) Active: failed (Result: exit-code) since lun 2017-11-20 10:39:26 CET; 53s ago Process: 27592 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=1/FAILURE) nov 20 10:39:26 login_node systemd[1]: Starting Slurm DBD accounting daemon... nov 20 10:39:26 login_node systemd[1]: slurmdbd.service: control process exited, code=exited status=1 nov 20 10:39:26 login_node systemd[1]: Failed to start Slurm DBD accounting daemon. nov 20 10:39:26 login_node systemd[1]: Unit slurmdbd.service entered failed state. nov 20 10:39:26 login_node systemd[1]: slurmdbd.service failed. $ journalctl -xe nov 20 10:39:26 login_node polkitd[1078]: Registered Authentication Agent for unix-process:27586:119889015 (system bus name :1.871 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object path /or nov 20 10:39:26 login_node systemd[1]: Starting Slurm DBD accounting daemon... -- Subject: Unit slurmdbd.service has begun start-up -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit slurmdbd.service has begun starting up. nov 20 10:39:26 login_node systemd[1]: slurmdbd.service: control process exited, code=exited status=1 nov 20 10:39:26 login_node systemd[1]: Failed to start Slurm DBD accounting daemon. -- Subject: Unit slurmdbd.service has failed -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit slurmdbd.service has failed. -- -- The result is failed. nov 20 10:39:26 login_node systemd[1]: Unit slurmdbd.service entered failed state. nov 20 10:39:26 login_node systemd[1]: slurmdbd.service failed. 
nov 20 10:39:26 login_node polkitd[1078]: Unregistered Authentication Agent for unix-process:27586:119889015 (system bus name :1.871, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, nov 20 10:40:06 login_node gmetad[1519]: data_thread() for [HPCSIE] failed to contact node 192.168.2.10 nov 20 10:40:06 login_node gmetad[1519]: data_thread() got no answer from any [HPCSIE] datasource nov 20 10:40:13 login_node dhcpd[2320]: DHCPREQUEST for 192.168.2.19 from XX:XX:XX:XX:XX:XX via enp6s0f1 nov 20 10:40:13 login_node dhcpd[2320]: DHCPACK on 192.168.2.19 to XX:XX:XX:XX:XX:XX via enp6s0f1 nov 20 10:40:39 login_node dhcpd[2320]: DHCPREQUEST for 192.168.2.13 from XX:XX:XX:XX:XX:XX via enp6s0f1 nov 20 10:40:39 login_node dhcpd[2320]: DHCPACK on 192.168.2.13 to XX:XX:XX:XX:XX:XX via enp6s0f1 I've just found out the file `/var/run/slurmdbd.pid` does not even exist. I'd appreciate any hint on this issue. Thanks
Re: [slurm-users] Can't start slurmdbd
I did that but got the same errors. slurmdbd.log contains by the way the following: [2017-11-20T12:39:04.178] error: Couldn't find the specified plugin name for accounting_storage/mysql looking at all files [2017-11-20T12:39:04.179] error: cannot find accounting_storage plugin for accounting_storage/mysql [2017-11-20T12:39:04.179] error: cannot create accounting_storage context for accounting_storage/mysql [2017-11-20T12:39:04.179] fatal: Unable to initialize accounting_storage/mysql accounting storage plugin It seems it lacks the accounting_storage_mysql.so: $ ls /usr/lib64/slurm/accounting_storage_* /usr/lib64/slurm/accounting_storage_filetxt.so /usr/lib64/slurm/accounting_storage_none.so /usr/lib64/slurm/accounting_storage_slurmdbd.so However, I did install the slurm-sql rpm package. Any idea about what's failing? Thanks On 20/11/17 12:11, Lachlan Musicman wrote: On 20 November 2017 at 20:50, Juan A. Cordero Varelaq wrote: I've just found out the file `/var/run/slurmdbd.pid` does not even exist. The pid file is the "process id" - it's only there if the process is running. So when slurmdbd is not running, it wont be there. Supposedly. Sometimes I do "touch /var/run/slurmdbd.pid" and try again? I've also found that using the host's short name is preferable to localhost. Make sure the host's short name is in /etc/hosts too. hostname -s will give you the short name Cheers L.
Re: [slurm-users] Can't start slurmdbd
I guess mariadb-devel was not installed by the time another person installed slurm. I have a bunch of slurm-* rpms I installed using "yum localinstall ...". Should I install them in another way or remove slurm? The file accounting_storage_mysql.so is by the way absent on the machine. Thanks On 20/11/17 21:52, Lachlan Musicman wrote: Also - make sure you have MariaDB-devel when you make the RPMs - that's the first bit. The second bit is you might have to find the accounting_storage_mysql.so and place it in /usr/lib64/slurm. I think it might end up in /path/to/rpmbuild/BUILD/sec/plugins/accounting/.libs/ or something like that Cheers L. On 21 November 2017 at 06:35, Philip Kovacs wrote: Try adding this to your conf: PluginDir=/usr/lib64/slurm On Monday, November 20, 2017 6:48 AM, Juan A. Cordero Varelaq wrote: I did that but got the same errors.
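For completeness, a hedged sketch of the rebuild route Lachlan describes; the version and package file names are illustrative and vary with the spec in use:

# on the build host (CentOS 7 here), make sure the MariaDB headers are present before building
yum install mariadb-devel
rpmbuild -ta slurm-17.02.11.tar.bz2
# the mysql accounting plugin should now be packaged; verify before installing
rpm -qlp ~/rpmbuild/RPMS/x86_64/slurm-*.rpm | grep accounting_storage_mysql.so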
[slurm-users] which daemons should I restart when editing slurm.conf
Hi, I have the following configuration: * head node: hosts the slurmctld and the slurmdbd daemons. * compute nodes (4): host the slurmd daemons. I need to change a couple of lines of the slurm.conf corresponding to the slurmctld. If I restart its service, should I also have to restart the slurmdbd on the head node and the slurmd daemons on compute nodes? Thanks
[slurm-users] Changing resource limits while running jobs
Hi, A couple of jobs have been running for almost one month and I would like to change resource limits to prevent users from running for so long. Besides, I'd like to set AccountingStorageEnforce to qos,safe. If I make such changes, would the running jobs be stopped (the user running the jobs still has no account and therefore should not be allowed to run anything if AccountingStorageEnforce is set)? Thanks
Re: [slurm-users] Changing resource limits while running jobs
And could I restart the slurmctld daemon without affecting such running jobs? On 04/01/18 15:56, Paul Edmon wrote: Typically changes like this only impact pending or newly submitted jobs. Running jobs usually are not impacted, though they will count against any new restrictions that you put in place. -Paul Edmon- On 1/4/2018 6:44 AM, Juan A. Cordero Varelaq wrote: Hi, A couple of jobs have been running for almost one month and I would like to change resource limits to prevent users from running so much time. Besides, I'd like to set AccountingStorageEnforce to qos,safe. If I make such changes would the running jobs be stopped (the user running the jobs has still no account and therefore, should not be allowed to run anything if AccountingStorageEnforce is set)? Thanks* *
[slurm-users] restrict application to a given partition
Dear Community, I have a node (20 cores) on my HPC with two different partitions: big (16 cores) and small (4 cores). I have installed software X on this node, but I want only one partition to have rights to run it. Is it then possible to restrict the execution of a specific application to a given partition on a given node? Thanks
Re: [slurm-users] restrict application to a given partition
But what if the user knows the path to such application (let's say python command) and executes it on the partition he/she should not be allowed to? Is it possible through lua scripts to set constrains on software usage such as a limited shell, for instance? In fact, what I'd like to implement is something like a limited shell, on a particular node for a particular partition and a particular program. On 12/01/18 17:39, Paul Edmon wrote: You could do this using a job_submit.lua script that inspects for that application and routes them properly. -Paul Edmon- On 01/12/2018 11:31 AM, Juan A. Cordero Varelaq wrote: Dear Community, I have a node (20 Cores) on my HPC with two different partitions: big (16 cores) and small (4 cores). I have installed software X on this node, but I want only one partition to have rights to run it. Is it then possible to restrict the execution of an specific application to a given partition on a given node? Thanks
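A minimal job_submit.lua sketch along the lines Paul suggests; "softwareX" and the partition name are placeholders, and only batch scripts can be inspected this way, so treat it as a routing aid rather than a watertight control:

-- job_submit.lua, placed next to slurm.conf (requires JobSubmitPlugins=lua)
function slurm_job_submit(job_desc, part_list, submit_uid)
    if job_desc.script ~= nil and string.find(job_desc.script, "softwareX", 1, true) then
        -- route anything that calls softwareX to the partition where it is allowed
        job_desc.partition = "small"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end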
Re: [slurm-users] restrict application to a given partition
I ended up with a simpler solution: I tweaked the program executable (a bash script) so that it inspects which partition it is running on, and if it is the wrong one, it exits. I just added the following lines:

if [ "$SLURM_JOB_PARTITION" == 'big' ]; then
    exit_code=126
    /bin/echo "PROGRAM failed with exit code $exit_code. PROGRAM was executed on a wrong SLURM partition."
    exit $exit_code
fi

On 15/01/18 16:03, Paul Edmon wrote: This sounds like a use case for Singularity. http://singularity.lbl.gov/ You could use the Lua script to restrict what is permitted to run by barring anything that isn't a specific Singularity script. Otherwise you could use prolog scripts as an emergency fallback in case the lua script doesn't catch it. -Paul Edmon- On 1/15/2018 8:31 AM, John Hearns wrote: Juan, my knee-jerk reaction is to say 'containerisation' here. However, I guess that means that Slurm would have to be able to inspect the contents of a container, and I do not think that is possible. I may be very wrong here. Anyone? However, have a look at the XALT stuff from TACC https://www.tacc.utexas.edu/research-development/tacc-projects/xalt https://github.com/Fahey-McLay/xalt XALT is intended to instrument your cluster and collect information on what software is being run and exactly which libraries are being used. I do not think it has any option for "Nope! You may not run this executable on this partition", but it might be worth contacting the authors and discussing this. On 15 January 2018 at 14:20, Juan A. Cordero Varelaq <bioinformatica-i...@us.es> wrote: But what if the user knows the path to that application (let's say the python command) and executes it on the partition he/she should not be allowed to use? Is it possible through lua scripts to set constraints on software usage, such as a limited shell, for instance? In fact, what I'd like to implement is something like a limited shell, on a particular node, for a particular partition and a particular program. On 12/01/18 17:39, Paul Edmon wrote: You could do this using a job_submit.lua script that inspects for that application and routes it properly. -Paul Edmon- On 01/12/2018 11:31 AM, Juan A. Cordero Varelaq wrote: Dear Community, I have a node (20 cores) on my HPC system with two different partitions: big (16 cores) and small (4 cores). I have installed software X on this node, but I want only one partition to have the right to run it. Is it then possible to restrict the execution of a specific application to a given partition on a given node? Thanks
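A variation on the same idea, sketched with placeholder paths and partition names: instead of patching the program itself, install a small wrapper in its place that only execs the real binary when the job sits in an allowed partition. This is only an illustration of the approach above, not a tested production script.

#!/bin/bash
# Hypothetical wrapper installed in place of software X; the path below is a placeholder.
REAL_PROG=/opt/X/bin/X.real

case "$SLURM_JOB_PARTITION" in
    small)
        # allowed partition: hand over to the real program with the original arguments
        exec "$REAL_PROG" "$@"
        ;;
    *)
        echo "X may only be run in the 'small' partition (got: ${SLURM_JOB_PARTITION:-none})" >&2
        exit 126
        ;;
esac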
[slurm-users] constrain partition to a unique shell
Dear users, I would like to force the use of only one type of shell, let's say, bash, on a partition that shares a node with another one. Do you know if it's possible to do it? What I actually want to do is to install a limited shell (lshell) on one node and force a given partition to be able to use ONLY that shell, so that the users can run a small set of commands. Thanks
[slurm-users] Tracking costs - variable costs per partition
Hello - We're in a similar situation to the one described here: https://groups.google.com/g/slurm-users/c/eBDslkwoFio where we want to track (and control) costs on a fairly heterogeneous system with different billing weights per partition. The solution proposed seems like it would work rather well, except that our use of fairshare seems to interfere with the billing values we would want to use to limit usage based on the credits granted. We have PriorityDecayHalfLife set on our system, so that billing value (GrpTRESRaw) decays over time. Is there a way to implement something similar on an otherwise fairshare-based system? Thanks, Jeff
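One direction that might reconcile the two, sketched with placeholder node names, weights and limits, and not verified on such a setup: keep fairshare decay for scheduling priority, but hang the credit cap on a QOS created with the NoDecay flag, so the GrpTRESMins accounting it enforces (including the billing TRES produced by per-partition TRESBillingWeights) is not decayed by PriorityDecayHalfLife.

# slurm.conf: per-partition billing weights (node lists and weights are placeholders)
PartitionName=big   Nodes=node[01-04] TRESBillingWeights="CPU=2.0,Mem=0.5G"
PartitionName=small Nodes=node[05-08] TRESBillingWeights="CPU=1.0,Mem=0.25G"

# a credit QOS whose usage is not decayed; 100000 billing-minutes is a placeholder
sacctmgr add qos credits Flags=NoDecay GrpTRESMins=billing=100000
sacctmgr modify account lab set QOS+=credits DefaultQOS=credits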