[slurm-users] QOS MaxTRESPU node=X intepretation

2024-08-30 Thread David Magda via slurm-users
info. Regards, David -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Best practices for tracking jobs started across multiple clusters for accounting purposes.

2024-08-29 Thread David via slurm-users
Hello, What is meant here by "tracking"? What information are you looking to gather and track? I'd say the simplest answer is using sacct, but I am not sure how federated/non-federated setups come into play while using it. David On Tue, Aug 27, 2024 at 6:23 AM Di Bernardini,

[slurm-users] Re: Temporarily bypassing pam_slurm_adopt.so

2024-07-08 Thread David Schanzenbach via slurm-users
Hi Daniel, Utilizing pam_access  with pam_slurm_adopt might be what you are looking for? https://slurm.schedmd.com/pam_slurm_adopt.html#admin_access Thanks, David On 7/8/2024 10:54 AM, Daniel L'Hommedieu via slurm-users wrote: Hi, all. We have a use case where we need to allow a gro

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-17 Thread David Magda via slurm-users
Could you post that snippet? > On Jun 14, 2024, at 14:33, Laura Hild via slurm-users > wrote: > > I wrote a job_submit.lua also. It would append "&centos79" to the feature > string unless the features already contained "el9," or if empty, set the > features string to "centos79" without the ampe

[slurm-users] Re: Node (anti?) Feature / attribute

2024-06-17 Thread David Magda via slurm-users
This functionality in slurmd was added in August 2023, so not in the version we’re currently running: https://github.com/SchedMD/slurm/commit/0daa1fda97c125c0b1c48cbdcdeaf1382ed71c4f Perhaps something for the future. Currently looking like the job_submit.lua is the best candidate.

[slurm-users] Node (anti?) Feature / attribute

2024-06-14 Thread David Magda via slurm-users
itions, and not (AFAICT) on a per node basis. We’re currently on 22.05.x, but upgrading is fine. Regards, David -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] A fairshare policy that spans multiple clusters

2024-01-05 Thread David Baker
r clusters – if that makes sense. Does anyone have any thoughts on this question, please? Am I correct in thinking that federating clusters is related to my question? Do I gather correctly, however, that federation only works if there is a common database on a shared file system? Best regards, David

Re: [slurm-users] TRES sreport per association

2023-11-16 Thread David
be very lengthy output. HTH, David On Sun, Nov 12, 2023 at 6:03 PM Kamil Wilczek wrote: > Dear All, > > is it possible to report GPU Minutes per association? Suppose > I have two associations like this: > >sacctmgr show assoc where user=$(whoami) > format=account%10,use

Re: [slurm-users] Steps to upgrade slurm for a patchlevel change?

2023-09-28 Thread David
lurmd nodes. > > Is there an expedited, simple, slimmed down upgrade path to follow if > we're looking at just a . level upgrade? > > Rob > > -- David Rhey --- Advanced Research Computing University of Michigan

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
DC (USA) wrote: > On Sep 21, 2023, at 9:46 AM, David wrote: > > Slurm is working as it should. From your own examples you proved that; by > not submitting to b4 the job works. However, looking at man sbatch: > >-p, --partition= > Request a specific p

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
tion it prevents jobs > from being queued! > Nothing in the documentation about --partition made me think that > forbidding access to one partition would make a job unqueueable... > > Diego > > Il 21/09/2023 14:41, David ha scritto: > > I would think that slurm would only

Re: [slurm-users] Weirdness with partitions

2023-09-21 Thread David
to avoid having to replicate scheduler logic in > job_submit.lua... :) > > -- > Diego Zuccato > DIFA - Dip. di Fisica e Astronomia > Servizi Informatici > Alma Mater Studiorum - Università di Bologna > V.le Berti-Pichat 6/2 - 40127 Bologna - Italy > tel.: +39 051 20 95786 > > -- David Rhey --- Advanced Research Computing University of Michigan

Re: [slurm-users] running mpi from inside an mpi job

2023-06-20 Thread David Schanzenbach
adding the --overlap flag to the srun call for the parent mpi process fixes the problem. https://slurm.schedmd.com/srun.html#OPT_overlap Thanks, David On 6/20/2023 4:08 AM, Vanhorn, Mike wrote: I have a user who is submitting a job to slurm which requests 16 tasks, i.e. #SBATCH --ntasks 16

Re: [slurm-users] seff in slurm-23.02

2023-05-25 Thread David Gauchard
Advanced Research Computing* Information and Technology Solutions (ITS) 303-273-3786 | mrobb...@mines.edu *Our values:*Trust | Integrity | Respect | Responsibility *From: *slurm-users on behalf o

[slurm-users] seff in slurm-23.02

2023-05-25 Thread David Gauchard
Hello, slurm-23.02 on ubuntu-20.04, seff is not working anymore: ``` # ./seff 4911385 Use of uninitialized value $FindBin::Bin in concatenation (.) or string at ./seff line 11. Name "FindBin::Bin" used only once: possible typo at ./seff line 11, line 602. perl: error: slurm_persist_conn_open:

[slurm-users] batched and efficient job status queries by snakemake using sacct

2023-03-15 Thread David Laehnemann
d to the current solution! cheers, david
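
A minimal sketch of what a batched status query could look like, assuming the goal is one `sacct` call per polling cycle rather than one call per job. The function names are hypothetical and are not taken from snakemake; only the `sacct` options used (`-X`, `--parsable2`, `--noheader`, `--format`, `-j`) are standard.

```python
import subprocess


def build_sacct_command(job_ids):
    """Return a single sacct invocation that covers every job ID given."""
    return [
        "sacct",
        "-X",                       # one line per job allocation, no steps
        "--parsable2",              # '|'-delimited output, no trailing '|'
        "--noheader",
        "--format=JobIDRaw,State",
        "-j", ",".join(str(job_id) for job_id in job_ids),
    ]


def parse_sacct_output(text):
    """Map job ID -> state from '|'-delimited sacct output."""
    states = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state = line.split("|", 1)
        # States such as "CANCELLED by 1234" are reduced to their first word.
        states[job_id] = state.split()[0] if state.strip() else ""
    return states


def query_states(job_ids):
    """Run the batched query (requires a working sacct on the host)."""
    result = subprocess.run(
        build_sacct_command(job_ids),
        capture_output=True, text=True, check=True,
    )
    return parse_sacct_output(result.stdout)
```

Keeping the command builder and the parser separate from the `subprocess` call means both can be checked against captured output without access to a cluster.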

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
, so that others can hopefully reuse as much as they can in their contexts. But maybe some publicly available best practices (and no-gos) for slurm cluster users would be a useful resource that cluster admins can then point / link to. cheers, david On Mon, 2023-02-27 at 06:53 -0800, Brian Andr

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
y used workflow manager in my field (bioinformatics), there's also an issue discussing Slurm job array support: https://github.com/nextflow-io/nextflow/issues/1477 cheers, david On Mon, 2023-02-27 at 13:24 +0100, Ward Poelmans wrote: > On 24/02/2023 18:34, David Laehnemann wrote: > > Those

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread David Laehnemann
k heads-up: I am documenting your input by linking to the mailing list archives, I hope that's alright for you? https://github.com/snakemake/snakemake/pull/2136#issuecomment-1446170467 cheers, david On Sat, 2023-02-25 at 10:51 -0800, Chris Samuel wrote: > On 23/2/23 2:55 am, Davi

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-24 Thread David Laehnemann
Id can be non-unique? That would indeed spell trouble on a different level, and make status checks much more complicated... cheers, david On Thu, 2023-02-23 at 11:59 -0500, Sean Maxwell wrote: > Hi David, > > On Thu, Feb 23, 2023 at 10:50 AM David Laehnemann < > david.laehnem...@hhu

Re: [slurm-users] snakemake and slurm in general

2023-02-24 Thread David Laehnemann
logic, but probably isn't impossible. And it seems to have been discussed, even recently (and I think, even with a recent contribution by you;): https://github.com/snakemake/snakemake/issues/301 I'll try to keep revisiting this, if I can find the time. cheers, david On Fri, 2023-02-24 at 08:

Re: [slurm-users] snakemake and slurm in general

2023-02-23 Thread David Laehnemann
workflow management system giving you additional control over things. So I'm not sure what exactly we are arguing about, right here... cheers, david On Thu, 2023-02-23 at 17:41 +0100, Ole Holm Nielsen wrote: > On 2/23/23 17:07, David Laehnemann wrote: > > In addition, there are very clear

[slurm-users] snakemake and slurm in general

2023-02-23 Thread David Laehnemann
eeper knowledge of such cluster systems providing their help along the way, which is why I am on this list now, asking for insights. So feel free to dig into the respective code bases with a bit of that grumpy energy, making snakemake or nextflow a bit better in how they deal with Slurm. cheers,

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread David Laehnemann
more efficiently (and better tailored to how Slurm does things) is appreciated. cheers, david On Thu, 2023-02-23 at 09:46 -0500, Sean Maxwell wrote: > Hi David, > > On Thu, Feb 23, 2023 at 8:51 AM David Laehnemann < > david.laehnem...@hhu.de> > wrote: > > > Quick

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread David Laehnemann
sacct that slurmdbd will gracefully handle (per second)? Or any suggestions how to roughly determine such a rate for a given cluster system? cheers, david P.S.: @Loris and @Noam: Exactly, snakemake is a software distinct from slurm that you can use to orchestrate large analysis workflows---on

[slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-23 Thread David Laehnemann
ral? Many thanks and best regards, David

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread David Thompson
group name. Again, thanks Paul/Brian for the assistance. David Thompson University of Wisconsin – Madison Social Science Computing Cooperative From: David Thompson Sent: Friday, December 2, 2022 1:13 PM To: slurm-users@lists.schedmd.com Subject: RE: [slurm-users] Slurm v22 for Alma 8 Hi Paul

Re: [slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread David Thompson
appreciate the help. David Thompson University of Wisconsin – Madison Social Science Computing Cooperative From: slurm-users On Behalf Of Paul Edmon Sent: Friday, December 2, 2022 11:26 AM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Slurm v22 for Alma 8 Yup, here is the spec we use

[slurm-users] Slurm v22 for Alma 8

2022-12-02 Thread David Thompson
ol test in testsuite/slurm_unit/common/slurm_protocol_defs: FAIL: slurm_addto_id_char_list-test Before I start digging in, I thought I would check here and see if anyone has a successful RHEL/Alma/Rocky 8 slurm v22 SRPM they'd be willing to share. Thanks much! David Thompson University of Wisc

Re: [slurm-users] Possible to get cluster utilization by partition?

2022-08-24 Thread Chin,David
I cooked one up myself using Python (with Pandas) which I feel is more maintainable. https://github.com/prehensilecode/slurm_utilization/blob/main/utilization_from_sacct.py It's still in pretty rough shape, and could certainly use some refining. Cheers, Dave -- David Chin, PhD (h

Re: [slurm-users] srun: error: io_init_msg_unpack: unpack error

2022-08-08 Thread David Magda
On Aug 6, 2022, at 15:13, Chris Samuel wrote: > > On 6/8/22 10:43 am, David Magda wrote: > >> It seems that the new srun(1) cannot talk to the old slurmd(8). >> Is this 'on purpose'? Does the backwards compatibility of the protocol not >> extend t

[slurm-users] srun: error: io_init_msg_unpack: unpack error

2022-08-06 Thread David Magda
e backwards compatibility of the protocol not extend to srun(1)? Is there any way around this, or should we simply upgrade slurmd(8) on the work nodes, but leave the paths to the older user CLI utilities alone until all the compute nodes have been upgraded? Thanks for any info. Regards, David

Re: [slurm-users] Rolling reboot with at most N machines down simultaneously?

2022-08-04 Thread David Simpson
Another way might be to implement slurm power off/on (if not already) and induce it as required. - David Simpson - Senior Systems Engineer ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB

[slurm-users] Is there a way create reservations w/o being Operator or Admin?

2022-07-11 Thread David Henkemeyer
our users into the accounting DB (and then add new users when we bring new people onboard). Thanks in advance, David

Re: [slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread David Henkemeyer
other focuses on the rest. Or something similar. David On Thu, May 12, 2022 at 9:13 AM Brian Andrus wrote: > I suspect you have too low of a setting for "MaxJobCount" > > *MaxJobCount* > The maximum number of jobs SLURM can have in its active database >

[slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions

2022-05-12 Thread David Henkemeyer
lurm parameter I can tweak to make slurm recognize that these partition B jobs shouldn't ever have a pending state of "priority". Or to treat these as 2 separate queues. Or something like that. Spinning up a 2nd slurm controller is not ideal for us (unless there is a lightweight method to do it). Thanks David

[slurm-users] How to run a job at the end of a set of jobs

2022-05-09 Thread David Henkemeyer
e the last job. What would be the various ways to achieve this? Thanks David

Re: [slurm-users] Is sacct not handling quotes properly?

2022-05-04 Thread David Henkemeyer
-- sbatch --export=NONE --wrap=uname -a --exclusive So, it's storing properly, now I need to see if I can figure out how to preserve/add the quotes on the way out of the DB... David On Wed, May 4, 2022 at 11:15 AM Michael Jennings wrote: > On Wednesday, 04 May 2022, at 10:00:57 (-0700), > Davi

[slurm-users] Is sacct not handling quotes properly?

2022-05-04 Thread David Henkemeyer
's stripping the quotes? This seems unlikely to me. Thanks in advance! David

Re: [slurm-users] gres/gpu count lower than reported

2022-05-03 Thread David Henkemeyer
then try taking them back into the idle state. Also, keep an eye on the slurmctld and slurmd logs. They usually are quite helpful to highlight what the issue is. David On Tue, May 3, 2022 at 11:50 AM Jim Kavitsky wrote: > Hello Fellow Slurm Admins, > > > > I have a new Slurm instal

[slurm-users] Looking for examples of daily job reports

2022-04-15 Thread David Henkemeyer
c and not that easy to parse. I know that there are 3rd party tools that can help with this. I'd love to hear/see what others are doing. Thanks David

Re: [slurm-users] Memory usage not tracked

2022-04-06 Thread Chin,David
TIME TIME_LIMIT NODES MIN_MEMO NODELIST(REASON) 2514854 def ClusterJobStart_ sbradley RUNNING 5:05:27 8:00:00 1 36G node003 -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp

Re: [slurm-users] Node is not allocating all CPUs

2022-04-06 Thread Guertin, David S.
slurm.conf contains the following: SelectType=select/cons_tres SelectTypeParameters=CR_Core AccountingStorageTRES=gres/gpu Could this be constraining CfgTRES=cpu=16 somehow? David Guertin From: Guertin, David S. Sent: Wednesday, April 6, 2022 12:27 PM To: Slurm

Re: [slurm-users] Node is not allocating all CPUs

2022-04-06 Thread Guertin, David S.
uld the number of trackable resources be different from the number of actual CPUs? David Guertin From: slurm-users on behalf of Sarlo, Jeffrey S Sent: Wednesday, April 6, 2022 10:30 AM To: Slurm User Community List Subject: Re: [slurm-users] Node is

Re: [slurm-users] Node is not allocating all CPUs

2022-04-06 Thread Guertin, David S.
g 16. David Guertin From: slurm-users on behalf of Brian Andrus Sent: Tuesday, April 5, 2022 6:14 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Node is not allocating all CPUs You want to see what is output on the node itself when yo

[slurm-users] Node is not allocating all CPUs

2022-04-05 Thread Guertin, David S.
CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s Why isn't this node allocating all 32 cores? Thanks, David Guertin

Re: [slurm-users] Can I define and use custom env vars in slurm.conf?

2022-04-04 Thread David Henkemeyer
That's exactly what I needed! Thank you, David On Mon, Apr 4, 2022 at 1:17 PM Brian Andrus wrote: > I think you are looking at nodesets: > > > From the slurm.conf man: > > NODESET CONFIGURATION > The nodeset configuration allows you to define a name for a specifi

[slurm-users] Can I define and use custom env vars in slurm.conf?

2022-04-04 Thread David Henkemeyer
=$NODEPOOL1 MaxTime=INFINITE State=UP PartitionName=interactive Nodes=$NODEPOOL1 MaxTime=INFINITE State=UP PriorityJobFactor=2 PartitionName=perf Nodes=$PERFNODES MaxTime=INFINITE State=UP PriorityJobFactor=2 Is this possible? Thanks, David

[slurm-users] Why is --cpu_bind not an option for sbatch? Why only srun?

2022-03-31 Thread David Henkemeyer
We noticed that we can pass --cpu_bind into an srun commandline, but not sbatch. Why is that? Thanks David

Re: [slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread David Henkemeyer
Thank you! We recently converted from pbs, and I was converting “ppn=X” to “-n X”. Does it make more sense to convert “ppn=X” to “--cpus-per-task=X”? Thanks again David On Thu, Mar 24, 2022 at 3:54 PM Thomas M. Payerle wrote: > Although all three cases ( "-N 1 --cpus-per-task 64 -n 1

Re: [slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread David Henkemeyer
is significant. > > > On Mar 24, 2022, at 12:32 PM, David Henkemeyer < > david.henkeme...@gmail.com> wrote: > > > > Assuming -N is 1 (meaning, this job needs only one node), then is there > a difference between any of these 3 flag combinations: > > > > -n

[slurm-users] Question about sbatch options: -n, and --cpus-per-task

2022-03-24 Thread David Henkemeyer
functional difference. But if there is even a subtle difference, I would love to know what it is! Thanks David -- Sent from Gmail Mobile

Re: [slurm-users] monitoring and update regime for Power Saving nodes

2022-02-24 Thread David Simpson
a dummy job to bring powered down nodes up then a clustershell slurmd stop is probably the answer regards David - David Simpson - Senior Systems Engineer ARCCA, Redwood Building, King Edward VII Avenue, Ca

Re: [slurm-users] monitoring and update regime for Power Saving nodes

2022-02-24 Thread David Simpson
nd any down nodes will automatically read the latest. Yes, currently we use file based and config written to the compute node’s disks themselves via ansible. Perhaps we will consider moving the file to a shared fs. regards David - David Simpson - Senior Systems Engineer ARCCA, Redwood

[slurm-users] monitoring and update regime for Power Saving nodes

2022-02-23 Thread David Simpson
anything else) to a node which is down due to power saving (during a maintenance/reservation) what is your approach? Do you end up with 2 slurm.confs (one for power saving and one that keeps everything up, to work on during the maintenance)? thanks David - David Simpson - Senior

[slurm-users] Questions about default_queue_depth

2022-01-12 Thread David Henkemeyer
selected? 3) Is there a way to see the order of the jobs in the queue? Perhaps squeue lists the jobs in order? 4) If we had several partitions, would the default_queue_depth apply to all partitions? Thank you David

[slurm-users] How to limit # of execution slots for a given node

2022-01-06 Thread David Henkemeyer
tion, it seems to me. At least, it left me feeling like there has to be a better way. Thanks! David

[slurm-users] A Slurm topological scheduling question

2021-12-07 Thread David Baker
not happy, by the way, to have node/switch connections across racks. Best regards, David

Re: [slurm-users] Possible to get cluster utilization by partition?

2021-11-05 Thread Chin,David
would be a generally useful feature. Cheers, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:preh

[slurm-users] Possible to get cluster utilization by partition?

2021-11-04 Thread Chin,David
understanding usage if a similar report could be produced for each partition. I tried the obvious, adding "partitions=gpu", but that option isn't applicable to the cluster utilization report: it just produces the same output as the above command. Cheers, Dave -- David Chin, PhD (he/him)

[slurm-users] Bug when I run "sinfo --states=idle"

2021-10-28 Thread David Henkemeyer
E NODELIST debug up infinite 1 drain node6 debug1* up infinite 0 n/a (! 809)-> sinfo --states=down PARTITION AVAIL TIMELIMIT NODES STATE NODELIST debug up infinite 1 down* node1 debug1* up infinite 0 n/a Is this a known issue? We are running 21.08.0 David

[slurm-users] Can I get the original sbatch command, after the fact?

2021-07-16 Thread David Henkemeyer
If I execute a bunch of sbatch commands, can I use sacct (or something else) to show me the original sbatch command line for a given job ID? Thanks David

[slurm-users] When using RequeueExit in Slurm.conf, can you limit the # of requeues?

2021-07-01 Thread David Henkemeyer
Hello, I am investigating Slurm's ability to do requeuing of jobs. I like the fact that I can set RequeueExit= in the slurm.conf file, since this will automatically requeue jobs that exit with the specified exit codes. But, is there a way to limit the # of requeues? Thanks David

[slurm-users] New node w/ 3 GPUs is not accepting GPUs tasks

2021-06-23 Thread David Henkemeyer
batch --export=NONE -N 1 --constraint foo --wrap "ls" Submitted batch job 385 Thanks for the help, David

[slurm-users] Question about adding and removing features in Slurm

2021-06-18 Thread David Henkemeyer
r will manually edit slurm.conf to add/remove features? I've searched the docs and this seems to be the case, but I just wanted to check with the experts to be sure. Thanks so much, David

Re: [slurm-users] Maui equivalent Nodeallocationpolicy

2021-06-08 Thread David Chaffin
nd I think this is working SelectTypeParameters=CR_Core,CR_Pack_Nodes,CR_CORE_DEFAULT_DIST_BLOCK Thanks, David On Mon, Jun 7, 2021 at 2:44 PM David Chaffin wrote: > Hi all, > > we get a lot of small sub-node jobs that we want to pack together. Maui > does this pretty well with the s

[slurm-users] Maui equivalent Nodeallocationpolicy

2021-06-07 Thread David Chaffin
Hi all, we get a lot of small sub-node jobs that we want to pack together. Maui does this pretty well with the smallest node that will hold the job, NODEALLOCATIONPOLICY MINRESOURCE I can't figure out the slurm equivalent. Default backfill isn't working well. Anyone know of one? Thanks, David

Re: [slurm-users] slurmrestd

2021-06-06 Thread David Schanzenbach
and http-parser-devel under CentOS. Thanks, David On 6/6/2021 1:15 PM, Sid Young wrote: Hi all, I'm interested in using the slurmrestd but it does not appear to be built when you do an rpmbuild. Reading through the docs does not indicate a switch needed to include it (unless I missed

Re: [slurm-users] What is an easy way to prevent users run programs on the, master/login node.

2021-05-20 Thread David Schanzenbach
ee more login node abuse, we would probably try and layer on the use of cgroups to try and limit memory and cpu usage. Thanks, David Date: Wed, 19 May 2021 19:00:38 +0300 From: Alan Orth To: Ole Holm Nielsen , Slurm User Community List Subject: Re: [slurm-users] What is an easy way to pre

Re: [slurm-users] Different GPU types on the same server

2021-05-14 Thread David Gauchard
s` on each line should partition the host or not (=> CPUs=0-3 for all lines) david On 5/14/21 12:28 PM, Emyr James wrote: Dear all, We currently have a single gpu capable server with 10x RTX2080Ti in it. One of our research groups wants to replace one of these cards with an RTX3090 but o

Re: [slurm-users] Configless mode enabling issue

2021-05-07 Thread David Henkemeyer
Thank you for the reply, Will! The slurm.conf file only has one line in it: AutoDetect=nvml During my debug, I copied this file from the GPU node to the controller. But, that's when I noticed that the node w/o a GPU then crashed on startup. David On Fri, May 7, 2021 at 12:14 PM Will D

[slurm-users] Configless mode enabling issue

2021-05-07 Thread David Henkemeyer
Hello all. My team is enabling slurm (version 20.11.5) in our environment, and we got a controller up and running, along with 2 nodes. Everything was working fine. However, when we try to enable configless mode, I ran into a problem. The node that has a GPU is coming up in "drained" state, and s

[slurm-users] slurmd -C vs lscpu - which do I use to populate slurm.conf?

2021-04-28 Thread David Henkemeyer
ting: NodeName=devops2 CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=1 RealMemory=9913 Why is there a discrepancy? Which should I use to populate slurm.conf? The OS of this machine is Centos 8. Thank you, David
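
As a rough aid for comparing the two sources, the node line quoted above can be split into fields and checked for internal consistency. This is only a sketch: it assumes `CPUs` counts logical processors, so that it should equal Boards x SocketsPerBoard x CoresPerSocket x ThreadsPerCore, and the helper names are made up for illustration.

```python
def parse_node_line(line):
    """Split a 'Key=Value Key=Value ...' node definition into a dict."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


def expected_cpus(fields):
    """Logical CPU count implied by the topology fields (an assumption)."""
    return (
        int(fields["Boards"])
        * int(fields["SocketsPerBoard"])
        * int(fields["CoresPerSocket"])
        * int(fields["ThreadsPerCore"])
    )


# The line reported in the message above.
line = (
    "NodeName=devops2 CPUs=4 Boards=1 SocketsPerBoard=1 "
    "CoresPerSocket=4 ThreadsPerCore=1 RealMemory=9913"
)
node = parse_node_line(line)
consistent = int(node["CPUs"]) == expected_cpus(node)
print(node["NodeName"], consistent)  # devops2 True
```

The same check can be run on values transcribed from `lscpu` to see which of its figures disagrees with the line above.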

[slurm-users] Questions about adding new nodes to Slurm

2021-04-27 Thread David Henkemeyer
ources for adding/removing nodes to Slurm would be much appreciated. Perhaps there is a "toolkit" out there to automate some of these operations (which is what I already have for PBS, and will create for Slurm, if something doesn't already exist). Thank you all, David

Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
RAMPercent=100.00 MaxSwapPercent=100.00 MinRAMSpace=200 Cheers, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:preh

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100%CPU even after the job had ended. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
t 16e9 rows in the original file. Saved output .mat file is only 1.8kB. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki git

Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
0 CPU Efficiency: 11.96% of 2-09:10:56 core-walltime Job Wall-clock time: 03:34:26 Memory Utilized: 1.54 GB Memory Efficiency: 1.21% of 128.00 GB -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu
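
The core-walltime figure in the output above can be reproduced by hand, which helps when reading such reports. The sketch below assumes the job held 16 CPUs, a value taken from the `cpu=16` accounting record quoted in the first message of this thread rather than from the output itself; the helper names are illustrative.

```python
def parse_duration(text):
    """Convert a Slurm-style [D-]HH:MM:SS duration into seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def format_duration(total_seconds):
    """Format seconds back into [D-]HH:MM:SS."""
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    clock = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"{days}-{clock}" if days else clock


wall_seconds = parse_duration("03:34:26")   # job wall-clock time
allocated_cpus = 16                         # assumption: cpu=16 from the job record
core_walltime = wall_seconds * allocated_cpus
print(format_duration(core_walltime))       # 2-09:10:56
```

The result matches the "2-09:10:56 core-walltime" shown in the report, so the CPU efficiency percentage is measured against wall-clock time multiplied by the allocated CPUs.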

[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

2021-03-15 Thread Chin,David
m=0,node=1 83387.extern extern node001 03:34:26 COMPLETED 0:0 128Gn 460K 153196K billing=16,cpu=16,node=1 Thanks in advance, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 21

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
My mistake - from slurm.conf(5): SrunProlog runs on the node where the "srun" is executing. i.e. the login node, which explains why the directory is not being created on the compute node, while the echos work. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
creating the directory in (chmod 1777 for the parent directory is good) Brian Andrus On 3/4/2021 9:03 AM, Chin,David wrote: Hi, Brian: So, this is my SrunProlog script -- I want a job-specific tmp dir, which makes for easy cleanup at end of job: #!/bin/bash if [[ -z ${SLURM_ARRAY_JOB

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Chin,David
62 dwc62 6 Mar 4 11:52 /local/scratch/80472/ node001::~$ exit So, the "echo" and "whoami" statements are executed by the prolog script, as expected, but the mkdir commands are not? Thanks, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu

Re: [slurm-users] prolog not passing env var to job

2021-03-03 Thread Chin,David
shell on the compute node does not have the env variables set. I use the same prolog script as TaskProlog, which sets it properly for jobs submitted with sbatch. Thanks in advance, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.57

[slurm-users] sreport cluster AccountUtilizationByUser showing utilization of a deleted account

2021-02-09 Thread Chin,David
er the urcfadm account. Is there a way to fix this without just purging all the data? If there is no "graceful" fix, is there a way I can "reset" the slurm_acct_db, i.e. actually purge all data in all tables? Thanks in advance, Dave -- David Chin, PhD

Re: [slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-09 Thread Chin,David
Steps Suspend Usage This generated various usage dump files, and the job_table and step_table dumps. -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki

[slurm-users] Unsetting a QOS Flag?

2021-02-08 Thread Chin,David
ags=DenyOnLimit", and "sacctmgr modify qos foo set Flags=NoDenyOnLimit", to no avail. Thanks in advance, Dave -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusm

[slurm-users] sacctmgr archive dump - no dump file produced, and data not purged?

2021-02-05 Thread Chin,David
;s".) Is there something I am missing? Thanks, Dave Chin -- David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel dw...@drexel.edu 215.571.4335 (o) For URCF support: urcf-supp...@drexel.edu https://proteusmaster.urcf.drexel.edu/urcfwiki github:prehensilecode Drexel Internal Data

Re: [slurm-users] Validating SLURM sreport cluster utilization report

2021-01-29 Thread David Simpson
Out of interest (for those that do record and/or report on uptime) if you aren't using the sreport cluster utilization report what alternative method are you using instead? If you are using sreport cluster utilization report have you encountered this? thanks David - David Si

[slurm-users] Validating SLURM sreport cluster utilization report

2021-01-22 Thread David Simpson
ems with 3 nodes. So at the moment off the top of the head we don't understand this reported Down time. Is anyone else relying on sreport for this metric? If so have you encountered this sort of situation? regards David - David Simpson - Senior Systems Engineer ARCCA, Redwood

[slurm-users] Backfill pushing jobs back

2021-01-04 Thread David Baker
recent version of slurm would still have a backfill issue that starves larger job out. We're wondering if you have forgotten to configure something very fundamental, for example. Best regards, David

Re: [slurm-users] Backfill pushing jobs back

2020-12-21 Thread David Baker
e any parameter that we need to set to activate the backfill patch, for example? Best regards, David From: slurm-users on behalf of Chris Samuel Sent: 09 December 2020 16:37 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Backfill pushing jobs back CA

Re: [slurm-users] Backfill pushing jobs back

2020-12-10 Thread David Baker
st regards, David From: slurm-users on behalf of Chris Samuel Sent: 09 December 2020 16:37 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Backfill pushing jobs back CAUTION: This e-mail originated outside the University of Southampton. Hi David, On

[slurm-users] Backfill pushing jobs back

2020-12-09 Thread David Baker
ect behaviour? It is also weird that the pending jobs don't have a start time. I have increased the backfill parameters significantly, but it doesn't seem to affect this at all. SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60 Best regards, David
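
To make the settings above easier to reason about, the `SchedulerParameters` value can be split into options and converted into friendlier units. A small sketch, relying on the documented units (`bf_window` in minutes, `bf_resolution` in seconds); the parsing helper is hypothetical.

```python
def parse_scheduler_parameters(value):
    """Split a comma-separated SchedulerParameters value into a dict.

    Options without '=' (such as bf_continue) are stored as True.
    """
    params = {}
    for item in value.split(","):
        key, sep, val = item.strip().partition("=")
        params[key] = val if sep else True
    return params


# The value quoted in the message above.
value = (
    "bf_window=14400,bf_resolution=2400,bf_max_job_user=80,"
    "bf_continue,default_queue_depth=1000,bf_interval=60"
)
params = parse_scheduler_parameters(value)

window_days = int(params["bf_window"]) / (24 * 60)       # minutes -> days
resolution_minutes = int(params["bf_resolution"]) / 60   # seconds -> minutes
print(window_days, resolution_minutes)  # 10.0 40.0
```

Read this way, the configuration asks backfill to plan 10 days ahead at a 40-minute granularity, which may be worth comparing against the longest time limit allowed on the cluster.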

[slurm-users] ninja and cmake

2020-11-24 Thread David Bellot
e and distcc exist and I use them, but here I want to test if it's possible to do it with Slurm (as a proof of concept). Cheers, David

Re: [slurm-users] unable to run on all the logical cores

2020-10-11 Thread David Bellot
result, or should I rather launch 20 jobs per node and have each job split in two internally (using "parallel" or "future" for example)? On Thu, Oct 8, 2020 at 6:32 PM William Brown wrote: > R is single threaded. > > On Thu, 8 Oct 2020, 07:44 Diego Zuccato, wrot

Re: [slurm-users] Controlling access to idle nodes

2020-10-08 Thread David Baker
Thank you very much for your comments. Oddly enough, I came up with the 3-partition model as well once I'd sent my email. So, your comments helped to confirm that I was thinking on the right lines. Best regards, David From: slurm-users on behalf of Thom

Re: [slurm-users] unable to run on all the logical cores

2020-10-07 Thread David Bellot
chtools in this case) the jobs. I'm still investigating even if NumCPUs=1 now as it should be. Thanks. David On Thu, Oct 8, 2020 at 4:40 PM Rodrigo Santibáñez < rsantibanez.uch...@gmail.com> wrote: > Hi David, > > I had the same problem time ago when configuring my f

[slurm-users] unable to run on all the logical cores

2020-10-07 Thread David Bellot
why TRES=cpu=2 Any idea on how to solve this problem and have 100% of the logical cores allocated? Best regards, David

[slurm-users] Controlling access to idle nodes

2020-10-06 Thread David Baker
like a two-way scavenger situation. Could anyone please help? I have, by the way, set up partition-based pre-emption in the cluster. This allows the general public to scavenge nodes owned by research groups. Best regards, David

[slurm-users] Accounts and QOS settings

2020-10-01 Thread David Baker
partition. My thought was to have two overlapping partitions each with the relevant QOS and account group access control. Perhaps I am making this too complicated. I would appreciate your advice, please. Best regards, David
