from:"Davide DelVento"

[slurm-users] Re: Create filenames based on slurm hosts

2025-02-14 Thread Davide DelVento via slurm-users

Actually I hit send too quickly; what I meant (assuming bash) is for a in $(scontrol show hostname whatever_list); do touch $a; done (with the same whatever_list being $SLURM_JOB_NODELIST). On Fri, Feb 14, 2025 at 1:18 PM Davide DelVento wrote: > Not sure I completely understand what you n
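The loop in this message can be wrapped in a small bash function. This is only a sketch: touch_per_node and its prefix argument are illustrative names, not something Slurm provides, and it uses the documented spelling scontrol show hostnames.

```shell
# Sketch: create one empty file per node of a Slurm hostlist.
# touch_per_node is a hypothetical helper name, not a Slurm command.
touch_per_node() {
    local prefix="$1" nodelist="$2" host
    # "scontrol show hostnames" prints one expanded hostname per line;
    # the unquoted $(...) relies on word splitting to iterate over them.
    for host in $(scontrol show hostnames "$nodelist"); do
        touch "${prefix}${host}"
    done
}

# Inside a job script one could call, for example:
#   touch_per_node whatever_prefix_ "$SLURM_JOB_NODELIST"
```

Passing an empty prefix gives the same result as the plain touch $a loop.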

[slurm-users] Re: Create filenames based on slurm hosts

2025-02-14 Thread Davide DelVento via slurm-users

Not sure I completely understand what you need, but if I do... How about touch whatever_prefix_$(scontrol show hostname whatever_list) where whatever_list could be your $SLURM_JOB_NODELIST ? On Fri, Feb 14, 2025 at 9:42 AM John Hearns via slurm-users < slurm-users@lists.schedmd.com> wrote: > I

[slurm-users] Re: Unexpected node got allocation

2025-01-09 Thread Davide DelVento via slurm-users

I believe in absence of other reasons, slurm assigns nodes to jobs in the order they are listed in the partition definitions of slurm.conf -- perhaps for whatever reason the node 41 appears first there, rather than 01? On Thu, Jan 9, 2025 at 7:24 AM Dan Healy via slurm-users < slurm-users@lists.sc

[slurm-users] Re: formatting node names

2025-01-07 Thread Davide DelVento via slurm-users

Wonderful. Thanks Ole for the reminder! I had bookmarked your wiki (of course!) but forgot to check it out in this case. I'll add a more prominent reminder to self in my notes to always check it! Happy new year everybody once again On Tue, Jan 7, 2025 at 1:58 AM Ole Holm Nielsen via slurm-users <

[slurm-users] Re: formatting node names

2025-01-06 Thread Davide DelVento via slurm-users

Found it. I should have asked my puppet, as is mandatory in some places :-D It is simply: scontrol show hostname gpu[01-02],node[03-04,12-22,27-32,36] Sorry for the noise. On Mon, Jan 6, 2025 at 12:55 PM Davide DelVento wrote: > Hi all, > I remember seeing on this list a slurm
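Since scontrol prints one hostname per line, a space-separated "bash friendly" list needs one more step. A minimal sketch, assuming scontrol is on the PATH; expand_hostlist is a made-up helper name:

```shell
# Sketch: turn a Slurm hostlist into a space-separated, bash-friendly list.
# expand_hostlist is a hypothetical helper name.
expand_hostlist() {
    # One hostname per line from scontrol, joined with single spaces by paste.
    scontrol show hostnames "$1" | paste -sd ' ' -
}

# Example use (requires Slurm's scontrol):
#   for n in $(expand_hostlist 'gpu[01-02],node[03-04]'); do echo "$n"; done
```

The opposite direction, compressing a list of names back into a hostlist expression, is scontrol show hostlist.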

[slurm-users] formatting node names

2025-01-06 Thread Davide DelVento via slurm-users

Hi all, I remember seeing on this list a slurm command to change a slurm-friendly list such as gpu[01-02],node[03-04,12-22,27-32,36] into a bash friendly list such as gpu01 gpu02 node03 node04 node12 etc I made a note about it but I can't find my note anymore, nor the relevant message. Can some

[slurm-users] Re: Job not starting

2024-12-10 Thread Davide DelVento via slurm-users

> Diego > > On 07/12/2024 10:03, Diego Zuccato via slurm-users wrote: > > Hi Davide. > > > > On 06/12/2024 16:42, Davide DelVento wrote: > > > >> I find it extremely hard to understand situations like this. I wish > >> Slurm were more

[slurm-users] Re: error and output files

2024-12-09 Thread Davide DelVento via slurm-users

Mmmm, from https://slurm.schedmd.com/sbatch.html > By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. Perhaps at your site there's a configuration which uses separate error files? See the

[slurm-users] Re: Job not starting

2024-12-06 Thread Davide DelVento via slurm-users

Ciao Diego, I find it extremely hard to understand situations like this. I wish Slurm were more clear on how it reported what it is doing, but I digress... I suspect that there are other job(s) which have higher priority than this one which are supposed to run on that node but cannot start because

[slurm-users] Re: Change primary alloc node

2024-10-31 Thread Davide DelVento via slurm-users

Another possible use case of this is a regular MPI job where the first/controller task often uses more memory than the workers and may need to be scheduled on a higher memory node than them. I think I saw this happening in the past, but I'm not 100% sure it was in Slurm or some other scheduling sys

[slurm-users] Re: Job pre / post submit scripts

2024-10-28 Thread Davide DelVento via slurm-users

Not sure if I understand your use case, but if I do I am not sure if Slurm provides that functionality. If it doesn't (and if my understanding is correct), you can still achieve your goal by: 1) removing sbatch and salloc from user's path 2) writing your own custom scripts named sbatch (and hard/s

[slurm-users] Re: errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Davide DelVento via slurm-users

Slurm 18? Isn't that a bit outdated? On Fri, Sep 27, 2024 at 9:41 AM Robert Kudyba via slurm-users < slurm-users@lists.schedmd.com> wrote: > We're in the process of upgrading but first we're moving to RHEL 9. My > attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}" > slurm-18.

[slurm-users] Re: Print Slurm Stats on Login

2024-08-28 Thread Davide DelVento via slurm-users

Thanks everybody once again and especially Paul: your job_summary script was exactly what I needed, served on a golden plate. I just had to modify/customize the date range and change the following line (I can make a PR if you want, but it's such a small change that it'd take more time to deal with

[slurm-users] Re: Spread a multistep job across clusters

2024-08-26 Thread Davide DelVento via slurm-users

Ciao Fabio, That for sure is syntactically incorrect, because of the way sbatch parsing works: as soon as it finds a non-empty non-comment line (your first srun) it will stop parsing for #SBATCH directives. So assuming this is a single file as it looks from the formatting, the second hetjob and the cl
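To see which directives survive that rule in a given script, something like the following can help. It is a rough sketch of the rule as stated in this message, not sbatch's actual parser, and leading_sbatch_directives is a hypothetical name:

```shell
# Sketch of the rule described above, NOT sbatch's actual parser:
# print the #SBATCH lines that appear before the first line that is
# neither blank nor a comment. leading_sbatch_directives is a hypothetical name.
leading_sbatch_directives() {
    awk '
        /^[[:space:]]*$/ { next }         # blank line: keep scanning
        /^#SBATCH/       { print; next }  # directive seen before any command
        /^#/             { next }         # shebang or ordinary comment
                         { exit }         # first real command: stop here
    ' "$1"
}

# Example use: leading_sbatch_directives job.sh
```

Any #SBATCH line missing from the output sits after the first command and would be treated as a plain comment.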

[slurm-users] Re: Slurmdbd purge and reported downtime

2024-08-23 Thread Davide DelVento via slurm-users

owing that the problem won't happen again in the future. Thanks and have a great weekend On Fri, Aug 23, 2024 at 8:00 AM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi Davide, > > On 8/22/24 21:30, Davide DelVento via slurm-users wrote: > >

[slurm-users] Slurmdbd purge and reported downtime

2024-08-22 Thread Davide DelVento via slurm-users

I am confused by the reported amount of Down and PLND Down by sreport. According to it, our cluster would have had a significant amount of downtime, which I know didn't happen (or, according to the documentation "time that slurmctld was not responding", see https://slurm.schedmd.com/sreport.html)

[slurm-users] Re: Print Slurm Stats on Login

2024-08-21 Thread Davide DelVento via slurm-users

Hi Ole, On Wed, Aug 21, 2024 at 1:06 PM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote: > The slurmacct script can actually break down statistics by partition, > which I guess is what you're asking for? The usage of the command is: > Yes, this is almost what I was askin

[slurm-users] Re: Print Slurm Stats on Login

2024-08-21 Thread Davide DelVento via slurm-users

> inside jobs to emulate a login session, causing a heavy load on your > servers. > > /Ole > > On 8/21/24 01:13, Davide DelVento via slurm-users wrote: > > Thanks Kevin and Simon, > > > > The full thing that you do is indeed overkill, however I was able to > l

[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Davide DelVento via slurm-users

Thanks Kevin and Simon, The full thing that you do is indeed overkill, however I was able to learn how to collect/parse some of the information I need. What I am still unable to get is: - utilization by queue (or list of node names), to track actual use of expensive resources such as GPUs, high

[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Davide DelVento via slurm-users

Since each instance of the program is independent and you are using one core for each, it'd be better to let slurm deal with that and schedule them concurrently as it sees fit. Maybe you simply need to add some directive to allow shared jobs on the same node. Alternatively (if at your site jobs m

[slurm-users] Re: Print Slurm Stats on Login

2024-08-14 Thread Davide DelVento via slurm-users

g text output of squeue command) > > cheers > > josef > > -- > *From:* Davide DelVento via slurm-users > *Sent:* Wednesday, 14 August 2024 01:52 > *To:* Paul Edmon > *Cc:* Reid, Andrew C.E. (Fed) ; Jeffrey T Frey < > f...@udel.edu>; slurm-users@lists.schedm

[slurm-users] Re: Print Slurm Stats on Login

2024-08-13 Thread Davide DelVento via slurm-users

I too would be interested in some lightweight scripts. XDMOD in my experience has been very intense in workload to install, maintain and learn. It's great if one needs that level of interactivity, granularity and detail, but for some "quick and dirty" summary in a small dept it's not only overkill,

[slurm-users] Re: Seeking Commercial SLURM Subscription Provider

2024-08-13 Thread Davide DelVento via slurm-users

How about SchedMD itself? They are the ones doing most (if not all) of the development, and they are great. In my experience, the best options are either SchedMD or the vendor of your hardware. On Mon, Aug 12, 2024 at 11:17 PM John Joseph via slurm-users < slurm-users@lists.schedmd.com> wrote: >

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-02 Thread Davide DelVento via slurm-users

I am pretty sure that with vanilla slurm this is impossible. What might be possible (maybe) is submitting 5-core jobs and using some pre-post scripts which, immediately before the job starts, change the requested number of cores to "however many are currently available on the node where it is scheduled to run".

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-01 Thread Davide DelVento via slurm-users

In part, it depends on how it's been configured, but have you tried --exclusive? On Thu, Aug 1, 2024 at 7:39 AM Henrique Almeida via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello, everyone, with slurm, how to allocate a whole node for a > single multi-threaded process? > > > https:

[slurm-users] Re: Can SLURM queue different jobs to start concurrently?

2024-07-08 Thread Davide DelVento via slurm-users

I think the best way to do it would be to schedule the 10 things to be a single slurm job and then use some of the various MPMD ways (the nitty gritty details depend if each executable is serial, OpenMP, MPI or hybrid). On Mon, Jul 8, 2024 at 2:20 PM Dan Healy via slurm-users < slurm-users@lists.s

[slurm-users] Re: Best practice for jobs resuming from suspended state

2024-05-16 Thread Davide DelVento via slurm-users

I don't really have an answer for you, just responding to make your message pop out in the "flood" of other topics we've got since you posted. On our cluster we configure cancelling our jobs because it makes more sense for our situation, so I have no experience with that resume from being suspende

[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Davide DelVento via slurm-users

Not exactly the answer to your question (which I don't know) but if you can get to prefix whatever is executed with this https://github.com/NCAR/peak_memusage (which also uses getrusage) or a variant you will be able to do that. On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users < slurm-us

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-08 Thread Davide DelVento via slurm-users

👍

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Davide DelVento via slurm-users

Are you seeking something simple rather than sophisticated? If so, you can use the controller local disk for StateSaveLocation and place a cron job (on the same node or somewhere else) to take that data out via e.g. rsync and put it where you need it (NFS?) for the backup control node to use if/whe

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-02 Thread Davide DelVento via slurm-users

Hi Jason, I wanted exactly the same and was confused exactly like you. For a while it did not work, regardless of what I tried, but eventually (with some help) I figured it out. What I set up and it is working fine is this globally PreemptType = preempt/partition_prio PreemptMode=REQUEUE and th

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Davide DelVento via slurm-users

Yes, that is what we are also doing and it works well. Note that requesting a batch script for another user, one sees nothing (rather than an error message saying that one does not have permissions) On Fri, Feb 16, 2024 at 12:48 PM Paul Edmon via slurm-users < slurm-users@lists.schedmd.com> wrote:

[slurm-users] Re: Need help managing licence

2024-02-16 Thread Davide DelVento via slurm-users

The simple answer is to just add a line such as Licenses=whatever:20 and then request your users to use the -L option as described at https://slurm.schedmd.com/licenses.html This works very well, however it does not do enforcement like Slurm does with other resources. You will find posts in this

[slurm-users] Re: Compilation question

2024-02-09 Thread Davide DelVento via slurm-users

Hi Sylvain, In the spirit of "better late than never", is this still a problem? If so, is this a new install or an update? What environment/compiler are you using? The error undefined reference to `__nv_init_env' seems to indicate that you are doing something cuda-related which I think you should not

[slurm-users] Re: Memory used per node

2024-02-09 Thread Davide DelVento via slurm-users

If you would like the high watermark memory utilization after the job completes, https://github.com/NCAR/peak_memusage is a great tool. Of course it has the limitation that you need to know that you want that information *before* starting the job, which might or might not be a problem for your use cas

Re: [slurm-users] propose environment variables SLURM_STDOUT, SLURM_STDERR, SLURM_STDIN

2024-01-22 Thread Davide DelVento

I think it would be useful, yes, and mostly for the epilog script. In the job script itself, you are creating such files, so some of the proposed use cases are a bit tricky to get right in the way you described them. For example, if you scp these files, you are scp'ing them to their status before

Re: [slurm-users] preemptable queue

2024-01-12 Thread Davide DelVento

ight try setting that default of PreemptMode=CANCEL and then set > specific PreemptModes for all your partitions. That's what we do and it > works for us. > > -Paul Edmon- > On 1/12/2024 10:33 AM, Davide DelVento wrote: > > Thanks Paul, > > I don't understand

Re: [slurm-users] preemptable queue

2024-01-12 Thread Davide DelVento

work, I don't see anything in the documentation that indicates it > wouldn't. So I suspect you have a typo somewhere in your conf. > > -Paul Edmon- > On 1/11/2024 6:01 PM, Davide DelVento wrote: > > I would like to add a preemptable queue to our cluster. Actually I already

[slurm-users] preemptable queue

2024-01-11 Thread Davide DelVento

I would like to add a preemptable queue to our cluster. Actually I already have. We simply want jobs submitted to that queue be preempted if there are no resources available for jobs in other (high priority) queues. Conceptually very simple, no conditionals, no choices, just what I wrote. However i

Re: [slurm-users] Reproducible irreproducible problem (timeout?)

2023-12-20 Thread Davide DelVento

Not an answer to your question, but if the jobs need to be subdivided, why not submit smaller jobs? Also, this does not sound like a slurm problem, but rather a code or infrastructure issue. Finally, are you typically able to ssh into the main node of each subtask? In many places that is not allo

Re: [slurm-users] powersave: excluding nodes

2023-12-18 Thread Davide DelVento

the other thing was true: I had two lines one specifying job_script and the other job_comment and only the last one was honored until I noticed and consolidated them in one line, comma-separating the arguments... On Mon, Dec 11, 2023 at 9:52 AM Davide DelVento wrote: > Forgot to mention: this
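A quick way to spot this situation is to count how often a key is set. This is a sketch only: conf_key_count is a hypothetical helper, and the config path is whatever your site uses.

```shell
# Sketch: count how many times a key is set in a Slurm config file.
# conf_key_count is a hypothetical helper; the caller supplies the file path.
conf_key_count() {
    local key="$1" file="$2"
    # -i: match the key regardless of case; -c: count matching lines.
    # Lines starting with '#' are not counted because the key must come first.
    grep -ci "^[[:space:]]*${key}[[:space:]]*=" "$file"
}

# Example use (path is an assumption):
#   conf_key_count AccountingStoreFlags /etc/slurm/slurm.conf
```

A count above 1 suggests merging the values into one comma-separated line, as described in this message.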

Re: [slurm-users] [External] Re: Troubleshooting job stuck in Pending state

2023-12-12 Thread Davide DelVento

uler’s decisions on a pending job by running “qstat > -j jobid”. But there doesn’t seem to be any functional equivalent with > SLURM? > > > > Regards, > > Mike > > > > > > *From:* slurm-users *On Behalf Of > *Davide DelVento > *Sent:* Monday, Dec

Re: [slurm-users] powersave: excluding nodes

2023-12-11 Thread Davide DelVento

Forgot to mention: this is with slurm 23.02.6 (apologies for the double message) On Mon, Dec 11, 2023 at 9:49 AM Davide DelVento wrote: > Following the example from https://slurm.schedmd.com/power_save.html > regarding SuspendExcNodes > > I configured my slurm.conf with > > Su

Re: [slurm-users] Slurm powersave

2023-12-11 Thread Davide DelVento

In case it's useful to others: I've been able to get this working by having the "no action" script stop the slurmd daemon and start it *with the -b option*. On Fri, Oct 6, 2023 at 4:28 AM Ole Holm Nielsen wrote: > Hi Davide, > > On 10/5/23 15:28, Davide DelVento wro

[slurm-users] powersave: excluding nodes

2023-12-11 Thread Davide DelVento

Following the example from https://slurm.schedmd.com/power_save.html regarding SuspendExcNodes I configured my slurm.conf with SuspendExcNodes=node[01-12]:2,node[13-32]:2,node[33-34]:1,nodegpu[01-02]:1 SuspendExcStates=down,drain,fail,maint,not_responding,reserved #SuspendExcParts= (the nodes in

Re: [slurm-users] Troubleshooting job stuck in Pending state

2023-12-11 Thread Davide DelVento

By getting "stuck" do you mean the job stays PENDING forever or does eventually run? I've seen the latter (and I agree with you that I wish Slurm would log things like "I looked at this job and I am not starting it yet because") but not the former On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike wro

Re: [slurm-users] Disabling SWAP space will it effect SLURM working

2023-12-11 Thread Davide DelVento

A little late here, but yes, everything Hans said is correct, and if you are worried about slurm (or other critical system software) getting killed by OOM, you can work around it by properly configuring cgroup. On Wed, Dec 6, 2023 at 2:06 AM Hans van Schoot wrote: > Hi Joseph, > > This might depend

Re: [slurm-users] slurm power save question

2023-11-29 Thread Davide DelVento

t down while I am looking at it. > > Although, I do agree, the functionality of being able to have "keep at > least X nodes up and idle" would be nice, that is not how I see this > documented or working. > > Brian Andrus > On 11/23/2023 5:12 AM, Davide DelVento wrote

Re: [slurm-users] slurm power save question

2023-11-23 Thread Davide DelVento

ave at least X nodes up", > which includes running jobs. So it stops any wait time for the first X jobs > being submitted, but any jobs after that will need to wait for the power_up > sequence. > > Brian Andrus > On 11/22/2023 6:58 AM, Davide DelVento wrote: > >

Re: [slurm-users] Dynamic MIG Question

2023-11-22 Thread Davide DelVento

I assume you mean the sentence about dynamic MIG at https://slurm.schedmd.com/gres.html#MIG_Management Could it be supported? I think so, but only if one of their paying customers (that could be you) asks for it. On Wed, Nov 22, 2023 at 11:24 AM Aaron Kollmann < aaron.kollm...@student.hpi.de> wrot

[slurm-users] slurm power save question

2023-11-22 Thread Davide DelVento

I've started playing with powersave and have a question about SuspendExcNodes. The documentation at https://slurm.schedmd.com/power_save.html says For example nid[10-20]:4 will prevent 4 usable nodes (i.e IDLE and not DOWN, DRAINING or already powered down) in the set nid[10-20] from being powered

Re: [slurm-users] SLURM new user query, does SLURM has GUI /Web based management version also

2023-11-20 Thread Davide DelVento

Not sure if that's what you are looking for, Joseph, but I believe ClusterVisor and Bright do provide some basic Slurm management as a web GUI. I don't think either is available outside of the support for hw purchased from the respective vendors. See e.g. https://www.advancedclustering.com/products

Re: [slurm-users] job_desc.pn-min-memory in LUA jobsubmit-plugin

2023-11-17 Thread Davide DelVento

I don't have an answer for you, but I found your message in my spam folder. I brought it out and I'm replying to it in the hope that it gets some visibility in people's mailboxes. Note that in the US it's SC week and many people are or have been busy with it and will be travelling in the next days

Re: [slurm-users] REST-based CLI tools out there somewhere?

2023-11-10 Thread Davide DelVento

> > Having a large number of researchers able to run arbitrary code on the > same submit host has a marked tendency to result in an overloaded host. > There are various ways to regulate that ranging from "constant scolding" to > "aggressive quotas/cgroups/etc", but all involve some degree of > inco

Re: [slurm-users] REST-based CLI tools out there somewhere?

2023-11-09 Thread Davide DelVento

Not a direct answer to your question, but have you looked at Open OnDemand? Or maybe JupyterHub? I think most places today prefer to do either of those which provide somewhat the functionality you asked - and much more. On Thu, Nov 9, 2023 at 4:17 PM Chip Seraphine wrote: > Hello, > > Our users

Re: [slurm-users] SLURM , maximum scalable instance is which one

2023-11-01 Thread Davide DelVento

Not sure if it's the largest, but LUMI is a very large one https://www.top500.org/system/180048/ https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/ On Sun, Oct 29, 2023 at 4:16 AM John Joseph wrote: > Dear All, > Like to know that what is the maximum scalled up instance of SL

Re: [slurm-users] Sinfo options not working in SLURM 23.11

2023-10-30 Thread Davide DelVento

> > I am working on SLURM 23.11 version. > ??? Latest version is slurm-23.02.6 which one are you referring to? https://github.com/SchedMD/slurm/tags >

Re: [slurm-users] Question about gdb sbatch

2023-10-23 Thread Davide DelVento

"-g" in slurm/configure file, and I > wonder if I should add the "-g" to some other locations? > > Regards > > On Sat, Oct 21, 2023 at 12:47 AM Davide DelVento > wrote: > >> Have you compiled slurm yourself or have you installed binaries? If the

Re: [slurm-users] Question about gdb sbatch

2023-10-20 Thread Davide DelVento

Have you compiled slurm yourself or have you installed binaries? If the latter, I speculate this is not possible, in that it would not have been compiled with the required symbols (above all "-g" but probably others depending on your platform). If you compiled slurm yourself, and assuming you have

Re: [slurm-users] Correct way to do logrotation

2023-10-17 Thread Davide DelVento

I'd be interested in this too, and I'm reposting only because the message was flagged as both "dangerous email" and "spam", so people may not have seen it (hopefully my reply will not suffer the same downfall...) On Mon, Oct 16, 2023 at 3:26 AM Taras Shapovalov wrote: > Hello, > > In the past it

Re: [slurm-users] hostlist members when calling resume and suspend with powersave

2023-10-06 Thread Davide DelVento

I don't think there is such a guarantee and in fact my reading of https://slurm.schedmd.com/power_save.html#images means that most likely the nodes can and will be mingled together and your script should untangle that. But as you probably guessed from my other message, I'm new to powersave in slur

Re: [slurm-users] Slurm powersave

2023-10-05 Thread Davide DelVento

Hi Ole, Thanks for getting back to me. > the great presentation > > from our own > I presented that talk at SLUG'23 :-) > Yes! That's why I wrote "from our own", but perhaps these are local slangs where I live (and English is my second language) > > 1) I'm not sure I fully understand ReconfigF

Re: [slurm-users] enabling job script archival

2023-10-05 Thread Davide DelVento

ct 4, 2023 at 7:47 PM Davide DelVento wrote: > And weirdly enough it has now stopped working again, after I did the > experimentation for power save described in the other thread. > That is really strange. At the highest verbosity level the logs just say > > slurmdbd: deb

Re: [slurm-users] enabling job script archival

2023-10-04 Thread Davide DelVento

:192.168.2.254 CONN:13 I reconfigured and reverted stuff to no change. Does anybody have any clue? On Tue, Oct 3, 2023 at 5:43 PM Davide DelVento wrote: > For others potentially seeing this on mailing list search, yes, I needed > that, which of course required creating an account charge which I

[slurm-users] Slurm powersave

2023-10-04 Thread Davide DelVento

I'm experimenting with slurm powersave and I have several questions. I'm following the guidance from https://slurm.schedmd.com/power_save.html and the great presentation from our own https://slurm.schedmd.com/SLUG23/DTU-SLUG23.pdf I am running slurm 23.02.3 1) I'm not sure I fully understand Reco

Re: [slurm-users] enabling job script archival

2023-10-03 Thread Davide DelVento

th active > users. > > -Paul Edmon- > On 10/3/23 9:01 AM, Davide DelVento wrote: > > By increasing the slurmdbd verbosity level, I got additional information, > namely the following: > > slurmdbd: error: couldn't get information for this user (null)(x

Re: [slurm-users] enabling job script archival

2023-10-03 Thread Davide DelVento

hanks! On Mon, Oct 2, 2023 at 9:20 AM Davide DelVento wrote: > Thanks Paul, this helps. > > I don't have any PrivateData line in either config file. According to the > docs, "By default, all information is visible to all users" so this should > not be an issue. I tried

Re: [slurm-users] enabling job script archival

2023-10-02 Thread Davide DelVento

ut that didn't change the behavior. On Mon, Oct 2, 2023 at 9:10 AM Paul Edmon wrote: > At least in our setup, users can see their own scripts by doing sacct -B > -j JOBID > > I would make sure that the scripts are being stored and how you have > PrivateData set. > > -P

Re: [slurm-users] enabling job script archival

2023-10-02 Thread Davide DelVento

ssion" setting. FWIW, we use LDAP. Is that the expected behavior, in that by default only root can see the job scripts? I was assuming the users themselves should be able to debug their own jobs... Any hint on what could be changed to achieve this? Thanks! On Fri, Sep 29, 2023 at 5:48 

Re: [slurm-users] Verifying preemption WON'T happen

2023-09-29 Thread Davide DelVento

I don't really have an answer for you other than a "hallway comment", that it sounds like a good thing which I would test with a simulator, if I had one. I've been intrigued by (but really not looked much into) https://slurm.schedmd.com/SLUG23/LANL-Batsim-SLUG23.pdf On Fri, Sep 29, 2023 at 10:05 A

Re: [slurm-users] enabling job script archival

2023-09-29 Thread Davide DelVento

r each user in each location. > > Also it should be noted that there is no way to prune out job_scripts or > job_envs right now. So the only way to get rid of them if they get large is > to 0 out the column in the table. You can ask SchedMD for the mysql command > to do this a

[slurm-users] enabling job script archival

2023-09-28 Thread Davide DelVento

In my current slurm installation, (recently upgraded to slurm v23.02.3), I only have AccountingStoreFlags=job_comment I now intend to add both AccountingStoreFlags=job_script AccountingStoreFlags=job_env leaving the default 4MB value for max_script_size Do I need to do anything on the DB mysel

[slurm-users] slurmrestd memory leak

2023-08-22 Thread Davide DelVento

Has anyone else noticed this issue and knows more about it? https://bugs.schedmd.com/show_bug.cgi?id=16976 Mitigation by preventing users submitting many jobs works, but only to a point.

Re: [slurm-users] Tracking efficiency of all jobs on the cluster (dashboard etc.)

2023-07-24 Thread Davide DelVento

I run a cluster we bought from ACT and recently updated to ClusterVisor v1.0 The new version has (among many things) a really nice view of individual jobs resource utilization (GPUs, memory, CPU, temperature, etc). I did not pay attention to the overall statistics, so I am not sure how CV fares th

Re: [slurm-users] Decreasing time limit of running jobs (notification)

2023-07-10 Thread Davide DelVento

Actually rm -r does not give ANY warning, so in plain Linux "rm -r /" run as root would destroy your system without notice. Your particular Linux distro may have implemented safeguards with a shell alias such as `alias rm='rm -i'` and that's a common thing, but not guaranteed to be there On Thu, J
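Whether such a safeguard is present can be checked from the shell itself. A minimal sketch; rm_is_aliased is a made-up helper name, and it has to be defined in (or sourced into) the shell being checked, since aliases are not inherited by child scripts:

```shell
# Sketch: report whether "rm" is aliased in the current shell.
# rm_is_aliased is a hypothetical helper name.
rm_is_aliased() {
    # The alias builtin exits non-zero when no alias of that name is defined.
    alias rm >/dev/null 2>&1
}

# Example use:
#   if rm_is_aliased; then echo "rm is aliased"; else echo "plain rm"; fi
```

Even when the alias exists, it only applies to interactive use of that shell, not to scripts or to commands run through sudo.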

Re: [slurm-users] Nodes stuck in drain state

2023-05-25 Thread Davide DelVento

Can you ssh into the node and check the actual availability of memory? Maybe there is a zombie process (or a healthy one with a memory leak bug) that's hogging all the memory? On Thu, May 25, 2023 at 7:31 AM Roger Mason wrote: > Hello, > > Doug Meyer writes: > > > Could also review the node log

Re: [slurm-users] monitoring and accounting

2023-05-05 Thread Davide DelVento

At a place I worked before, we used XDMOD several years ago. It was a bit tricky to set up correctly and not exactly intuitive to get started with data collection as a user (managers, allocation specialists and other not-super-technical people were most of our users). But when familiarized with it,

Re: [slurm-users] sharing licences with non slurm workers

2023-03-24 Thread Davide DelVento

Ciao Matteo, If you look through the archives, you will see I struggled with this problem too. A few people suggested some alternatives, but in the end I did not find anything really satisfying which did not require a ton of work for me. Another piece of the story is users requesting a license bu

Re: [slurm-users] speed / efficiency of sacct vs. scontrol

2023-02-27 Thread Davide DelVento

> > And if you are seeing a workflow management system causing trouble on > > your system, probably the most sustainable way of getting this resolved > > is to file issues or pull requests with the respective project, with > > suggestions like the ones you made. For snakemake, a second good point >

Re: [slurm-users] [External] Re: actual time of start (or finish) of a job

2023-02-20 Thread Davide DelVento

> Cheers, > Florian > > From: slurm-users on behalf of Davide > DelVento > Sent: Thursday, 16 February 2023 01:40 > To: Slurm User Community List > Subject: [External] Re: [slurm-users] actual time of start (or finish) of a > job >

Re: [slurm-users] actual time of start (or finish) of a job

2023-02-15 Thread Davide DelVento

to the web version: > https://slurm.schedmd.com/sacct.html. > > Best, > > Joseph > > -- > Joseph F. Guzman - ITS (Advanced Research Computing) > > Northern Arizona University > > joseph.f.guz...@nau.edu >

[slurm-users] actual time of start (or finish) of a job

2023-02-15 Thread Davide DelVento

I have a user who needs to find the actual start (or finish) time of a number of jobs. With the elapsed field of sacct, start and finish become equivalent for his search. I see that information in /var/log/slurm/slurmctld.log, so Slurm should have it; however, in sacct itself that information does not
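For what it's worth, sacct does have Start, End and Elapsed format fields, so a thin wrapper can print them per job. This is a minimal sketch, assuming accounting is enabled; job_times is a hypothetical name:

```shell
# Sketch: print start, end and elapsed time for one job.
# job_times is a hypothetical wrapper; Start, End and Elapsed are sacct format fields.
job_times() {
    # -X: only the job allocation line, not each step
    # -n: no header; -P: pipe-delimited output that is easy to parse
    sacct -X -n -P -j "$1" --format=JobID,Start,End,Elapsed
}

# Example use (job ID is a placeholder):
#   job_times 1234
```

Dropping -X would also list the individual job steps with their own times.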

Re: [slurm-users] [EXT]Re: srun with &&, |, and > oh my!

2023-01-24 Thread Davide DelVento

Try first with small things like shell scripts you write which would tell you where the thing is running (e.g. by using hostname). Keep in mind that what would happen will most importantly depend on the shell. For example, if you use "sudo" you know that using wildcards is tricky, because your user

Re: [slurm-users] How to read job accounting data long output? `sacct -l`

2022-12-14 Thread Davide DelVento

It would be very useful if there were a way (perhaps a custom script parsing the sacct output) to provide the information in the same format as "scontrol show job" Has anybody attempted to do that? On Wed, Dec 14, 2022 at 1:25 AM Will Furnass wrote: > > If you pipe output into 'less -S' then yo

Re: [slurm-users] Prolog and job_submit

2022-10-31 Thread Davide DelVento

Thanks for helping me find workarounds. > My only other thought is that you might be able to use node features & > job constraints to communicate this without the user realising. I am not sure I understand this approach. > For instance you could declare the nodes where the software is installed

Re: [slurm-users] Prolog and job_submit

2022-10-30 Thread Davide DelVento

Hi Chris, > Unfortunately it looks like the license request information doesn't get > propagated into any prologs from what I see from a scan of the > documentation. :-( Thanks. If I am reading you right, I did notice the same thing and in fact that's why I wrote that job_submit lua script which

Re: [slurm-users] Prolog and job_submit

2022-10-29 Thread Davide DelVento

ich user will execute the scripts > > https://slurm.schedmd.com/prolog_epilog.html > > Maybe the variable isn't set for the user executing the > prolog/epilog/taskprolog > > Jeff > > ____ > From: slurm-users on behalf of Davide >

[slurm-users] Prolog and job_submit

2022-10-29 Thread Davide DelVento

My problem: grant licensed software availability to my users only if they request it on slurm; for now with local licenses. I wrote a job_submit lua script which checks job_desc.licenses and if it contains the appropriate strings it sets an appropriate SOMETHING_LICENSE_REQ environment variable.
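The decision logic described above can be sketched outside Slurm as follows. The real plugin is a Lua job_submit script, so this Python version only illustrates the string check. It assumes the license request is a comma-separated list of name[:count] items; the function names and the way the variable name is built from the license name are hypothetical.

```python
from typing import Dict, Iterable, Optional


def parse_license_request(licenses: Optional[str]) -> Dict[str, int]:
    """Parse a request such as 'matlab:2,ansys' into {'matlab': 2, 'ansys': 1}."""
    requested: Dict[str, int] = {}
    if not licenses:
        return requested
    for item in licenses.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, count = item.partition(":")
        # A missing count is treated as a request for one license (assumption).
        requested[name] = int(count) if count else 1
    return requested


def license_env(licenses: Optional[str], watched: Iterable[str]) -> Dict[str, str]:
    """Return the environment variables to set for the watched licenses requested."""
    requested = parse_license_request(licenses)
    return {f"{name.upper()}_LICENSE_REQ": "1" for name in watched if name in requested}
```

For example, a job requesting "matlab:2" while "matlab" is among the watched names would get MATLAB_LICENSE_REQ set, and a job with no license request would get nothing.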

Re: [slurm-users] How to debug a prolog script?

2022-10-29 Thread Davide DelVento

as I will describe in another thread. On Sun, Sep 18, 2022 at 11:57 PM Bjørn-Helge Mevik wrote: > > Davide DelVento writes: > > >> I'm curious: What kind of disruption did it cause for your production > >> jobs? > > > > All jobs failed and went in pendin

Re: [slurm-users] Check consistency

2022-10-12 Thread Davide DelVento

's possible that dialing up the verbosity on the > slurmd logs may give that info but I haven't seen it in normal operation. > > -Paul Edmon- > > On 10/6/22 5:47 PM, Davide DelVento wrote: > > Is there a simple way to check that what slurm is running is what the

[slurm-users] Check consistency

2022-10-06 Thread Davide DelVento

Is there a simple way to check that what slurm is running is what the config says it should be? For example, my understanding is that changing cgroup.conf should be followed by 'systemctl stop slurmd' on all compute nodes, then 'systemctl restart slurmctld' on the head node, then 'systemctl start s

Re: [slurm-users] X11 forwarding, slurm-22.05.3, hostbased auth

2022-10-06 Thread Davide DelVento

Perhaps just a very trivial question, but it doesn't look like you mentioned it: does your X-forwarding work from the login node? Maybe the X-server on your client is the problem and trying xclock on the login node would clarify that. On Wed, Oct 5, 2022 at 12:03 PM Allan Streib wrote: > > Hi everyone,

Re: [slurm-users] Detecting non-MPI jobs running on multiple nodes

2022-09-29 Thread Davide DelVento

At my previous job there were cron jobs running everywhere measuring possibly idle cores which were eventually averaged out for the duration of the job, and reported (the day after) via email to the user support team. I believe they stopped doing so when compute became (relatively) cheap at the exp

Re: [slurm-users] slurm jobs and and amount of licenses (matlab)

2022-09-26 Thread Davide DelVento

Are your licenses used only for the slurm cluster(s) or are they shared with laptops, workstations and/or other computing equipment not managed by slurm? In the former case, the "local" licenses described in the documentation will do the trick (but slurm does not automatically enforce their use, so

Re: [slurm-users] remote license

2022-09-16 Thread Davide DelVento

scales well, but it looks like you have a rather beginner cluster that > would never be impacted by such choices. > > Brian Andrus > > > On 9/16/2022 10:00 AM, Davide DelVento wrote: > > Thanks Brian. > > > > I am still perplexed. What is a database to install, a

Re: [slurm-users] remote license

2022-09-16 Thread Davide DelVento

ironment. The 2nd step would be dependent on what things you are > tracking within that. > > Brian Andrus > > > On 9/16/2022 5:01 AM, Davide DelVento wrote: > > So if I understand correctly, this "remote database" is something that > > is neither part of slu

Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Davide DelVento

Thanks a lot. > > Does it need the execution permission? Is root alone sufficient? > > slurmd runs as root, so it only needs exec perms for root. Perfect. That must have been it, then, since my script (like the example one) did not have the execution permission set. > I'm curious: What kind of disrup

Re: [slurm-users] How to debug a prolog script?

2022-09-16 Thread Davide DelVento

Thanks to both of you. > Permissions on the file itself (and the directories in the path to it) Does it need the execution permission? Is root alone sufficient? > Existence of the script on the nodes (prologue is run on the nodes, not the > head) Yes, it's in a shared filesystem. > Not sure

Re: [slurm-users] remote license

2022-09-16 Thread Davide DelVento

a > certain number are allowed by each cluster and change that if needed. > > If you got creative, you could keep the license count that is in the > database updated to match the number free from flexlm to stop license > starvation due to users outside slurm using them up so they

[slurm-users] remote license

2022-09-15 Thread Davide DelVento

I am a bit confused by remote licenses. https://lists.schedmd.com/pipermail/slurm-users/2020-September/006049.html (which is only 2 years old) claims that they are just a counter, so like local licenses. Then why call them remote? Only a few days after, this https://lists.schedmd.com/pipermail/sl


1 - 100 of 111 matches
