[slurm-users] Re: Create filenames based on slurm hosts

2025-02-14 Thread Davide DelVento via slurm-users
Actually I hit send too quickly; what I meant (assuming bash) is for a in $(scontrol show hostname whatever_list); do touch $a; done with the same whatever_list being $SLURM_JOB_NODELIST On Fri, Feb 14, 2025 at 1:18 PM Davide DelVento wrote: > Not sure I completely understand what you need, bu
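A runnable sketch of that loop, assuming bash; outside of a Slurm job a literal host list stands in here for what `scontrol show hostname $SLURM_JOB_NODELIST` would print (one name per line), and the `marker_` prefix is an illustrative assumption:

```shell
# Stand-in for: nodes=$(scontrol show hostname "$SLURM_JOB_NODELIST")
nodes="node01 node02 node03"

# Create one (empty) file per allocated host
for a in $nodes; do
    touch "marker_$a"
done
```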

[slurm-users] Re: Create filenames based on slurm hosts

2025-02-14 Thread Davide DelVento via slurm-users
Not sure I completely understand what you need, but if I do... How about touch whatever_prefix_$(scontrol show hostname whatever_list) where whatever_list could be your $SLURM_JOB_NODELIST ? On Fri, Feb 14, 2025 at 9:42 AM John Hearns via slurm-users < slurm-users@lists.schedmd.com> wrote: > I

[slurm-users] Re: Unexpected node got allocation

2025-01-09 Thread Davide DelVento via slurm-users
I believe in absence of other reasons, slurm assigns nodes to jobs in the order they are listed in the partition definitions of slurm.conf -- perhaps for whatever reason the node 41 appears first there, rather than 01? On Thu, Jan 9, 2025 at 7:24 AM Dan Healy via slurm-users < slurm-users@lists.sc

[slurm-users] Re: formatting node names

2025-01-07 Thread Davide DelVento via slurm-users
Wonderful. Thanks Ole for the reminder! I had bookmarked your wiki (of course!) but forgot to check it out in this case. I'll add a more prominent reminder to self in my notes to always check it! Happy new year everybody once again On Tue, Jan 7, 2025 at 1:58 AM Ole Holm Nielsen via slurm-users <

[slurm-users] Re: formatting node names

2025-01-06 Thread Davide DelVento via slurm-users
Found it, I should have asked my puppet, as it's mandatory in some places :-D It is simply scontrol show hostname gpu[01-02],node[03-04,12-22,27-32,36] Sorry for the noise On Mon, Jan 6, 2025 at 12:55 PM Davide DelVento wrote: > Hi all, > I remember seeing on this list a slurm command to cha
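For the simple single-range case, plain bash brace expansion mimics what `scontrol show hostname` produces (this is a bash stand-in for illustration only; the Slurm command also handles the mixed comma-separated form and prints one host per line):

```shell
# bash brace expansion: gpu{01..02} expands to gpu01 gpu02, etc.
hosts=$(echo gpu{01..02} node{03..04})
echo "$hosts"   # gpu01 gpu02 node03 node04
```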

[slurm-users] formatting node names

2025-01-06 Thread Davide DelVento via slurm-users
Hi all, I remember seeing on this list a slurm command to change a slurm-friendly list such as gpu[01-02],node[03-04,12-22,27-32,36] into a bash friendly list such as gpu01 gpu02 node03 node04 node12 etc I made a note about it but I can't find my note anymore, nor the relevant message. Can some

[slurm-users] Re: Job not starting

2024-12-10 Thread Davide DelVento via slurm-users
Good sleuthing. It would be nice if Slurm would say something like Reason=Priority_Lower_Than_Job_ so people would immediately find the culprit in such situations. Has anybody with a SchedMD subscription ever asked for something like that, or are there reasons for which it'd be impossible (or t

[slurm-users] Re: error and output files

2024-12-09 Thread Davide DelVento via slurm-users
Mmmm, from https://slurm.schedmd.com/sbatch.html > By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. Perhaps at your site there's a configuration which uses separate error files? See the
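In other words, the default is a single combined file; separate streams have to be requested explicitly, either by the user or by a site configuration. A hedged sketch of the job-script directives that would do it:

```
#SBATCH --output=slurm-%j.out   # standard output (this name is the default anyway)
#SBATCH --error=slurm-%j.err    # standard error in its own file (not the default)
```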

[slurm-users] Re: Job not starting

2024-12-06 Thread Davide DelVento via slurm-users
Ciao Diego, I find it extremely hard to understand situations like this. I wish Slurm were clearer in reporting what it is doing, but I digress... I suspect that there are other job(s) which have higher priority than this one, which are supposed to run on that node but cannot start because

[slurm-users] Re: Change primary alloc node

2024-10-31 Thread Davide DelVento via slurm-users
Another possible use case of this is a regular MPI job where the first/controller task often uses more memory than the workers and may need to be scheduled on a higher memory node than them. I think I saw this happening in the past, but I'm not 100% sure it was in Slurm or some other scheduling sys

[slurm-users] Re: Job pre / post submit scripts

2024-10-28 Thread Davide DelVento via slurm-users
Not sure I understand your use case, but if I do, I am not sure Slurm provides that functionality. If it doesn't (and if my understanding is correct), you can still achieve your goal by: 1) removing sbatch and salloc from users' PATH 2) writing your own custom scripts named sbatch (and hard/s
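A minimal sketch of that wrapper idea; the hook bodies, the `REAL_SBATCH` override, and all names below are assumptions for illustration, not anything from the thread:

```shell
# A site script installed as 'sbatch' earlier in users' PATH than the real binary.
sbatch_wrapper() {
    local real_sbatch=${REAL_SBATCH:-/usr/bin/sbatch}
    echo "pre-submit hook: $*"      # site-specific checks would go here
    "$real_sbatch" "$@" || return $?
    echo "post-submit hook: $*"     # site-specific logging would go here
}
```

In real use the function body would be the script itself, with the pre hook able to reject the submission before the real sbatch ever runs.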

[slurm-users] Re: errors compiling Slurm 18 on RHEL 9: [Makefile:577: scancel] Error 1 & It's not recommended to have unversioned Obsoletes

2024-09-27 Thread Davide DelVento via slurm-users
Slurm 18? Isn't that a bit outdated? On Fri, Sep 27, 2024 at 9:41 AM Robert Kudyba via slurm-users < slurm-users@lists.schedmd.com> wrote: > We're in the process of upgrading but first we're moving to RHEL 9. My > attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}" > slurm-18.

[slurm-users] Re: Print Slurm Stats on Login

2024-08-28 Thread Davide DelVento via slurm-users
Thanks everybody once again and especially Paul: your job_summary script was exactly what I needed, served on a silver platter. I just had to modify/customize the date range and change the following line (I can make a PR if you want, but it's such a small change that it'd take more time to deal with

[slurm-users] Re: Spread a multistep job across clusters

2024-08-26 Thread Davide DelVento via slurm-users
Ciao Fabio, That is for sure syntactically incorrect, because of the way sbatch parsing works: as soon as it finds a non-empty, non-comment line (your first srun), it stops parsing for #SBATCH directives. So assuming this is a single file, as it looks from the formatting, the second hetjob and the cl

[slurm-users] Re: Slurmdbd purge and reported downtime

2024-08-23 Thread Davide DelVento via slurm-users
owing that the problem won't happen again in the future. Thanks and have a great weekend On Fri, Aug 23, 2024 at 8:00 AM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hi Davide, > > On 8/22/24 21:30, Davide DelVento via slurm-users wrote: > >

[slurm-users] Slurmdbd purge and reported downtime

2024-08-22 Thread Davide DelVento via slurm-users
I am confused by the reported amount of Down and PLND Down by sreport. According to it, our cluster would have had a significant amount of downtime, which I know didn't happen (or, according to the documentation "time that slurmctld was not responding", see https://slurm.schedmd.com/sreport.html)

[slurm-users] Re: Print Slurm Stats on Login

2024-08-21 Thread Davide DelVento via slurm-users
Hi Ole, On Wed, Aug 21, 2024 at 1:06 PM Ole Holm Nielsen via slurm-users < slurm-users@lists.schedmd.com> wrote: > The slurmacct script can actually break down statistics by partition, > which I guess is what you're asking for? The usage of the command is: > Yes, this is almost what I was askin

[slurm-users] Re: Print Slurm Stats on Login

2024-08-21 Thread Davide DelVento via slurm-users
; > inside jobs to emulate a login session, causing a heavy load on your > servers. > > /Ole > > On 8/21/24 01:13, Davide DelVento via slurm-users wrote: > > Thanks Kevin and Simon, > > > > The full thing that you do is indeed overkill, however I was able to > l

[slurm-users] Re: Print Slurm Stats on Login

2024-08-20 Thread Davide DelVento via slurm-users
Thanks Kevin and Simon, The full thing that you do is indeed overkill, however I was able to learn how to collect/parse some of the information I need. What I am still unable to get is: - utilization by queue (or list of node names), to track actual use of expensive resources such as GPUs, high

[slurm-users] Re: Unable to run sequential jobs simultaneously on the same node

2024-08-19 Thread Davide DelVento via slurm-users
Since each instance of the program is independent and you are using one core for each, it'd be better to let slurm deal with that and schedule them concurrently as it sees fit. Maybe you simply need to add some directive to allow shared jobs on the same node. Alternatively (if at your site jobs m

[slurm-users] Re: Print Slurm Stats on Login

2024-08-14 Thread Davide DelVento via slurm-users
g text output of squeue command) > > cheers > > josef > > -- > *From:* Davide DelVento via slurm-users > *Sent:* Wednesday, 14 August 2024 01:52 > *To:* Paul Edmon > *Cc:* Reid, Andrew C.E. (Fed) ; Jeffrey T Frey < > f...@udel.edu>; slurm-users@lists.schedm

[slurm-users] Re: Print Slurm Stats on Login

2024-08-13 Thread Davide DelVento via slurm-users
I too would be interested in some lightweight scripts. XDMOD in my experience has been very intense in workload to install, maintain and learn. It's great if one needs that level of interactivity, granularity and detail, but for some "quick and dirty" summary in a small dept it's not only overkill,

[slurm-users] Re: Seeking Commercial SLURM Subscription Provider

2024-08-13 Thread Davide DelVento via slurm-users
How about SchedMD itself? They are the ones doing most (if not all) of the development, and they are great. In my experience, the best options are either SchedMD or the vendor of your hardware. On Mon, Aug 12, 2024 at 11:17 PM John Joseph via slurm-users < slurm-users@lists.schedmd.com> wrote: >

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-02 Thread Davide DelVento via slurm-users
I am pretty sure this is impossible with vanilla slurm. What might be possible (maybe) is submitting 5-core jobs and using some pre/post scripts which, immediately before the job starts, change the requested number of cores to "however many are currently available on the node where it is scheduled to run".

[slurm-users] Re: With slurm, how to allocate a whole node for a single multi-threaded process?

2024-08-01 Thread Davide DelVento via slurm-users
In part, it depends on how it's been configured, but have you tried --exclusive? On Thu, Aug 1, 2024 at 7:39 AM Henrique Almeida via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello, everyone, with slurm, how to allocate a whole node for a > single multi-threaded process? > > > https:
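For reference, a hedged sketch of how that would look in a job script (whether this actually grants every core to the one process also depends on the cluster's configuration, as noted above):

```
#SBATCH --exclusive   # request the whole node for this job
```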

[slurm-users] Re: Can SLURM queue different jobs to start concurrently?

2024-07-08 Thread Davide DelVento via slurm-users
I think the best way to do it would be to schedule the 10 things as a single slurm job and then use one of the various MPMD approaches (the nitty-gritty details depend on whether each executable is serial, OpenMP, MPI, or hybrid). On Mon, Jul 8, 2024 at 2:20 PM Dan Healy via slurm-users < slurm-users@lists.s
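One standard MPMD mechanism is srun's multi-prog mode; a hedged sketch, with file contents and executable names being illustrative assumptions:

```
# multi.conf maps task ranks to executables:
#   0-3  ./model_a
#   4-9  ./model_b
# then inside the single job script, launch all of them as one step:
srun --ntasks=10 --multi-prog multi.conf
```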

[slurm-users] Re: Best practice for jobs resuming from suspended state

2024-05-16 Thread Davide DelVento via slurm-users
I don't really have an answer for you, just responding to make your message pop out in the "flood" of other topics we've got since you posted. On our cluster we configured jobs to be cancelled because it makes more sense for our situation, so I have no experience with that resume from being suspende

[slurm-users] Re: memory high water mark reporting

2024-05-16 Thread Davide DelVento via slurm-users
Not exactly the answer to your question (which I don't know), but if you can prefix whatever is executed with https://github.com/NCAR/peak_memusage (which also uses getrusage), or a variant, you will be able to do that. On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users < slurm-us

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-08 Thread Davide DelVento via slurm-users
{ "emoji": "👍", "version": 1 } -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: StateSaveLocation and Slurm HA

2024-05-07 Thread Davide DelVento via slurm-users
Are you seeking something simple rather than sophisticated? If so, you can use the controller's local disk for StateSaveLocation and place a cron job (on the same node or somewhere else) to take that data out via e.g. rsync and put it where you need it (NFS?) for the backup control node to use if/whe
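A hedged sketch of that cron idea; the paths and the one-minute interval are assumptions, not anything from the thread:

```
# crontab entry on the primary controller: mirror the local
# StateSaveLocation to shared storage for the backup controller
* * * * * rsync -a --delete /var/spool/slurmctld/ /nfs/slurm/statesave-copy/
```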

[slurm-users] Re: Partition Preemption Configuration Question

2024-05-02 Thread Davide DelVento via slurm-users
Hi Jason, I wanted exactly the same and was confused exactly like you. For a while it did not work, regardless of what I tried, but eventually (with some help) I figured it out. What I set up, and it is working fine, is this: globally, PreemptType=preempt/partition_prio and PreemptMode=REQUEUE, and th
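For reference, the global part described in that message would look like this in slurm.conf; the message is truncated before the per-partition part, so the partition lines below are assumptions for illustration:

```
# Global settings, as described in the thread:
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Hypothetical partitions: a higher PriorityTier preempts a lower one
PartitionName=high Nodes=... PriorityTier=10 PreemptMode=off
PartitionName=low  Nodes=... PriorityTier=1  PreemptMode=requeue
```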

[slurm-users] Re: Recover Batch Script Error

2024-02-16 Thread Davide DelVento via slurm-users
Yes, that is what we are also doing and it works well. Note that when requesting the batch script of another user's job, one sees nothing (rather than an error message saying that one does not have permission). On Fri, Feb 16, 2024 at 12:48 PM Paul Edmon via slurm-users < slurm-users@lists.schedmd.com> wrote:

[slurm-users] Re: Need help managing licence

2024-02-16 Thread Davide DelVento via slurm-users
The simple answer is to just add a line such as Licenses=whatever:20 and then request your users to use the -L option as described at https://slurm.schedmd.com/licenses.html This works very well, however it does not do enforcement like Slurm does with other resources. You will find posts in this
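For reference, the slurm.conf line described there, with the license name and count being the thread's own placeholders:

```
# slurm.conf: define a site-wide pool of 20 licenses named "whatever"
Licenses=whatever:20
```

Users would then request them at submission time, e.g. `sbatch -L whatever:2 job.sh`; as noted above, Slurm counts these against the pool but does not enforce against the actual license server.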

[slurm-users] Re: Compilation question

2024-02-09 Thread Davide DelVento via slurm-users
Hi Sylvain, In the spirit of better late than never: is this still a problem? If so, is this a new install or an update? What environment/compiler are you using? The error undefined reference to `__nv_init_env' seems to indicate that you are doing something cuda-related which I think you should not

[slurm-users] Re: Memory used per node

2024-02-09 Thread Davide DelVento via slurm-users
If you would like the high-watermark memory utilization after the job completes, https://github.com/NCAR/peak_memusage is a great tool. Of course it has the limitation that you need to know that you want that information *before* starting the job, which might or might not be a problem for your use cas