Actually I hit send too quickly, what I meant (assuming bash) is
for a in $(scontrol show hostname whatever_list); do touch $a; done
with the same whatever_list being $SLURM_JOB_NODELIST
On Fri, Feb 14, 2025 at 1:18 PM Davide DelVento
wrote:
Not sure I completely understand what you need, but if I do... How about
touch whatever_prefix_$(scontrol show hostname whatever_list)
where whatever_list could be your $SLURM_JOB_NODELIST ?
On Fri, Feb 14, 2025 at 9:42 AM John Hearns via slurm-users <
slurm-users@lists.schedmd.com> wrote:
I believe that, in the absence of other reasons, Slurm assigns nodes to jobs in
the order they are listed in the partition definitions of slurm.conf -- perhaps
for whatever reason node 41 appears there before node 01?
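For instance, a purely illustrative slurm.conf fragment (node names and counts
are made up) showing the kind of ordering I mean:
NodeName=node[41-48] CPUs=64 RealMemory=256000
NodeName=node[01-40] CPUs=64 RealMemory=256000
PartitionName=batch Nodes=node[41-48],node[01-40] Default=YES State=UP
Here node41 is listed first in the partition, so it would tend to be picked
before node01.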
On Thu, Jan 9, 2025 at 7:24 AM Dan Healy via slurm-users <
slurm-users@lists.schedmd.com> wrote:
Wonderful. Thanks Ole for the reminder! I had bookmarked your wiki (of
course!) but forgot to check it out in this case. I'll add a more prominent
reminder to self in my notes to always check it!
Happy new year everybody once again
On Tue, Jan 7, 2025 at 1:58 AM Ole Holm Nielsen via slurm-users <
slurm-users@lists.schedmd.com> wrote:
Found it, I should have asked my puppet, as it's mandatory in some places
:-D
It is simply
scontrol show hostname gpu[01-02],node[03-04,12-22,27-32,36]
Sorry for the noise
On Mon, Jan 6, 2025 at 12:55 PM Davide DelVento
wrote:
Hi all,
I remember seeing on this list a slurm command to change a slurm-friendly
list such as
gpu[01-02],node[03-04,12-22,27-32,36]
into a bash friendly list such as
gpu01
gpu02
node03
node04
node12
etc
I made a note about it but I can't find my note anymore, nor the relevant
message. Can somebody remind me what it was?
Good sleuthing.
It would be nice if Slurm would say something like
Reason=Priority_Lower_Than_Job_ so people will immediately find the
culprit in such situations. Has anybody with a SchedMD subscription ever
asked for something like that, or are there reasons for which it'd be
impossible (or t
Mmmm, from https://slurm.schedmd.com/sbatch.html
> By default both standard output and standard error are directed to a file
of the name "slurm-%j.out", where the "%j" is replaced with the job
allocation number.
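For example, these defaults can be overridden per job; a minimal sketch (the
file names are just examples):
#!/bin/bash
#SBATCH --output=myjob_%j.out   # stdout; %j expands to the job ID
#SBATCH --error=myjob_%j.err    # stderr goes to a separate file
srun hostname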
Perhaps at your site there's a configuration which uses separate error
files? See the
Ciao Diego,
I find it extremely hard to understand situations like this. I wish Slurm
were clearer in reporting what it is doing, but I digress...
I suspect that there are other job(s) which have higher priority than this
one which are supposed to run on that node but cannot start because
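One way to check (a sketch; adjust the filters to your site) is to list the
pending jobs sorted by priority, or to look at each job's priority breakdown:
squeue -t PENDING --sort=-p,i -o "%.18i %.9P %.10Q %.20R %j"   # %Q is the priority
sprio -l                                                       # per-job priority factors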
Another possible use case of this is a regular MPI job where the
first/controller task often uses more memory than the workers and may need
to be scheduled on a higher-memory node than they are. I think I saw this
happening in the past, but I'm not 100% sure it was in Slurm or some other
scheduling system.
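If it was Slurm, a heterogeneous job is one way to express that; a minimal
sketch (the task counts and memory values are made up):
#!/bin/bash
#SBATCH --ntasks=1 --mem=64G     # component 0: the controller / rank 0
#SBATCH hetjob
#SBATCH --ntasks=31 --mem=16G    # component 1: the workers
srun --het-group=0,1 ./mpi_app   # one MPI launch spanning both components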
Not sure if I understand your use case, but if I do, I am not sure Slurm
provides that functionality.
If it doesn't (and if my understanding is correct), you can still achieve
your goal by:
1) removing sbatch and salloc from the users' PATH
2) writing your own custom scripts named sbatch (and hard/s
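A very rough sketch of such a wrapper (purely illustrative; the real binary
path and the extra logic are site-specific assumptions):
#!/bin/bash
# hypothetical wrapper named sbatch, placed earlier in the users' PATH
REAL_SBATCH=/usr/bin/sbatch        # assumption: location of the real binary
# ... site-specific checks or option rewriting go here ...
exec "$REAL_SBATCH" "$@"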
Slurm 18? Isn't that a bit outdated?
On Fri, Sep 27, 2024 at 9:41 AM Robert Kudyba via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> We're in the process of upgrading but first we're moving to RHEL 9. My
> attempt to compile using rpmbuild -v -ta --define "_lto_cflags %{nil}"
> slurm-18.
Thanks everybody once again and especially Paul: your job_summary script
was exactly what I needed, served on a golden plate. I just had to
modify/customize the date range and change the following line (I can make a
PR if you want, but it's such a small change that it'd take more time to
deal with
Ciao Fabio,
That for sure is syntactically incorrect, because of the way sbatch parsing
works: as soon as it finds a non-empty, non-comment line (your first srun) it
will stop parsing for #SBATCH directives. So assuming this is a single file
as it looks from the formatting, the second hetjob and the cl
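To illustrate the rule with a generic sketch (not your actual script):
#!/bin/bash
#SBATCH --ntasks=4        # parsed: we are still in the header
srun hostname             # first non-empty, non-comment line: parsing stops here
#SBATCH --mem=8G          # NOT parsed: from sbatch's point of view this is just a comment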
owing that the problem won't happen again in the future.
Thanks and have a great weekend
On Fri, Aug 23, 2024 at 8:00 AM Ole Holm Nielsen via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hi Davide,
>
> On 8/22/24 21:30, Davide DelVento via slurm-users wrote:
> >
I am confused by the amount of Down and PLND Down time reported by sreport.
According to it, our cluster would have had a significant amount of
downtime, which I know didn't happen (or, according to the documentation
"time that slurmctld was not responding", see
https://slurm.schedmd.com/sreport.html)
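For reference, the kind of report I am looking at (a sketch; the dates are
placeholders):
sreport cluster utilization start=2024-07-01 end=2024-08-01 -t percent
with the Down and PLND Down columns being the ones that puzzle me.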
Hi Ole,
On Wed, Aug 21, 2024 at 1:06 PM Ole Holm Nielsen via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> The slurmacct script can actually break down statistics by partition,
> which I guess is what you're asking for? The usage of the command is:
>
Yes, this is almost what I was asking for.
> inside jobs to emulate a login session, causing a heavy load on your
> servers.
>
> /Ole
>
Thanks Kevin and Simon,
The full thing that you do is indeed overkill, however I was able to learn
how to collect/parse some of the information I need.
What I am still unable to get is:
- utilization by queue (or list of node names), to track actual use of
expensive resources such as GPUs, high-memory nodes, and so on.
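For instance, something along these lines (a sketch; the partition name and
dates are placeholders) would pull per-partition job records out of the
accounting database:
sacct -a -X -r gpu -S 2024-08-01 -E 2024-09-01 \
      -o JobID,User,Partition,AllocTRES%60,Elapsed,State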
Since each instance of the program is independent and you are using one
core for each, it'd be better to let Slurm deal with that and schedule
them concurrently as it sees fit. Maybe you simply need to add some
directive to allow jobs to share the same node.
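A sketch of what I mean (the script name and count are made up): submit each
instance as its own one-core job and let the scheduler pack them:
for i in $(seq 1 10); do
    sbatch --ntasks=1 --cpus-per-task=1 --wrap="./my_program $i"
done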
Alternatively (if at your site jobs m
g text output of squeue command)
>
> cheers
>
> josef
I too would be interested in some lightweight scripts. XDMOD in my
experience has been very labor-intensive to install, maintain and
learn. It's great if one needs that level of interactivity, granularity and
detail, but for some "quick and dirty" summary in a small dept it's not
only overkill,
How about SchedMD itself? They are the ones doing most (if not all) of the
development, and they are great.
In my experience, the best options are either SchedMD or the vendor of your
hardware.
On Mon, Aug 12, 2024 at 11:17 PM John Joseph via slurm-users <
slurm-users@lists.schedmd.com> wrote:
>
I am pretty sure that with vanilla Slurm it is impossible.
What might be possible (maybe) is submitting 5-core jobs and using some
pre/post scripts which, immediately before the job starts, change the
requested number of cores to "however many are currently available on the
node where it is scheduled to run".
In part, it depends on how it's been configured, but have you tried
--exclusive?
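For example, something like this (a sketch; the core count is arbitrary and
the executable name is a placeholder) gives the whole node to a single
multi-threaded process:
#!/bin/bash
#SBATCH --exclusive               # no other jobs share the node
#SBATCH --nodes=1 --ntasks=1
#SBATCH --cpus-per-task=32        # or however many cores the node has
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_threaded_app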
On Thu, Aug 1, 2024 at 7:39 AM Henrique Almeida via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> Hello, everyone, with slurm, how to allocate a whole node for a
> single multi-threaded process?
>
>
> https:
I think the best way to do it would be to schedule the 10 things as a
single Slurm job and then use one of the various MPMD approaches (the nitty
gritty details depend on whether each executable is serial, OpenMP, MPI or hybrid).
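For the simplest case (independent serial or threaded executables; the names
are placeholders), one sketch is a single job running several concurrent steps:
#!/bin/bash
#SBATCH --ntasks=10
for prog in ./prog1 ./prog2 ./prog3; do    # ...and so on up to prog10
    srun --ntasks=1 --exact $prog &        # --exclusive on older Slurm versions
done
wait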
On Mon, Jul 8, 2024 at 2:20 PM Dan Healy via slurm-users <
slurm-users@lists.s
I don't really have an answer for you, just responding to make your message
pop out in the "flood" of other topics we've got since you posted.
On our cluster we configured preemption to cancel jobs because it makes more
sense for our situation, so I have no experience with resuming from being
suspended.
Not exactly the answer to your question (which I don't know), but if you can
prefix whatever is executed with this
https://github.com/NCAR/peak_memusage (which also uses getrusage) or a
variant you will be able to do that.
On Thu, May 16, 2024 at 4:10 PM Emyr James via slurm-users <
slurm-us
Are you seeking something simple rather than sophisticated? If so, you can
use the controller local disk for StateSaveLocation and place a cron job
(on the same node or somewhere else) to take that data out via e.g. rsync
and put it where you need it (NFS?) for the backup control node to use
if/when needed.
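A rough sketch of the idea (the paths are placeholders, not a recommendation):
# slurm.conf on the controller
StateSaveLocation=/var/spool/slurmctld
# crontab entry, e.g. every minute, pushing the state to shared storage
* * * * * rsync -a --delete /var/spool/slurmctld/ /nfs/slurm_state_backup/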
Hi Jason,
I wanted exactly the same and was confused exactly like you. For a while it
did not work, regardless of what I tried, but eventually (with some help) I
figured it out.
What I set up, and it is working fine, is this globally:
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
and th
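The other piece is on the partition definitions themselves; a sketch of what
that typically looks like (the partition names and tiers are made up):
PartitionName=low  Nodes=node[01-10] PriorityTier=1  PreemptMode=REQUEUE
PartitionName=high Nodes=node[01-10] PriorityTier=10 PreemptMode=OFF
so that jobs in "high" can requeue jobs running in "low", but not vice versa.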
Yes, that is what we are also doing and it works well.
Note that when requesting the batch script of another user's job, one sees
nothing (rather than an error message saying that one does not have permission).
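For anybody finding this in the archives, I assume the command in question is
something like (the job ID and file name are placeholders):
scontrol write batch_script 12345 -              # print job 12345's script to stdout
scontrol write batch_script 12345 job12345.sh    # or dump it to a file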
On Fri, Feb 16, 2024 at 12:48 PM Paul Edmon via slurm-users <
slurm-users@lists.schedmd.com> wrote:
The simple answer is to just add a line such as
Licenses=whatever:20
and then request your users to use the -L option as described at
https://slurm.schedmd.com/licenses.html
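Concretely, a sketch (the license name and counts are placeholders):
# slurm.conf
Licenses=whatever:20
# and on the user side, request 2 of them:
sbatch -L whatever:2 job.sh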
This works very well; however, it does not do enforcement the way Slurm does
with other resources. You will find posts in this
Hi Sylvain,
In the spirit of "better late than never": is this still a problem?
If so, is this a new install or an update?
What environment/compiler are you using? The error
undefined reference to `__nv_init_env'
seems to indicate that you are doing something cuda-related which I think
you should not
If you would like the high watermark memory utilization after the job
completes, https://github.com/NCAR/peak_memusage is a great tool. Of course
it has the limitation that you need to know that you want that information
*before* starting the job, which might or might not be a problem for your use
case.