I left out a critical
let task+=1
in that pseudo-loop 😁
griznog
On Sun, Sep 15, 2024 at 6:44 PM John Hanks wrote:
> No ideas on fixing this in Slurm, but in userspace in the past when faced
> with huge array jobs which had really short jobs like this I've nudged them
> toward batching up array elements in each job to extend it.
No ideas on fixing this in Slurm, but in userspace in the past when faced
with huge array jobs which had really short jobs like this I've nudged them
toward batching up array elements in each job to extend it. Say the user
wants to run 5 tasks, 30 seconds each. Batching those up in groups of
10
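A rough, untested sketch of that batching (the task count, group size, and per-task script name are made up for illustration), including the `let task+=1` increment that was missing from the pseudo-loop:

#!/bin/bash
#SBATCH --array=0-499            # e.g. 5000 short tasks, 10 per array element (assumed numbers)

tasks_per_element=10
task=$(( SLURM_ARRAY_TASK_ID * tasks_per_element ))
end=$(( task + tasks_per_element ))

while [ "$task" -lt "$end" ]; do
    ./run_one_task.sh "$task"    # hypothetical per-task script
    let task+=1                  # the increment left out of the original pseudo-loop
done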
Hi Fritz,
This is purely theoretical and untested, but it may work to "cp
/usr/bin/sshd /usr/bin/sshd2" and then use that sshd2 binary to run an sshd
service on a different port, with a config limiting it to sftp only and a
`/etc/pam.d/sshd2` file that does not enforce pam_slurm_adopt. Downside i
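Purely as an illustration of that untested idea (the port, paths, and unit name below are assumptions), the point being that PAM's service name follows the binary name, so the copied sshd2 reads /etc/pam.d/sshd2 instead of /etc/pam.d/sshd:

# copy the binary so the PAM service name becomes "sshd2"
cp /usr/bin/sshd /usr/bin/sshd2

# /etc/ssh/sshd2_config (assumed path/port): sftp-only on its own port
#   Port 2200
#   UsePAM yes
#   Subsystem sftp internal-sftp
#   ForceCommand internal-sftp

# /etc/pam.d/sshd2: copy of /etc/pam.d/sshd with the
#   "account required pam_slurm_adopt.so" line removed

# start the second daemon, e.g. from a copied systemd unit
/usr/bin/sshd2 -f /etc/ssh/sshd2_config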
I've done something similar by having the epilog touch a file and then having
the node health check (LBNL NHC) act on that file's presence/contents later to do
the heavy lifting. There's a window of time/delay where the reason is
"Epilog error" before the health check corrects it, but if that's tolerable
this mak
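A hedged sketch of that pattern (the flag file path and check name are invented for this example, not something NHC ships):

# Epilog: defer the expensive work, just leave a marker
touch /var/run/slurm-cleanup-needed

# /etc/nhc/scripts/site_cleanup.nhc: custom NHC check
function check_deferred_cleanup() {
    if [[ -e /var/run/slurm-cleanup-needed ]]; then
        # ... do the real cleanup work here ...
        rm -f /var/run/slurm-cleanup-needed
    fi
    return 0
}

# /etc/nhc/nhc.conf entry to run it on every node:
#  * || check_deferred_cleanup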
Thanks, Greg! This looks like the right way to do this. I will have to stop
putting off learning to use spank plugins :)
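Note for anyone else starting out: spank plugins get enabled through /etc/slurm/plugstack.conf. The plugin path and argument below are placeholders (check the plugin's README for its real options), but the line format is the standard one:

# /etc/slurm/plugstack.conf
# <optional|required>  <plugin .so path>              [plugin args]
required  /usr/lib64/slurm/private-tmpdir.so  base=/local/scratch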
griznog
On Wed, Apr 6, 2022 at 1:40 AM Greg Wickham wrote:
> Hi John, Mark,
>
> We use a spank plugin
> https://gitlab.com/greg.wickham/slurm-spank-private-tmpdir (this w
I've thought-experimented this in the past, wanting to do the same thing
but haven't found any way to get a /dev/shm or a tmpfs into a job's cgroups
to be accounted against the job's allocation. The best I have come up with
is creating a per-job tmpfs from a prolog, removing it from an epilog, and setting
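For what it's worth, the prolog/epilog half of that is simple even if the accounting part isn't; roughly (the mount point and size are arbitrary, and SLURM_JOB_ID is available in the prolog/epilog environment):

# Prolog (runs as root on each allocated node)
mkdir -p "/tmp/job_${SLURM_JOB_ID}"
mount -t tmpfs -o size=8G tmpfs "/tmp/job_${SLURM_JOB_ID}"

# Epilog
umount "/tmp/job_${SLURM_JOB_ID}"
rmdir "/tmp/job_${SLURM_JOB_ID}"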
Do you have a matching Gres=gpu:4 or similar in your node config lines? I'm
not sure if that is still required, but we have it in our config, which does
isolate GPUs to the jobs they are assigned to.
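A generic example of the relevant lines (node names, counts, and device paths are placeholders, not our actual config):

# slurm.conf
GresTypes=gpu
NodeName=gpu[01-04] Gres=gpu:4 CPUs=64 RealMemory=512000 State=UNKNOWN

# gres.conf on the GPU nodes
Name=gpu File=/dev/nvidia[0-3]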
griznog
On Wed, Mar 23, 2022 at 9:45 AM wrote:
> Hi, all:
>
> We found a problem that sl
Hi Philippe,
The built-in --x11 support still has some rough edges. Due to a
timeout issue with getting xauth set up, I've been patching SLURM with
these minor changes:
--- src/common/run_command.c_orig 2018-02-06 15:03:10.0 -0800
+++ src/common/run_command.c 2018-02-06 15:26:23.525
Hi,
Short answer: scontrol update jobid=JOBID arraytaskthrottle=NEWLIMIT
Long answer: https://bugs.schedmd.com/show_bug.cgi?id=1863
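For example (the job id and limits are made up), the throttle can be set at submit time with the % suffix and then raised on the running array:

sbatch --array=1-10000%50 job.sh                     # at most 50 array tasks running at once
scontrol update jobid=123456 arraytaskthrottle=200   # raise the limit on the running job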
jbh
On Sat, Feb 24, 2018 at 5:55 AM, Bill Barth wrote:
> We don’t allow array jobs (we have our own tools for packing small jobs
> into bigger ones), so I can
I've used this with some success:
https://github.com/JohannesBuchner/verynice. For CPU-intensive things it
works great, but you also have to set some memory limits in limits.conf if
users do any large-memory stuff. Otherwise I just use a problem process as
a chance to start a conversation with that
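The limits.conf side is just the usual pam_limits format; a sketch (the group and the 16 GB address-space cap are arbitrary):

# /etc/security/limits.conf   ("as" = address space, value in KB)
@users    hard    as    16777216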