Keeping /etc/passwd and /etc/group synced to all the nodes should work. You
will also need to set up an SSH key for MPI.
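For example, something like this may work (just a sketch; the node name is a
placeholder, and it assumes home directories are shared so one key pair is
enough):

scp /etc/passwd /etc/group node01:/etc/
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys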
Best,
Feng
On Mon, Feb 10, 2025 at 10:29 PM mark.w.moorcroft--- via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> If you set up slurm elastic cloud in EC2 without LDAP, wha
You can also check https://github.com/prod-feng/slurm_tools
slurm_job_perf_show.py may be helpful.
I used to use slurm_job_perf_show_email.py to send emails to users
summarizing their usage, e.g. monthly, but some users seemed to get
confused, so I stopped.
Best,
Feng
On Fri, Aug 9, 20
Yes, the algorithm should be like that: 1 CPU (core) per job (task).
As someone mentioned already, you need to set OverSubscribe (e.g.
OverSubscribe=FORCE:10) on the partition in slurm.conf, meaning 10 jobs on
each core in your case.
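Something like this in slurm.conf may do it (a sketch; the partition and node
names are placeholders):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PartitionName=batch Nodes=node[01-10] OverSubscribe=FORCE:10 State=UP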
Best,
Feng
On Fri, Jun 21, 2024 at 6:52 AM Arnuld via slurm-users
wrote:
>
> > Every job will need
Hi All,
I am having trouble calculating the real RSS memory usage of certain
users' jobs, for which sacct returns wrong numbers.
Rocky Linux release 8.5, Slurm 21.08
(slurm.conf)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
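For example, when checking with something like:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,AveRSS,Elapsed,State

the MaxRSS reported there looks wrong for these jobs.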
The troubling jobs are like:
1. python
Do you have any container setup?
On Tue, May 14, 2024 at 3:57 PM Feng Zhang wrote:
>
> Not sure, very strange, while the two linux-vdso.so.1 looks different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> linux-vdso.so.1 (0x7ffde81ee000)
>
>
> [deej@mo
> libdl.so.2 => /lib64/libdl.so.2 (0x14a9d82c9000)
> libm.so.6 => /lib64/libm.so.6 (0x14a9d7f25000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x14a9d82ae000)
> libc.so.6 => /lib64/libc.so.6 (0x14a9d7c0)
> /lib64/ld-linux-x86-64.so.2 (0x14a9d8306000)
>
>
Looks more like a runtime environment issue.
Check the binaries:
ldd /mnt/local/ollama/ollama
on both clusters; comparing the output may give some hints.
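For example, something like (the other cluster's hostname is a placeholder):

diff <(ldd /mnt/local/ollama/ollama) <(ssh othercluster ldd /mnt/local/ollama/ollama)

should quickly show which libraries resolve differently.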
Best,
Feng
On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
wrote:
>
> I'm running into a strange issue and I'm hoping anoth
Try
scontrol update NodeName=heimdall state=DOWN Reason="gpu issue"
and then
scontrol update NodeName=heimdall state=RESUME
to see if it works. Probably the Slurm daemon is just having a hiccup
after you made changes.
Best,
Feng
On Mon, Oct 16, 2023 at 10:43 AM Gregor Hagelueken
wrote:
>
> Hi,
Reading the pasted slurm.conf info again, it includes "AllowAccounts,
AllowGroups", so it seems Slurm actually takes this into account. So I think
it should work...
Best,
Feng
On Thu, Sep 21, 2023 at 2:33 PM Feng Zhang wrote:
>
> As I said I am not sure, but it depends
threshold."
Best,
Feng
On Thu, Sep 21, 2023 at 11:48 AM Bernstein, Noam CIV USN NRL (6393)
Washington DC (USA) wrote:
>
> On Sep 21, 2023, at 11:37 AM, Feng Zhang wrote:
>
> Set slurm.conf parameter: EnforcePartLimits=ANY or NO may help this, not sure.
>
>
> Hmm, in
Setting the slurm.conf parameter EnforcePartLimits=ANY or NO may help with this; not sure.
Best,
Feng
On Thu, Sep 21, 2023 at 11:27 AM Jason Simms wrote:
>
> I personally don't think that we should assume users will always know which
> partitions are available to them. Ideally, of course, th
drain reason=stuck; scontrol
> update nodename= state=resume
>
>
> Best
> Marcus
>
> On 20.09.2023 at 09:11, Ole Holm Nielsen wrote:
> > On 9/20/23 01:39, Feng Zhang wrote:
> >> Restarting the slurmd dameon of the compute node should work, if the
> >> node is
Restarting the slurmd daemon on the compute node should work, if the
node is still online and normal.
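For example (assuming systemd, and that you can ssh to the node; the node
name is a placeholder):

ssh <nodename> systemctl restart slurmd
scontrol show node <nodename>   # then check the node state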
Best,
Feng
On Tue, Sep 19, 2023 at 8:03 AM Felix wrote:
>
> Hello
>
> I have a job on my system which is running more than its time, more than
> 4 days.
>
> 1808851 debug gridjob atlas01
You can try a command like the following (excluding the node you want to
remove from the list):
scontrol update PartitionName=mypart Nodes=node[1-90],ab,ac
"Changing the Nodes in a partition has no effect upon jobs that have
already begun execution."
Best,
Feng
On Fri, Aug 4, 2023 at 10:47 AM Pacey, Mike wrote:
>
> Hi folks,
>
>
Looks like Slurm itself only supports that format (in MB units). Slurm
commands' output format is not very user friendly to me. It would be nice if
it added some easy options; for example, for the sinfo output in this email
thread, how about adding support for lazy options, like sinfo -ABC, etc.?
For the des
Very interesting issue.
I am guessing there might be a workaround: since oryx has 2 GPUs, you could
define both of them but disable the GT 710? Does Slurm support this?
Best,
Feng
On Tue, Jun 27, 2023 at 9:54 AM Wilson, Steven M wrote:
>
> Hi,
>
> I manually configure the
Mohamad,
It seems you need to upgrade GCC on the GPU nodes of clusters A and C.
The error message says that srun needs newer GCC libs. Or you can
downgrade your Slurm (e.g. recompile it using GCC 2.27 or older) on clusters
A/C.
Best,
Feng
On Tue, Jul 4, 2023 at 2:46 PM mohammed shambakey
Hello all,
I am doing some tests with Slurm. I just found that when I run the
srun command with the -n and -c options and -n and -c are odd
numbers, the srun job hangs and no shell is given to me. When I check
using "squeue", it reports that this job is actually running.
When -c = even number
Not sure if it works, but you can try using "\${SLURM_ARRAY_JOB_ID}".
The "\" escapes the early evaluation of the env variable.
On Thu, Nov 10, 2022 at 6:53 PM Chase Schuette wrote:
>
> Due to needing to support existing HPC workflows. I have a need to pass a
> bash script within a python subp