Hi folks,
I'm getting fed up with receiving out-of-office replies to Slurm job state mails.
Given that by default slurmctld just calls /bin/mail (aka mailx on our
systems), which doesn't allow command line options to add headers such
as 'Auto-Submitted: auto-generated' to help educate auto-responders,
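One possible workaround (not from the original message): slurm.conf has a
MailProg option, so slurmctld can be pointed at a small wrapper script that
adds the header itself before handing the message to sendmail. A minimal
sketch, assuming slurmctld invokes the program as 'prog -s "<subject>"
<recipient>' with the body on stdin, and using a hypothetical path
/usr/local/sbin/slurm-mail:

#!/bin/bash
# Hypothetical MailProg wrapper: set MailProg=/usr/local/sbin/slurm-mail in slurm.conf.
# Assumes slurmctld calls it as: slurm-mail -s "<subject>" <recipient>, body on stdin.
subject=""
if [ "$1" = "-s" ]; then
    subject="$2"
    shift 2
fi
recipient="$1"
{
    echo "To: ${recipient}"
    echo "Subject: ${subject}"
    echo "Auto-Submitted: auto-generated"   # RFC 3834 hint for auto-responders
    echo
    cat                                     # message body as written by slurmctld
} | /usr/sbin/sendmail -t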
Still stuck with this; maybe this gives someone an idea. I tried
resetting the RawUsage by forcing Slurm to regenerate assoc_usage, and
although the file was generated, the RawUsage for all users is now stuck at
0. This makes me think there is a communication problem with slurmdbd
(which through sre
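Not part of the original message, but a few standard commands that might help
confirm whether slurmctld can actually reach slurmdbd and whether usage is
accumulating again (a diagnostic sketch, not a fix):

# Watch the RawUsage column to see whether usage starts accumulating again:
sshare -a -l | head

# Confirm slurmctld/slurmdbd connectivity and the accounting configuration:
sacctmgr show cluster
scontrol show config | grep -i AccountingStorage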
And it looks like I'll have to wait till 20.11 for a fix:
https://bugs.schedmd.com/show_bug.cgi?id=9035
Looks like a similar issue is being tracked by:
https://bugs.schedmd.com/show_bug.cgi?id=9441
Sorry, I meant to say: our Slurm nodehealth script pushed the node to
a failed state. Slurm itself wasn't doing this.
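A hypothetical sketch of what that kind of node-health check can look like
(not the actual script from this thread), assuming it runs on the node via
HealthCheckProgram in slurm.conf and drains the node when the detected memory
falls below an assumed expected value:

#!/bin/bash
# Hypothetical node-health check; expected_mb is an assumed per-node value.
expected_mb=64000
actual_mb=$(free -m | awk '/^Mem:/ {print $2}')
if [ "$actual_mb" -lt "$expected_mb" ]; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN \
        Reason="node health: only ${actual_mb}MB of memory detected"
fi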
I just upgraded from v18 to v20. Did something change in the node
config validation? It used to be that if I started slurm on a compute
node that had lower than expected memory or was missing GPUs, Slurm
would push a node into a failed state that I could see in sinfo -R.
Now it seems to be loggi
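Not from the original thread, but a couple of standard commands that help
compare what the node actually reports against what slurm.conf declares:

slurmd -C                        # print the hardware slurmd detects, as a NodeName line
sinfo -R                         # list drained/down nodes together with the Reason
scontrol show node <nodename>    # check the State, RealMemory, Gres and Reason fields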
Hello Durai,
you did not specify the amount of memory in your node configuration.
Perhaps it defaults to 1MB and so your 1MB-job already uses all the
memory that the scheduler thinks the node has...?
What does "scontrol show node slurm-gpu-1" say? Look for the
"RealMemory" field in the output.
Hello,
this is my node configuration:
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users Stat
Hi Herbert,
just like Angelos described, we also have logic in our poweroff script that
checks if the node is really IDLE and only sends the poweroff command if that's
the case.
Excerpt:
hosts=$(scontrol show hostnames "$1")
for host in $hosts; do
    # (excerpt cut off in the digest; presumably the State field is checked
    #  for IDLE here before the poweroff command is sent)
    scontrol show node "$host" | tr ' ' '\n' | grep '^State='
done
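For context, an assumption on my part since the excerpt doesn't show it: such
a script is typically hooked up as the power-saving SuspendProgram in
slurm.conf, e.g. with a hypothetical path:

SuspendProgram=/usr/local/sbin/node-poweroff.sh
SuspendTime=600          # seconds of idleness before slurmctld calls the program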