[slurm-users] Alternatives for MailProg

2020-08-26 Thread Andrew Elwell
Hi folks, I'm getting fed up with receiving out-of-office replies to Slurm job state mails. By default slurmctld just calls /bin/mail (aka mailx on our systems), which doesn't allow command-line options to add headers such as 'Auto-Submitted: auto-generated' to help educate auto-responders,
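One common workaround is to point MailProg at a small wrapper that rebuilds the message with the RFC 3834 header and hands it to sendmail, which (unlike mailx) accepts arbitrary headers. A minimal sketch, assuming slurmctld invokes MailProg mailx-style (-s "subject" recipient, body on stdin); the wrapper path and the Precedence header are assumptions, not from the thread:

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. in slurm.conf:  MailProg=/usr/local/sbin/slurm-mail
SENDMAIL=${SENDMAIL:-/usr/sbin/sendmail}

# build_message SUBJECT RECIPIENT... < body  -> full RFC 822 message on stdout
build_message() {
    subject=$1; shift
    printf 'To: %s\n' "$*"
    printf 'Subject: %s\n' "$subject"
    printf 'Auto-Submitted: auto-generated\n'  # RFC 3834: autoresponders should not reply
    printf 'Precedence: bulk\n'                # belt-and-braces for older responders
    printf '\n'
    cat                                        # body arrives on stdin from slurmctld
}

main() {
    subject=""
    while getopts 's:' opt; do
        case $opt in s) subject=$OPTARG ;; esac
    done
    shift $((OPTIND - 1))
    build_message "$subject" "$@" | "$SENDMAIL" -t
}

if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

sendmail -t reads the recipients from the To: header we just wrote, so no extra arguments are needed.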

Re: [slurm-users] sshare RawUsage vs sreport usage

2020-08-26 Thread Stephan Schott
Still stuck with this; maybe this gives someone an idea. I tried resetting the RawUsage by forcing Slurm to regenerate assoc_usage, and though the file was generated, the RawUsage for all users is now stuck at 0. This makes me think there is a communication problem with slurmdbd (which through sre

Re: [slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
and it looks like i'll have to wait till 20.11 for a fix https://bugs.schedmd.com/show_bug.cgi?id=9035 On Wed, Aug 26, 2020 at 11:20 AM Michael Di Domenico wrote: > > looks like a similar issue is being tracked by: > https://bugs.schedmd.com/show_bug.cgi?id=9441 > > On Wed, Aug 26, 2020 at 11:04

Re: [slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
looks like a similar issue is being tracked by: https://bugs.schedmd.com/show_bug.cgi?id=9441 On Wed, Aug 26, 2020 at 11:04 AM Michael Di Domenico wrote: > > sorry i meant to say, our slurm nodehealth script pushed the node to > failed state. slurm itself wasn't doing this > > On Wed, Aug 26, 20

Re: [slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
Sorry, I meant to say: our Slurm node-health script pushed the node to a failed state; Slurm itself wasn't doing this. On Wed, Aug 26, 2020 at 11:02 AM Michael Di Domenico wrote: > > i just upgraded from v18 to v20. Did something change in the node > config validation? it used to be that if i start

[slurm-users] slurm_rpc_node_registration invalid argument

2020-08-26 Thread Michael Di Domenico
I just upgraded from v18 to v20. Did something change in the node config validation? It used to be that if I started Slurm on a compute node that had lower than expected memory or was missing GPUs, Slurm would push the node into a failed state that I could see in sinfo -R. Now it seems to be logging
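For anyone hitting the same change: the FastSchedule knob that used to govern how strictly slurmctld compares a registering node against its slurm.conf definition was deprecated around 19.05, and its "trust the config file" mode now lives in SlurmdParameters. A hedged slurm.conf sketch (the node line is illustrative, not from this thread):

```
# slurm.conf sketch -- node values are hypothetical
SlurmdParameters=config_overrides   # roughly the old FastSchedule=2: use the
                                    # configured resources even if the node
                                    # registers with less memory / fewer GPUs
NodeName=node[01-04] CPUs=32 RealMemory=191000 Gres=gpu:2 State=UNKNOWN
```

Without config_overrides, a node that registers low is normally drained with a "Low ..." reason, which is the behaviour the sinfo -R check above relies on.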

Re: [slurm-users] CR_Core_Memory behavior

2020-08-26 Thread Christoph BrĂ¼ning
Hello Durai, you did not specify the amount of memory in your node configuration. Perhaps it defaults to 1 MB, and so your 1 MB job already uses all the memory that the scheduler thinks the node has...? What does "scontrol show node slurm-gpu-1" say? Look for the "RealMemory" field in the output.
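To pull that field out quickly, a tiny helper (illustrative, not part of Slurm) that extracts RealMemory from scontrol's output:

```shell
# Illustrative: print the RealMemory value (in MB) from `scontrol show node`
# output read on stdin.
real_memory_mb() {
    tr ' ' '\n' | awk -F= '$1 == "RealMemory" { print $2 }'
}

# On a live cluster:
#   scontrol show node slurm-gpu-1 | real_memory_mb
# A node line with no RealMemory= in slurm.conf defaults to 1 (MB), in which
# case a single 1 MB job really does occupy all the memory Slurm knows about.
```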

Re: [slurm-users] CR_Core_Memory behavior

2020-08-26 Thread Durai Arasan
Hello, this is my node configuration:

NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1 Gres=gpu:0 State=UNKNOWN
PartitionName=gpu Nodes=slurm-gpu-1 Default=NO MaxTime=INFINITE AllowAccounts=whitelist,gpu_users Stat
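If the memory is simply undeclared, adding RealMemory to the node lines should fix the CR_Core_Memory behaviour. A sketch with assumed figures (the 64 GB / 8 GB values are placeholders, not from the thread; substitute what free -m reports on each node):

```
# slurm.conf sketch -- RealMemory figures are assumptions
NodeName=slurm-gpu-1 NodeAddr=192.168.0.200 Procs=16 RealMemory=64000 Gres=gpu:2 State=UNKNOWN
NodeName=slurm-gpu-2 NodeAddr=192.168.0.124 Procs=1  RealMemory=8000  Gres=gpu:0 State=UNKNOWN
```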

Re: [slurm-users] [External] [slurm 20.02.3] don't suspend nodes in down state

2020-08-26 Thread Florian Zillner
Hi Herbert, just like Angelos described, we also have logic in our poweroff script that checks if the node is really IDLE and only sends the poweroff command if that's the case. Excerpt:

hosts=$(scontrol show hostnames $1)
for host in $hosts; do
    scontrol show node $host | tr ' ' '\n' |
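The truncated pipeline above could be fleshed out along these lines; the set of states to exclude is an assumption (adjust to taste), and on a live cluster the input comes from scontrol show node:

```shell
# Sketch of the IDLE check: reads `scontrol show node <host>` output on stdin
# and succeeds only for a plain idle node (not DRAIN/DOWN/POWER*/etc.).
node_is_idle() {
    tr ' ' '\n' | awk -F= '
        $1 == "State" { state = $2 }
        END { exit ((state ~ /^IDLE/ && state !~ /DRAIN|DOWN|POWER|NOT_RESPONDING/) ? 0 : 1) }'
}

# Usage sketch inside a SuspendProgram (poweroff_cmd is hypothetical):
#   for host in $(scontrol show hostnames "$1"); do
#       scontrol show node "$host" | node_is_idle && poweroff_cmd "$host"
#   done
```

Exiting from awk's END block turns the parsed state directly into the function's return code, so the caller can use it in a plain && guard.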