[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Christopher Samuel via slurm-users
On 4/10/24 10:41 pm, archisman.pathak--- via slurm-users wrote:
> In our case, that node has been removed from the cluster and cannot be added back right now (is being used for some other work). What can we do in such a case?
Mark the node as "DOWN" in Slurm; this is what we do when we get job
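A minimal sketch of the "mark the node DOWN" suggestion, assuming the standard `scontrol update NodeName=... State=DOWN Reason=...` form. The helper name `mark_node_down_cmd`, the node name `node042`, and the reason text are hypothetical placeholders, not values from this thread. The helper only builds the command line so it can be reviewed before an administrator runs it.

```python
import shlex


def mark_node_down_cmd(node: str, reason: str) -> list[str]:
    """Build the argv for marking a node DOWN with scontrol.

    Slurm expects a Reason when an administrator sets a node DOWN.
    This function only constructs the command; it does not execute it.
    """
    return [
        "scontrol",
        "update",
        f"NodeName={node}",
        "State=DOWN",
        f"Reason={reason}",
    ]


# Hypothetical node name and reason, for illustration only.
cmd = mark_node_down_cmd("node042", "removed from cluster")
print(shlex.join(cmd))  # shell-quoted form, ready to paste on the controller
```

Running the printed command requires Slurm administrator privileges on the cluster.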

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread archisman.pathak--- via slurm-users
In our case, that node has been removed from the cluster and cannot be added back right now (it is being used for some other work). What can we do in such a case? -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread archisman.pathak--- via slurm-users
Could you give more details regarding this and how you debugged the same?

[slurm-users] Re: Upgrading nodes

2024-04-10 Thread Brian Andrus via slurm-users
Yes. You can build the EL8 RPMs on EL9. Look at 'mock' to do so. I did something similar when I still had to support EL7. It is a fairly generic plan; the devil is in the details and in verifying each step, but those are the basic bases you need to touch. Brian Andrus
On 4/10/2024 1:48 PM, Steve Berg via slurm-users

[slurm-users] Upgrading nodes

2024-04-10 Thread Steve Berg via slurm-users
I just finished migrating a few dozen blade servers from Torque to Slurm. They're all currently running Alma 8 with the Slurm that is available from EPEL. I do want to get it all upgraded to Alma 9 and the current version of Slurm. Got one system set up as the slurmctld system runnin

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Cutts, Tim via slurm-users
We have Weka filesystems on one of our clusters and saw this; we discovered we had slightly misconfigured the Weka client, with the result that Weka’s and Slurm’s cgroups were fighting with each other, and this seemed to be the outcome. Fixing the Weka cgroups config improved the problem, for u

[slurm-users] Re: Jobs of a user are stuck in Completing stage for a long time and cannot cancel them

2024-04-10 Thread Paul Edmon via slurm-users
Usually to clear jobs like this you have to reboot the node they are on. That will then force the scheduler to clear them. -Paul Edmon-
On 4/10/2024 2:56 AM, archisman.pathak--- via slurm-users wrote:
> We are running a slurm cluster with version `slurm 22.05.8`. One of our users has reported t

[slurm-users] Re: single node configuration

2024-04-10 Thread Steffen Grunewald via slurm-users
On Tue, 2024-04-09 at 11:07:32 -0700, Slurm users wrote:
> Hi everyone, I'm conducting some tests. I've just set up SLURM on the head node and haven't added any compute nodes yet. I'm trying to test it to ensure it's working, but I'm encountering an error: 'Nodes required for the job are DOWN

[slurm-users] Re: Avoiding fragmentation

2024-04-10 Thread Williams, Jenny Avis via slurm-users
Various options that might help reduce job fragmentation. Turn up debugging on slurmctld and add the DebugFlags like TraceJobs, SelectType, and Steps. With debugging set high enough one can see a good bit of the logic in regard to node selection. CR_LLN Schedule
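As a rough sketch of the debugging suggestion above, the commands for raising the slurmctld log level and enabling the named DebugFlags at runtime could be assembled as follows, assuming the `scontrol setdebug` and `scontrol setdebugflags` subcommands. The helper name `debug_cmds` and the chosen level `debug2` are illustrative assumptions; the message does not say which level to use.

```python
import shlex


def debug_cmds(level: str, flags: list[str]) -> list[list[str]]:
    """Build scontrol invocations that raise slurmctld logging.

    The first command sets the log level; each following command enables
    one DebugFlag (the leading '+' turns a flag on, '-' would turn it off).
    Nothing is executed here.
    """
    cmds = [["scontrol", "setdebug", level]]
    cmds += [["scontrol", "setdebugflags", f"+{flag}"] for flag in flags]
    return cmds


# Flags named in the message; the level "debug2" is an assumed choice.
for argv in debug_cmds("debug2", ["TraceJobs", "SelectType", "Steps"]):
    print(shlex.join(argv))
```

Changes made this way are runtime-only; persistent settings would go in slurm.conf (`SlurmctldDebug` and `DebugFlags`). Verbose logging grows the slurmctld log quickly, so it is usually reverted once the node-selection behaviour has been observed.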

[slurm-users] visualisation of JobComp and JobacctGather data with Grafana - screenshots, ideas?

2024-04-10 Thread Josef Dvoracek via slurm-users
Does anybody here have a nice visualization of JobComp and JobacctGather data in Grafana? I save JobComp data in Elasticsearch and JobacctGather data in InfluxDB, and I am thinking about how to provide meaningful insights to $users. Things I'd like to show: especially memory & cpu utilization, job r