I have a SLURM configuration of 2 hosts with 6 + 4 CPUs.
I am submitting jobs with sbatch -n.
However, I see that even when I have exhausted all 10 CPU slots with running
jobs, it still allows subsequent jobs to run!
CPU slot availability is also shown as full for the 2 hosts. No job is shown as pending.
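For reference, my understanding is that enforcing CPU counts relies on
consumable resource tracking in slurm.conf; a minimal sketch of what I think
the relevant lines should look like for my two hosts (node and partition names
here are just placeholders):

    SelectType=select/cons_tres
    SelectTypeParameters=CR_CPU
    NodeName=node01 CPUs=6 State=UNKNOWN
    NodeName=node02 CPUs=4 State=UNKNOWN
    PartitionName=main Nodes=node01,node02 Default=YES State=UP

With that in place, a submission such as

    sbatch -n 4 --wrap="sleep 300"

should consume 4 of the 10 CPU slots, and once all 10 are allocated, further
jobs should stay pending rather than start.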
Dear SLURM Users and Administrators,
I am interested in a way to customize the job submission exit statuses (mainly
error codes) after the job has already been queued by the SLURM controller. We
aim to provide more user-friendly messages and reminders in case of any errors
or obstacles.
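So far, the closest mechanism I have found is an EpilogSlurmctld script that
inspects the job's exit code after it has run; the sketch below only
illustrates that idea (the script path and log tag are placeholders of my own,
and I am assuming SLURM_JOB_EXIT_CODE is exported to this script), so please
correct me if there is a better route:

    #!/bin/bash
    # EpilogSlurmctld sketch, referenced from slurm.conf as:
    #   EpilogSlurmctld=/etc/slurm/epilog_ctld.sh    (path is a placeholder)
    # Runs on the controller after each job completes.
    # Assumption: slurmctld exports SLURM_JOB_ID and SLURM_JOB_EXIT_CODE here.
    if [ "${SLURM_JOB_EXIT_CODE:-0}" -ne 0 ]; then
        logger -t slurm-feedback \
            "Job ${SLURM_JOB_ID} finished with exit code ${SLURM_JOB_EXIT_CODE}; check job limits and input paths."
    fi
    exit 0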
You were right: I found that the slurm.conf file was different between the
controller node and the compute nodes, so I've synchronized it now. I was also
considering setting up an epilogue script to help debug what happens after the
job finishes. Do you happen to have any examples of what an epilogue script
might look like?
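Something along these lines is what I had in mind, just logging the basics to
syslog so I can trace what happened after each job (the path and log tag are
placeholders I made up):

    #!/bin/bash
    # Node epilog sketch, referenced from slurm.conf as:
    #   Epilog=/etc/slurm/epilog.sh                  (path is a placeholder)
    # Runs on each allocated node via slurmd after the job ends.
    logger -t slurm-epilog \
        "job=${SLURM_JOB_ID} uid=${SLURM_JOB_UID} node=$(hostname -s) finished"
    # An epilog should exit 0; a non-zero exit can cause the node to be drained.
    exit 0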
Apologies for the trouble.
Just discovered that I had made some temporary tweaks in the code which were
preventing the reservation of the resources. These were supposed to be reverted
after testing, which I missed! This in turn allowed all the jobs to run.
Please ignore the query.
-Bhaskar.
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
> It seems like there's an issue with the termination process on these nodes.
> Any thoughts on what could be causing this?
That usually means processes wedged in the kernel for some reason, in an
uninterruptible sleep state. You can define an UnkillableStepProgram (together
with UnkillableStepTimeout) in slurm.conf to gather more information when a job
step cannot be killed.
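A quick way to spot such processes on an affected node is something like:

    # List processes stuck in uninterruptible sleep (STAT contains 'D'),
    # along with the kernel wait channel they are blocked in.
    ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /D/'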