You were right: the slurm.conf file differed between the controller node and
the compute nodes, so I've synchronized it now. I'm also considering setting
up an epilogue script to help debug what happens after a job finishes. Do you
happen to have any examples of what an epilogue script might look like? The
sketch below is roughly what I had in mind.
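
A minimal sketch, assuming the script is registered in slurm.conf via
Epilog=/etc/slurm/epilog.sh (that path and the log file below are just
placeholders I picked for illustration):

#!/bin/bash
# Epilog: runs as root on each compute node after a job completes.
# slurmd exports SLURM_JOB_ID, SLURM_JOB_USER and SLURMD_NODENAME here.
LOG=/var/log/slurm/epilog.log
{
    echo "$(date '+%Y-%m-%dT%H:%M:%S') job=$SLURM_JOB_ID user=$SLURM_JOB_USER node=$SLURMD_NODENAME"
    # List anything the job's user still has running; lingering
    # processes at this point are what lead to "Kill task failed".
    ps -u "$SLURM_JOB_USER" -o pid,stat,etime,comm --no-headers
} >> "$LOG" 2>&1
# A non-zero exit status from an epilog drains the node, so exit cleanly.
exit 0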

However, I'm now running into a different issue: two nodes have been drained
with reason "Kill task failed" (from sinfo -R):

REASON               USER      TIMESTAMP           NODELIST
Kill task failed     root      2024-10-21T09:27:05 nodemm04
Kill task failed     root      2024-10-21T09:27:40 nodemm06

I also checked the slurmd logs on the affected nodes and found the following
entries:

On nodemm04:

[2024-10-21T09:27:06.000] [223608.extern] error: *** EXTERN STEP FOR 223608 
STEPD TERMINATED ON nodemm04 AT 2024-10-21T09:27:05 DUE TO JOB NOT ENDING WITH 
SIGNALS ***

On nodemm06:

[2024-10-21T09:27:40.000] [223828.extern] error: *** EXTERN STEP FOR 223828 
STEPD TERMINATED ON nodemm06 AT 2024-10-21T09:27:39 DUE TO JOB NOT ENDING WITH 
SIGNALS ***

So it looks like slurmd signaled those jobs but their processes never exited
("JOB NOT ENDING WITH SIGNALS"), and the nodes were drained as a result. Any
thoughts on what could be causing this?
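
One thing I'm considering trying, as a sketch rather than anything I've
verified: raising UnkillableStepTimeout and pointing UnkillableStepProgram at
a small script that records what is still alive when slurmd gives up. Both
parameters are real slurm.conf options; the timeout value and paths below are
placeholders:

UnkillableStepTimeout=180
UnkillableStepProgram=/etc/slurm/unkillable.sh

where /etc/slurm/unkillable.sh could be something like:

#!/bin/bash
# Runs when slurmd decides a step cannot be killed. Processes stuck in
# uninterruptible sleep (D state) ignore SIGKILL, so log those.
{
    date
    ps -eo pid,user,stat,wchan,comm | awk '$3 ~ /D/'
} >> /var/log/slurm/unkillable.log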

Thanks for your help!
