On 2/14/19 8:02 AM, Mahmood Naderan wrote:
One job is in the RH state, which means JobHoldMaxRequeue. The output file specified by --output shows nothing suspicious. Is there any way to analyze the stuck job?
This happens when a job has failed to start MAX_BATCH_REQUEUE times (currently 5), at which point Slurm holds it instead of requeueing it again.
Check your controller (slurmctld) log and the slurmd logs on the nodes the job was sent to, to see what goes wrong when Slurm tries to start it.
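
If the logs are large, a small filter can help pull out only the lines for the stuck job. This is just a rough sketch: the function name lines_for_job is made up for illustration, and the pattern assumes the job shows up as "JobId=<id>" or "job <id>" / "job_<id>" in the log lines, which can differ between Slurm versions. The log locations are whatever SlurmctldLogFile and SlurmdLogFile are set to in your slurm.conf.

```python
import re
from typing import Iterable, List


def lines_for_job(lines: Iterable[str], job_id: int) -> List[str]:
    """Return the log lines that mention the given job ID.

    Assumption: the job appears as 'JobId=<id>', 'job <id>' or 'job_<id>'.
    Adjust the pattern if your Slurm version logs it differently.
    """
    # The trailing \b keeps job 1234 from also matching job 12345.
    pattern = re.compile(rf"(?:JobId=|\bjob[ _]){job_id}\b", re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

You would open the slurmctld log (and then the slurmd log on each node involved), pass the file object to lines_for_job together with the job ID, and read the matches in order. "scontrol show job <jobid>" shows the hold reason as well, and once the underlying problem is fixed, "scontrol release <jobid>" should let the job be scheduled again.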
All the best, Chris