Davide DelVento <davide.quan...@gmail.com> writes:

> 2. How to debug the issue?

I'd try capturing all stdout and stderr from the script into a file on
the compute node, for instance like this:

    exec &> /root/prolog_slurmd.$$
    set -x    # To print out all commands

before any other commands in the script. The "prolog_slurmd.<pid>" file
will then contain a log of all commands executed in the script, along
with all output (stdout and stderr). If there is no
"prolog_slurmd.<pid>" file after the job has been scheduled, then, as
has been pointed out by others, slurm wasn't able to exec the prolog at
all.
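For reference, a complete prolog might then start out something like
the sketch below. The mkdir/chown part is just a made-up placeholder
for whatever your real prolog does, and the environment variables
(SLURM_JOB_ID, SLURM_JOB_USER) are what I'd expect slurmd to export to
the prolog; check the prolog_epilog documentation for your version:

    #!/bin/bash
    # Debugging aid: capture everything this prolog does into a
    # per-invocation log file on the compute node.
    exec &> /root/prolog_slurmd.$$
    set -x

    # Placeholder for the real prolog work, e.g. creating a per-job
    # scratch directory (hypothetical paths and variables).
    SCRATCH="/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
    mkdir -p "$SCRATCH"
    chown "$SLURM_JOB_USER" "$SCRATCH"

    # A non-zero exit status is what makes slurm report "Prolog or job
    # env setup failure" and drain the node, so exit 0 explicitly on
    # success.
    exit 0

If /root/prolog_slurmd.<pid> shows up but the node still drains, the
last command in the trace before the failure is usually the culprit.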
> Even increasing the debug level the
> slurmctld.log contains simply a "error: validate_node_specs: Prolog or
> job env setup failure on node xxx, draining the node" message, without
> even a line number or anything.

Slurm only executes the prolog script. It doesn't parse or evaluate it
itself, so it has no way of knowing what fails inside the script.

> 3. And more generally, how to debug a prolog (and epilog) script
> without disrupting all production jobs? Unfortunately we can't have
> another slurm install for testing, is there a sbatch option to force
> utilizing a prolog script which would not be executed for all the
> other jobs? Or perhaps making a dedicated queue?

I tend to reserve a node, install the updated prolog scripts there, and
run test jobs asking for that reservation; see the example commands
below. (Otherwise one could always set up a small cluster of VMs and
use that for simpler testing.)
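In case it's useful, the reservation approach is just standard slurm
commands; the names below (prologtest, node001, myuser, test_job.sh)
are placeholders:

    # Reserve one node for yourself (here for 120 minutes).
    scontrol create reservation ReservationName=prologtest \
        Users=myuser Nodes=node001 StartTime=now Duration=120

    # After installing the updated prolog on node001, submit test jobs
    # into the reservation; production jobs won't land on that node.
    sbatch --reservation=prologtest test_job.sh

    # Remove the reservation when you're done testing.
    scontrol delete ReservationName=prologtest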
-- B/H