Can you include the submit script? We recently found one simulation of ours that used sbatch with an srun call to the executable, submitted as an array job. The srun was unnecessary, and removing it resolved the random failures.
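For reference, the pattern looked roughly like the sketch below; the job name, executable, and input-file naming are placeholders, not the actual script:

    #!/bin/bash
    #SBATCH --job-name=example_array
    #SBATCH --array=1-100
    #SBATCH --ntasks=1

    # Original form: the executable was wrapped in srun inside each array task,
    # which was unnecessary and correlated with the random launch failures.
    # srun ./my_sim input_${SLURM_ARRAY_TASK_ID}.dat

    # Launching the executable directly (no srun) made the failures go away:
    ./my_sim input_${SLURM_ARRAY_TASK_ID}.dat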
Doug

On Tue, Apr 11, 2017 at 12:41 AM, Andrea del Monaco <andrea.delmon...@clustervision.com> wrote:

> Hello There,
>
> Some of the jobs crash without any apparent valid reason; the logs are the following:
>
> Controller:
> [2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
> [2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468 Node=cnode001 usec=60
> [2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
> [2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468 Node=cnode007 usec=25
> [2017-04-11T08:22:03+02:00] debug: sched: Running job scheduler
> [2017-04-11T08:22:03+02:00] debug2: found 92 usable nodes from config containing cnode[001-100]
> [2017-04-11T08:22:03+02:00] debug2: select_p_job_test for job 830332
> [2017-04-11T08:22:03+02:00] sched: Allocate JobId=830332 NodeList=cnode[001,007,022,030-033,041-044,047-048,052-054,058-061] #CPUs=320
> [2017-04-11T08:22:03+02:00] debug2: prolog_slurmctld job 830332 prolog completed
> [2017-04-11T08:22:03+02:00] error: Error opening file /cm/shared/apps/slurm/var/cm/statesave/job.830332/script, No such file or directory
> [2017-04-11T08:22:03+02:00] error: Error opening file /cm/shared/apps/slurm/var/cm/statesave/job.830332/environment, No such file or directory
> [2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 4005
> [2017-04-11T08:22:03+02:00] debug2: got 1 threads to send out
> [2017-04-11T08:22:03+02:00] debug2: Tree head got back 0 looking for 1
> [2017-04-11T08:22:03+02:00] debug2: Tree head got back 1
> [2017-04-11T08:22:03+02:00] debug2: Tree head got them all
> [2017-04-11T08:22:03+02:00] debug2: node_did_resp cnode001
> [2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=830332
> [2017-04-11T08:22:03+02:00] error: slurmd error running JobId=830332 on node(s)=cnode001: Slurmd could not create a batch directory or file
> [2017-04-11T08:22:03+02:00] update_node: node cnode001 reason set to: batch job complete failure
> [2017-04-11T08:22:03+02:00] update_node: node cnode001 state set to DRAINING
> [2017-04-11T08:22:03+02:00] completing job 830332
> [2017-04-11T08:22:03+02:00] Batch job launch failure, JobId=830332
> [2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 6011
> [2017-04-11T08:22:03+02:00] sched: job_complete for JobId=830332 successful
>
> Node:
> [2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
> [2017-04-11T08:22:03+02:00] debug: task_slurmd_batch_request: 830332
> [2017-04-11T08:22:03+02:00] debug: Calling /cm/shared/apps/slurm/2.5.7/sbin/slurmstepd spank prolog
> [2017-04-11T08:22:03+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf
> [2017-04-11T08:22:03+02:00] Running spank/prolog for jobid [830332] uid [40281]
> [2017-04-11T08:22:03+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf
> [2017-04-11T08:22:03+02:00] debug: [job 830332] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]
> [2017-04-11T08:22:03+02:00] Launching batch job 830332 for UID 40281
> [2017-04-11T08:22:03+02:00] debug level is 6.
> [2017-04-11T08:22:03+02:00] Job accounting gather LINUX plugin loaded
> [2017-04-11T08:22:03+02:00] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
> [2017-04-11T08:22:03+02:00] switch NONE plugin loaded
> [2017-04-11T08:22:03+02:00] Received cpu frequency information for 16 cpus
> [2017-04-11T08:22:03+02:00] setup for a batch_job
> [2017-04-11T08:22:03+02:00] [830332] _make_batch_script: called with NULL script
> [2017-04-11T08:22:03+02:00] [830332] batch script setup failed for job 830332.4294967294
> [2017-04-11T08:22:03+02:00] [830332] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4010
> [2017-04-11T08:22:03+02:00] [830332] auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [2017-04-11T08:22:03+02:00] [830332] _step_setup: no job returned
> [2017-04-11T08:22:03+02:00] [830332] done with job
> [2017-04-11T08:22:03+02:00] debug2: got this type of message 6011
> [2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2017-04-11T08:22:03+02:00] debug: _rpc_terminate_job, uid = 450
> [2017-04-11T08:22:03+02:00] debug: task_slurmd_release_resources: 830332
> [2017-04-11T08:22:03+02:00] debug: credential for job 830332 revoked
> [2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal 18
> [2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal 15
> [2017-04-11T08:22:03+02:00] debug2: set revoke expiration for jobid 830332 to 1491892923 UTS
> [2017-04-11T08:22:03+02:00] debug: Waiting for job 830332's prolog to complete
> [2017-04-11T08:22:03+02:00] debug: Finished wait for job 830332's prolog to complete
>
> I have already checked whether /cm/shared/apps/slurm/var/cm/statesave is accessible, and it is, from both the node and the master node.
>
> What I wonder is what triggers this behavior: is it that the master is not able to create the files, so the slurm daemon on the compute node fails, or is it the opposite?
>
> The issue happens randomly and it is not possible to reproduce it. The same kind of job can fail or can work; there is no pattern.
>
> I have increased the verbosity from 6 to 9 now, but I am not sure it will actually help.
>
> I have also checked the logs on the compute node to see if the NFS client had issues reaching the server, but the logs are clean.
>
> Note: for this job, for example, multiple nodes were allocated, but only cnode001 failed. The nodes are all running the same configuration. I have now undrained cnode001 and it works without problems; it is always like this when this happens.
>
> Any idea?
>
> Kind regards,
>
> Andrea Del Monaco
> Internal Engineer
> ClusterVision BV
> www.clustervision.com
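For what it's worth, the accessibility check described in the quoted message can be scripted roughly like this; the hostnames (master, cnode001) are examples from the thread, and it should be run as a user with write access to the directory (e.g. root or the SlurmUser):

    # Verify the shared statesave directory is reachable and writable from both
    # the controller and a compute node.
    STATESAVE=/cm/shared/apps/slurm/var/cm/statesave
    for h in master cnode001; do
        if ssh "$h" "touch $STATESAVE/write_test.$h && rm $STATESAVE/write_test.$h"; then
            echo "$h: statesave is reachable and writable"
        else
            echo "$h: statesave check FAILED"
        fi
    done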