Have you checked the slurmd/Prolog logs? It looks like your job was eligible to run, but it failed to start on the compute node. If it failed in the Prolog, you can have the job requeued without being held by setting SchedulerParameters=nohold_on_prolog_fail.
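A rough sketch of what I mean (the log path and job id are just examples; check SlurmdLogFile in your slurm.conf for the actual log location on your nodes):

```shell
# On one of the allocated nodes (e.g. rhinonode07), look for Prolog
# failures around the time the job was requeued:
grep -i prolog /var/log/slurmd.log   # log path is an assumption

# If the Prolog is failing, you can stop Slurm from holding the job
# by adding this line to slurm.conf on the controller:
#   SchedulerParameters=nohold_on_prolog_fail
# and then re-reading the configuration:
scontrol reconfigure

# After that, try releasing the held job again:
scontrol release 1938
```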
cheers,
Marcin

2017-04-03 21:31 GMT+02:00 Chris Woelkers - NOAA Affiliate <chris.woelk...@noaa.gov>:

> I am running a small HPC, only 24 nodes, via slurm and am having an
> issue where one of the users is unable to submit any jobs.
> The user is new and whenever a job is submitted it shows the "job
> requeued in held state" state and is never actually run. We have left
> the job sitting for over three days and it does not start. We have
> tried releasing the job and it does not start. Here are the log
> entries after an attempted release:
>
> [2017-04-03T19:16:24.173] sched: update_job: releasing hold for job_id
> 1938 uid 0
> [2017-04-03T19:16:24.174] _slurm_rpc_update_job complete JobId=1938
> uid=0 usec=375
> [2017-04-03T19:16:24.919] sched: Allocate JobId=1938
> NodeList=rhinonode[07-14] #CPUs=192
> [2017-04-03T19:16:25.017] _slurm_rpc_requeue: Processing RPC:
> REQUEST_JOB_REQUEUE from uid=0
> [2017-04-03T19:16:25.035] Requeuing JobID=1938 State=0x0 NodeCnt=0
>
> The user has the same permissions as the older users that can run jobs.
> The script that is being run is a simple test script and no matter
> where the output is redirected, an NFS mount (for our SAN), the local
> home directory, or the tmp directory, the result is the same.
>
> Any idea as to what might be happening?
>
> Thanks,
>
> Chris Woelkers
> Caelum Research Corp.
> Linux Server and Network Administrator
> NOAA GLERL