Hello, while trying to get some stats about a running job, I've realized that sstat consistently fails for one of the jobs with:
,----
| sstat: error: slurm_receive_msgs: [[----]:6818] failed: Socket timed out on send/recv operation
| sstat: error: slurm_job_step_stat: unknown return given from ----.ll.iac.es: 9001 rc = Communication connection failure
| sstat: error: problem getting step_layout for StepId=249974.batch: Communication connection failure
`----

Running "sstat" against the other running jobs is not a problem, though the time it takes to get the results varies a lot from one job to another.

Is there some timeout variable that I can modify to allow more time for the sstat command to finish?

Cheers,
-- 
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.: http://research.iac.es/proyecto/polmag/
GPG: 0x8BDC390B69033F52