Hi,

Since configuring a backup Slurm controller (including moving the StateSaveLocation from a local disk to an NFS share), we have been seeing these errors in the slurmctld logs on a regular basis:

    Socket timed out on send/recv operation

It sometimes occurs when a job array is started and squeue will display the error:

    slurm_load_jobs error: Socket timed out on send/recv operation

We also see the following errors:

    slurm_load_jobs error: Zero Bytes were transmitted or received
    srun: error: Unable to allocate resources: Zero Bytes were transmitted or received

sdiag output is below. Does it show an abnormal number of RPC calls by the users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?

Server thread count: 3
Agent queue size: 0

Jobs submitted: 14279
Jobs started: 7709
Jobs completed: 7001
Jobs canceled: 38
Jobs failed: 0

Main schedule statistics (microseconds):
    Last cycle: 788
    Max cycle: 461780
    Total cycles: 3319
    Mean cycle: 7589
    Mean depth cycle: 3
    Cycles per minute: 4
    Last queue length: 13

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
    Total backfilled jobs (since last slurm start): 3204
    Total backfilled jobs (since last stats cycle start): 3160
    Total cycles: 436
    Last cycle when: Mon Jun 24 15:32:31 2019
    Last cycle: 253698
    Max cycle: 12701861
    Mean cycle: 338674
    Last depth cycle: 3
    Last depth cycle (try sched): 3
    Depth Mean: 15
    Depth Mean (try depth): 15
    Last queue length: 13
    Queue length mean: 3

Remote Procedure Call statistics by message type
    REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
    REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
    REQUEST_JOB_INFO ( 2003) count:46877 ave_time:696 total_time:32627442
    REQUEST_NODE_INFO ( 2007) count:43575 ave_time:1269 total_time:55301255
    REQUEST_JOB_STEP_INFO ( 2005) count:38703 ave_time:201 total_time:7805655
    MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155 ave_time:758 total_time:22118507
    REQUEST_JOB_USER_INFO ( 2039) count:22401 ave_time:391 total_time:8763503
    MESSAGE_EPILOG_COMPLETE ( 6012) count:7484 ave_time:6164 total_time:46132632
    REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
    REQUEST_PING ( 1008) count:3561 ave_time:141 total_time:502289
    REQUEST_STATS_INFO ( 2035) count:3236 ave_time:568 total_time:1838784
    REQUEST_BUILD_INFO ( 2001) count:2598 ave_time:7869 total_time:20445066
    REQUEST_SUBMIT_BATCH_JOB ( 4003) count:581 ave_time:132730 total_time:77116427
    REQUEST_STEP_COMPLETE ( 5016) count:408 ave_time:4373 total_time:1784564
    REQUEST_JOB_STEP_CREATE ( 5001) count:326 ave_time:14832 total_time:4835389
    REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302 ave_time:15754 total_time:4757813
    REQUEST_JOB_READY ( 4019) count:78 ave_time:1615 total_time:125980
    REQUEST_JOB_INFO_SINGLE ( 2021) count:48 ave_time:7851 total_time:376856
    REQUEST_KILL_JOB ( 5032) count:38 ave_time:245 total_time:9346
    REQUEST_RESOURCE_ALLOCATION ( 4001) count:28 ave_time:12730 total_time:356466
    REQUEST_COMPLETE_JOB_ALLOCATION ( 5017) count:28 ave_time:20504 total_time:574137
    REQUEST_CANCEL_JOB_STEP ( 5005) count:7 ave_time:43665 total_time:305661

Remote Procedure Call statistics by user
    xxxxx ( 0) count:979383 ave_time:2500 total_time:2449350389
    xxxxx ( 11160) count:116109 ave_time:695 total_time:80710478
    xxxxx ( 11427) count:1264 ave_time:67572 total_time:85411027
    xxxxx ( 11426) count:149 ave_time:7361 total_time:1096874
    xxxxx ( 12818) count:136 ave_time:11354 total_time:1544190
    xxxxx ( 12475) count:37 ave_time:4985 total_time:184452
    xxxxx ( 12487) count:36 ave_time:30318 total_time:1091483
    xxxxx ( 11147) count:12 ave_time:33489 total_time:401874
    xxxxx ( 11345) count:6 ave_time:584 total_time:3508
    xxxxx ( 12876) count:6 ave_time:483 total_time:2900
    xxxxx ( 11457) count:4 ave_time:345 total_time:1380

Any suggestions/tips are helpful.

Rgds
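
P.S. In case it makes the RPC numbers easier to compare, here is a rough Python sketch that pulls the `count`, `ave_time` and `total_time` fields out of pasted sdiag RPC lines and ranks them by total time. It is not part of Slurm: the names `parse_rpc_stats` and `rank_by_total_time` are made up for illustration, and the regular expression simply assumes the line layout seen in the output above (which may differ between Slurm versions). The sample text is three lines copied from that output.

```python
import re

# Assumed layout of one sdiag RPC line, e.g.
#   REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
RPC_LINE = re.compile(
    r"(?P<name>\S+)\s*\(\s*(?P<id>\d+)\)\s*"
    r"count:(?P<count>\d+)\s+ave_time:(?P<ave_time>\d+)\s+total_time:(?P<total_time>\d+)"
)


def parse_rpc_stats(text):
    """Return one dict per RPC line found in the pasted sdiag text."""
    rows = []
    for m in RPC_LINE.finditer(text):
        rows.append({
            "name": m.group("name"),
            "id": int(m.group("id")),
            "count": int(m.group("count")),
            "ave_time": int(m.group("ave_time")),      # microseconds
            "total_time": int(m.group("total_time")),  # microseconds
        })
    return rows


def rank_by_total_time(rows):
    """Sort parsed rows so the most time-consuming RPC comes first."""
    return sorted(rows, key=lambda r: r["total_time"], reverse=True)


# Three lines copied from the sdiag output in this post.
SAMPLE = """
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:7064 ave_time:79129 total_time:558971262
REQUEST_PARTITION_INFO ( 2009) count:468871 ave_time:2188 total_time:1026211593
REQUEST_NODE_INFO_SINGLE ( 2040) count:421773 ave_time:1775 total_time:748837928
"""

if __name__ == "__main__":
    for row in rank_by_total_time(parse_rpc_stats(SAMPLE)):
        print(f'{row["name"]:<32} count={row["count"]:>8} '
              f'ave={row["ave_time"]:>7}us total={row["total_time"] / 1e6:>9.1f}s')
```

Sorting by `total_time` rather than `count` is deliberate: an RPC that is called rarely but is slow on average (such as REQUEST_COMPLETE_BATCH_SCRIPT here) would otherwise be hidden behind the high-volume info requests.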