Hi,

I can reproduce the problem by submitting a job array of 700+ tasks. The slurmctld log file is also regularly reporting:

[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

max_rpc_cnt is currently set to its default of zero.

Rgds
Ronan
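For reference, max_rpc_cnt is one of the comma-separated options of SchedulerParameters in slurm.conf, and the default of zero disables the deferral behaviour entirely. A minimal sketch of what enabling it looks like (the value 150 is purely illustrative, not a recommendation for this site, and it must be appended to any SchedulerParameters options already in use):

    # slurm.conf -- illustrative only; choose a threshold appropriate for the site.
    # When at least this many RPCs are pending, the main and backfill schedulers
    # defer new scheduling cycles so slurmctld can work through its RPC backlog.
    SchedulerParameters=max_rpc_cnt=150

A SchedulerParameters change like this should be picked up with "scontrol reconfigure", without restarting slurmctld.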
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcelo Garcia
Sent: Tuesday, June 25, 2019 10:35 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Hi,

This looks like the problem we discussed a few days ago:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html

In that thread, though, I think we were using Slurm with workflow managers. It's interesting that you hit the problem after adding the second server and switching to the NFS share. Does it happen randomly, or with every job?

I tried to get an idea of how many RPCs would be acceptable, but got no reply:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html

My take is that there is no single answer to that question; each site is different.

Best Regards

mg.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Tuesday, 25 June 2019 11:17
To: 'slurm-users@lists.schedmd.com' <slurm-users@lists.schedmd.com>; slurm-users-boun...@lists.schedmd.com
Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Hi,

Since configuring a backup Slurm controller (which included moving the StateSaveLocation from a local disk to an NFS share), we are regularly seeing this error in the slurmctld logs:

Socket timed out on send/recv operation

It sometimes occurs when a job array is started, and squeue then displays:

slurm_load_jobs error: Socket timed out on send/recv operation

We also see the following errors:

slurm_load_jobs error: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received

sdiag output is below. Does it show an abnormal number of RPC calls by the users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?

Server thread count: 3
Agent queue size:    0

Jobs submitted: 14279
Jobs started:   7709
Jobs completed: 7001
Jobs canceled:  38
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:        788
        Max cycle:         461780
        Total cycles:      3319
        Mean cycle:        7589
        Mean depth cycle:  3
        Cycles per minute: 4
        Last queue length: 13

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 3204
        Total backfilled jobs (since last stats cycle start): 3160
        Total cycles:     436
        Last cycle when:  Mon Jun 24 15:32:31 2019
        Last cycle:       253698
        Max cycle:        12701861
        Mean cycle:       338674
        Last depth cycle: 3
        Last depth cycle (try sched): 3
        Depth Mean: 15
        Depth Mean (try depth): 15
        Last queue length: 13
        Queue length mean: 3

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO           ( 2009) count:468871 ave_time:2188   total_time:1026211593
        REQUEST_NODE_INFO_SINGLE         ( 2040) count:421773 ave_time:1775   total_time:748837928
        REQUEST_JOB_INFO                 ( 2003) count:46877  ave_time:696    total_time:32627442
        REQUEST_NODE_INFO                ( 2007) count:43575  ave_time:1269   total_time:55301255
        REQUEST_JOB_STEP_INFO            ( 2005) count:38703  ave_time:201    total_time:7805655
        MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155  ave_time:758    total_time:22118507
        REQUEST_JOB_USER_INFO            ( 2039) count:22401  ave_time:391    total_time:8763503
        MESSAGE_EPILOG_COMPLETE          ( 6012) count:7484   ave_time:6164   total_time:46132632
        REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:7064   ave_time:79129  total_time:558971262
        REQUEST_PING                     ( 1008) count:3561   ave_time:141    total_time:502289
        REQUEST_STATS_INFO               ( 2035) count:3236   ave_time:568    total_time:1838784
        REQUEST_BUILD_INFO               ( 2001) count:2598   ave_time:7869   total_time:20445066
        REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:581    ave_time:132730 total_time:77116427
        REQUEST_STEP_COMPLETE            ( 5016) count:408    ave_time:4373   total_time:1784564
        REQUEST_JOB_STEP_CREATE          ( 5001) count:326    ave_time:14832  total_time:4835389
        REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302    ave_time:15754  total_time:4757813
        REQUEST_JOB_READY                ( 4019) count:78     ave_time:1615   total_time:125980
        REQUEST_JOB_INFO_SINGLE          ( 2021) count:48     ave_time:7851   total_time:376856
        REQUEST_KILL_JOB                 ( 5032) count:38     ave_time:245    total_time:9346
        REQUEST_RESOURCE_ALLOCATION      ( 4001) count:28     ave_time:12730  total_time:356466
        REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:28     ave_time:20504  total_time:574137
        REQUEST_CANCEL_JOB_STEP          ( 5005) count:7      ave_time:43665  total_time:305661

Remote Procedure Call statistics by user
        xxxxx ( 0)     count:979383 ave_time:2500   total_time:2449350389
        xxxxx ( 11160) count:116109 ave_time:695    total_time:80710478
        xxxxx ( 11427) count:1264   ave_time:67572  total_time:85411027
        xxxxx ( 11426) count:149    ave_time:7361   total_time:1096874
        xxxxx ( 12818) count:136    ave_time:11354  total_time:1544190
        xxxxx ( 12475) count:37     ave_time:4985   total_time:184452
        xxxxx ( 12487) count:36     ave_time:30318  total_time:1091483
        xxxxx ( 11147) count:12     ave_time:33489  total_time:401874
        xxxxx ( 11345) count:6      ave_time:584    total_time:3508
        xxxxx ( 12876) count:6      ave_time:483    total_time:2900
        xxxxx ( 11457) count:4      ave_time:345    total_time:1380

Any suggestions/tips are helpful.

Rgds
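A quick way to rank the RPC counts from output like the above (a sketch that assumes the sdiag section layout shown here; the exact formatting can differ between Slurm versions):

    # List RPC message types sorted by call count, busiest first
    sdiag | awk '/statistics by message type/,/statistics by user/' \
          | grep 'count:' \
          | sort -t: -k2,2 -rn

With the numbers above, this puts REQUEST_PARTITION_INFO (roughly 469k calls) and REQUEST_NODE_INFO_SINGLE (roughly 422k calls) far ahead of everything else, which at least shows where most of the RPC traffic is coming from.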