Just FYI, I tried keeping the shared state on NFS once and it didn't work well. I switched to a native-client GlusterFS mount shared between the two controller nodes and haven't had a problem with it since.
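In case it helps, the setup is nothing fancy. Roughly like the sketch below (the hostnames, volume name and mount point are made up for illustration, not a recommendation; slurm.conf lines use the pre-18.08 ControlMachine/BackupController syntax that 17.02 expects):

    # /etc/fstab on both controllers - native GlusterFS client mount of a replicated volume
    gl-node1:/slurmstate   /gluster/statesave   glusterfs   defaults,_netdev   0 0

    # slurm.conf - both controllers read/write the same statesave directory
    ControlMachine=ctl1
    BackupController=ctl2
    StateSaveLocation=/gluster/statesave

The main point is that StateSaveLocation has to be the same reliable shared path on the primary and backup controller; for us that has been stable on GlusterFS where NFS was not.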
On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan <ronan.buck...@dell.com> wrote:
>
> Is there a way to diagnose if the I/O to the /cm/shared/apps/slurm/var/cm/statesave
> directory (used for job status) on the NFS storage is the cause of the socket errors?
>
> What values/thresholds from the nfsiostat command would signal the NFS storage
> as the bottleneck?
>
> From: Buckley, Ronan
> Sent: Tuesday, June 25, 2019 11:21 AM
> To: Slurm User Community List; slurm-users-boun...@lists.schedmd.com
> Subject: RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
> Hi,
>
> I can reproduce the problem by submitting a job array of 700+.
> The slurmctld log file is also regularly outputting:
>
> [2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> The max_rpc_cnt option is currently set to its default of zero.
>
> Rgds
>
> Ronan
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcelo Garcia
> Sent: Tuesday, June 25, 2019 10:35 AM
> To: Slurm User Community List
> Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
> Hi,
>
> This looks like a problem we discussed a few days ago:
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
>
> But in that thread I think we were using Slurm with workflow managers. It's
> interesting that you hit the problem after adding the second server and moving
> to the NFS share. Do you see the problem randomly, or does it always happen on
> your jobs?
>
> I tried to get an idea of how many RPCs would be OK, but got no reply:
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
>
> My take is that there is no single answer to that question; each site is different.
>
> Best Regards
>
> mg.
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Buckley, Ronan
> Sent: Tuesday, 25 June 2019 11:17
> To: 'slurm-users@lists.schedmd.com' <slurm-users@lists.schedmd.com>; slurm-users-boun...@lists.schedmd.com
> Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
> Hi,
>
> Since configuring a backup slurm controller (including moving the
> StateSaveLocation from a local disk to an NFS share), we are seeing these
> errors in the slurmctld logs on a regular basis:
>
> Socket timed out on send/recv operation
>
> It sometimes occurs when a job array is started, and squeue will display the error:
>
> slurm_load_jobs error: Socket timed out on send/recv operation
>
> We also see the following errors:
>
> slurm_load_jobs error: Zero Bytes were transmitted or received
> srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
>
> sdiag output is below. Does it show an abnormal number of RPC calls by the
> users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?
>
> Server thread count: 3
> Agent queue size:    0
>
> Jobs submitted: 14279
> Jobs started:   7709
> Jobs completed: 7001
> Jobs canceled:  38
> Jobs failed:    0
>
> Main schedule statistics (microseconds):
>         Last cycle:        788
>         Max cycle:         461780
>         Total cycles:      3319
>         Mean cycle:        7589
>         Mean depth cycle:  3
>         Cycles per minute: 4
>         Last queue length: 13
>
> Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
>         Total backfilled jobs (since last slurm start): 3204
>         Total backfilled jobs (since last stats cycle start): 3160
>         Total cycles:     436
>         Last cycle when:  Mon Jun 24 15:32:31 2019
>         Last cycle:       253698
>         Max cycle:        12701861
>         Mean cycle:       338674
>         Last depth cycle: 3
>         Last depth cycle (try sched): 3
>         Depth Mean:       15
>         Depth Mean (try depth): 15
>         Last queue length: 13
>         Queue length mean: 3
>
> Remote Procedure Call statistics by message type
>         REQUEST_PARTITION_INFO           ( 2009) count:468871  ave_time:2188    total_time:1026211593
>         REQUEST_NODE_INFO_SINGLE         ( 2040) count:421773  ave_time:1775    total_time:748837928
>         REQUEST_JOB_INFO                 ( 2003) count:46877   ave_time:696     total_time:32627442
>         REQUEST_NODE_INFO                ( 2007) count:43575   ave_time:1269    total_time:55301255
>         REQUEST_JOB_STEP_INFO            ( 2005) count:38703   ave_time:201     total_time:7805655
>         MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155   ave_time:758     total_time:22118507
>         REQUEST_JOB_USER_INFO            ( 2039) count:22401   ave_time:391     total_time:8763503
>         MESSAGE_EPILOG_COMPLETE          ( 6012) count:7484    ave_time:6164    total_time:46132632
>         REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:7064    ave_time:79129   total_time:558971262
>         REQUEST_PING                     ( 1008) count:3561    ave_time:141     total_time:502289
>         REQUEST_STATS_INFO               ( 2035) count:3236    ave_time:568     total_time:1838784
>         REQUEST_BUILD_INFO               ( 2001) count:2598    ave_time:7869    total_time:20445066
>         REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:581     ave_time:132730  total_time:77116427
>         REQUEST_STEP_COMPLETE            ( 5016) count:408     ave_time:4373    total_time:1784564
>         REQUEST_JOB_STEP_CREATE          ( 5001) count:326     ave_time:14832   total_time:4835389
>         REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302     ave_time:15754   total_time:4757813
>         REQUEST_JOB_READY                ( 4019) count:78      ave_time:1615    total_time:125980
>         REQUEST_JOB_INFO_SINGLE          ( 2021) count:48      ave_time:7851    total_time:376856
>         REQUEST_KILL_JOB                 ( 5032) count:38      ave_time:245     total_time:9346
>         REQUEST_RESOURCE_ALLOCATION      ( 4001) count:28      ave_time:12730   total_time:356466
>         REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:28      ave_time:20504   total_time:574137
>         REQUEST_CANCEL_JOB_STEP          ( 5005) count:7       ave_time:43665   total_time:305661
>
> Remote Procedure Call statistics by user
>         xxxxx ( 0)     count:979383  ave_time:2500   total_time:2449350389
>         xxxxx ( 11160) count:116109  ave_time:695    total_time:80710478
>         xxxxx ( 11427) count:1264    ave_time:67572  total_time:85411027
>         xxxxx ( 11426) count:149     ave_time:7361   total_time:1096874
>         xxxxx ( 12818) count:136     ave_time:11354  total_time:1544190
>         xxxxx ( 12475) count:37      ave_time:4985   total_time:184452
>         xxxxx ( 12487) count:36      ave_time:30318  total_time:1091483
>         xxxxx ( 11147) count:12      ave_time:33489  total_time:401874
>         xxxxx ( 11345) count:6       ave_time:584    total_time:3508
>         xxxxx ( 12876) count:6       ave_time:483    total_time:2900
>         xxxxx ( 11457) count:4       ave_time:345    total_time:1380
>
> Any suggestions/tips are helpful.
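On the sdiag numbers above: REQUEST_PARTITION_INFO (~469k) and REQUEST_NODE_INFO_SINGLE (~422k) dwarf everything else, which usually means something (users, a monitoring tool, a portal) is polling sinfo/squeue in a tight loop, and that plus slow statesave I/O can tie up slurmctld threads. Given the "consider configuring max_rpc_cnt" messages, a non-zero value seems worth trying. A minimal sketch, untested on your cluster and with the number picked only as a starting point, not a tuned value:

    # slurm.conf - add max_rpc_cnt to whatever SchedulerParameters you already have
    SchedulerParameters=max_rpc_cnt=150

    # apply without a restart, then clear the counters before re-running the
    # 700+ task array so the next sdiag reflects only the new test
    scontrol reconfigure
    sdiag --reset

As I understand it, with max_rpc_cnt non-zero slurmctld defers scheduling work while at least that many RPCs are active, which gives the RPC threads a chance to drain instead of letting squeue/srun hit the socket timeout.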
>
> Rgds
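On the earlier nfsiostat question: I don't know of a published threshold either, and I suspect there isn't one, but you can watch the mount that holds the statesave directory while you reproduce the problem with the 700+ job array. A rough sketch (I'm assuming /cm/shared is the NFS mount point here; adjust if it is mounted elsewhere):

    # sample per-mount NFS stats every 5 seconds while the array is being scheduled
    nfsiostat 5 /cm/shared

    # things to watch, rather than a hard threshold:
    #   avg RTT (ms) - time the server/network takes per operation
    #   avg exe (ms) - total time an op spends in the NFS client, queueing included
    #   retrans      - should stay at or very near zero
    # if avg exe grows far beyond avg RTT (ops queueing in the client) or retrans
    # climbs exactly when slurmctld logs the socket timeouts, the NFS statesave
    # path is a strong suspect

Correlating a spike there with the timestamps of the "Socket timed out" messages would be more telling than any single number.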