Just FYI, I tried keeping the shared state on NFS once and it didn't work well. I switched to a native-client GlusterFS mount shared between the two controller nodes and haven't had a problem with it since.
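In case it helps, the setup is nothing fancy. Roughly like the sketch below (the hostnames, volume name and mount point are made up for illustration, not a recommendation; slurm.conf lines use the pre-18.08 ControlMachine/BackupController syntax that 17.02 expects):

    # /etc/fstab on both controllers - native GlusterFS client mount of a replicated volume
    gl-node1:/slurmstate   /gluster/statesave   glusterfs   defaults,_netdev   0 0

    # slurm.conf - both controllers read/write the same statesave directory
    ControlMachine=ctl1
    BackupController=ctl2
    StateSaveLocation=/gluster/statesave

The main point is that StateSaveLocation has to be the same reliable shared path on the primary and backup controller; for us that has been stable on GlusterFS where NFS was not.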
On Tue, Jun 25, 2019 at 6:32 AM Buckley, Ronan <ronan.buck...@dell.com> wrote:
>
> Is there a way to diagnose if the I/O to the /cm/shared/apps/slurm/var/cm/statesave
> directory (used for job status) on the NFS storage is the cause of the socket errors?
>
> What values/thresholds from the nfsiostat command would signal the NFS storage
> as the bottleneck?
>
> From: Buckley, Ronan
> Sent: Tuesday, June 25, 2019 11:21 AM
> To: Slurm User Community List; slurm-users-boun...@lists.schedmd.com
> Subject: RE: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
> Hi,
>
> I can reproduce the problem by submitting a job array of 700+.
> The slurmctld log file is also regularly outputting:
>
> [2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
> [2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
>
> The max_rpc_cnt option is currently set to its default of zero.
>
> Rgds
>
> Ronan
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcelo Garcia
> Sent: Tuesday, June 25, 2019 10:35 AM
> To: Slurm User Community List
> Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
> Hi,
>
> This looks like a problem we discussed a few days ago:
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
>
> But in that thread I think we were using Slurm with workflow managers. It's
> interesting that you hit the problem after adding the second server and moving
> to the NFS share. Do you see the problem randomly, or does it always happen on
> your jobs?
>
> I tried to get an idea of how many RPCs would be OK, but got no reply:
> https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
>
> My take is that there is no single answer to that question; each site is different.
>
> Best Regards
>
> mg.
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Buckley, Ronan
> Sent: Tuesday, 25 June 2019 11:17
> To: 'slurm-users@lists.schedmd.com' <slurm-users@lists.schedmd.com>; slurm-users-boun...@lists.schedmd.com
> Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2
>
> Hi,
>
> Since configuring a backup slurm controller (including moving the
> StateSaveLocation from a local disk to an NFS share), we are seeing these
> errors in the slurmctld logs on a regular basis:
>
> Socket timed out on send/recv operation
>
> It sometimes occurs when a job array is started, and squeue will display the error:
>
> slurm_load_jobs error: Socket timed out on send/recv operation
>
> We also see the following errors:
>
> slurm_load_jobs error: Zero Bytes were transmitted or received
> srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
>
> sdiag output is below. Does it show an abnormal number of RPC calls by the
> users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?
>
> Server thread count: 3
> Agent queue size:    0
>
> Jobs submitted: 14279
> Jobs started:   7709
> Jobs completed: 7001
> Jobs canceled:  38
> Jobs failed:    0
>
> Main schedule statistics (microseconds):
>         Last cycle:        788
>         Max cycle:         461780
>         Total cycles:      3319
>         Mean cycle:        7589
>         Mean depth cycle:  3
>         Cycles per minute: 4
>         Last queue length: 13
>
> Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
>         Total backfilled jobs (since last slurm start): 3204
>         Total backfilled jobs (since last stats cycle start): 3160
>         Total cycles:     436
>         Last cycle when:  Mon Jun 24 15:32:31 2019
>         Last cycle:       253698
>         Max cycle:        12701861
>         Mean cycle:       338674
>         Last depth cycle: 3
>         Last depth cycle (try sched): 3
>         Depth Mean:       15
>         Depth Mean (try depth): 15
>         Last queue length: 13
>         Queue length mean: 3
>
> Remote Procedure Call statistics by message type
>         REQUEST_PARTITION_INFO           ( 2009) count:468871  ave_time:2188    total_time:1026211593
>         REQUEST_NODE_INFO_SINGLE         ( 2040) count:421773  ave_time:1775    total_time:748837928
>         REQUEST_JOB_INFO                 ( 2003) count:46877   ave_time:696     total_time:32627442
>         REQUEST_NODE_INFO                ( 2007) count:43575   ave_time:1269    total_time:55301255
>         REQUEST_JOB_STEP_INFO            ( 2005) count:38703   ave_time:201     total_time:7805655
>         MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:29155   ave_time:758     total_time:22118507
>         REQUEST_JOB_USER_INFO            ( 2039) count:22401   ave_time:391     total_time:8763503
>         MESSAGE_EPILOG_COMPLETE          ( 6012) count:7484    ave_time:6164    total_time:46132632
>         REQUEST_COMPLETE_BATCH_SCRIPT    ( 5018) count:7064    ave_time:79129   total_time:558971262
>         REQUEST_PING                     ( 1008) count:3561    ave_time:141     total_time:502289
>         REQUEST_STATS_INFO               ( 2035) count:3236    ave_time:568     total_time:1838784
>         REQUEST_BUILD_INFO               ( 2001) count:2598    ave_time:7869    total_time:20445066
>         REQUEST_SUBMIT_BATCH_JOB         ( 4003) count:581     ave_time:132730  total_time:77116427
>         REQUEST_STEP_COMPLETE            ( 5016) count:408     ave_time:4373    total_time:1784564
>         REQUEST_JOB_STEP_CREATE          ( 5001) count:326     ave_time:14832   total_time:4835389
>         REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:302     ave_time:15754   total_time:4757813
>         REQUEST_JOB_READY                ( 4019) count:78      ave_time:1615    total_time:125980
>         REQUEST_JOB_INFO_SINGLE          ( 2021) count:48      ave_time:7851    total_time:376856
>         REQUEST_KILL_JOB                 ( 5032) count:38      ave_time:245     total_time:9346
>         REQUEST_RESOURCE_ALLOCATION      ( 4001) count:28      ave_time:12730   total_time:356466
>         REQUEST_COMPLETE_JOB_ALLOCATION  ( 5017) count:28      ave_time:20504   total_time:574137
>         REQUEST_CANCEL_JOB_STEP          ( 5005) count:7       ave_time:43665   total_time:305661
>
> Remote Procedure Call statistics by user
>         xxxxx ( 0)     count:979383  ave_time:2500   total_time:2449350389
>         xxxxx ( 11160) count:116109  ave_time:695    total_time:80710478
>         xxxxx ( 11427) count:1264    ave_time:67572  total_time:85411027
>         xxxxx ( 11426) count:149     ave_time:7361   total_time:1096874
>         xxxxx ( 12818) count:136     ave_time:11354  total_time:1544190
>         xxxxx ( 12475) count:37      ave_time:4985   total_time:184452
>         xxxxx ( 12487) count:36      ave_time:30318  total_time:1091483
>         xxxxx ( 11147) count:12      ave_time:33489  total_time:401874
>         xxxxx ( 11345) count:6       ave_time:584    total_time:3508
>         xxxxx ( 12876) count:6       ave_time:483    total_time:2900
>         xxxxx ( 11457) count:4       ave_time:345    total_time:1380
>
> Any suggestions/tips are helpful.
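On the sdiag numbers above: REQUEST_PARTITION_INFO (~469k) and REQUEST_NODE_INFO_SINGLE (~422k) dwarf everything else, which usually means something (users, a monitoring tool, a portal) is polling sinfo/squeue in a tight loop, and that plus slow statesave I/O can tie up slurmctld threads. Given the "consider configuring max_rpc_cnt" messages, a non-zero value seems worth trying. A minimal sketch, untested on your cluster and with the number picked only as a starting point, not a tuned value:

    # slurm.conf - add max_rpc_cnt to whatever SchedulerParameters you already have
    SchedulerParameters=max_rpc_cnt=150

    # apply without a restart, then clear the counters before re-running the
    # 700+ task array so the next sdiag reflects only the new test
    scontrol reconfigure
    sdiag --reset

As I understand it, with max_rpc_cnt non-zero slurmctld defers scheduling work while at least that many RPCs are active, which gives the RPC threads a chance to drain instead of letting squeue/srun hit the socket timeout.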
>
> Rgds
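On the earlier nfsiostat question: I don't know of a published threshold either, and I suspect there isn't one, but you can watch the mount that holds the statesave directory while you reproduce the problem with the 700+ job array. A rough sketch (I'm assuming /cm/shared is the NFS mount point here; adjust if it is mounted elsewhere):

    # sample per-mount NFS stats every 5 seconds while the array is being scheduled
    nfsiostat 5 /cm/shared

    # things to watch, rather than a hard threshold:
    #   avg RTT (ms) - time the server/network takes per operation
    #   avg exe (ms) - total time an op spends in the NFS client, queueing included
    #   retrans      - should stay at or very near zero
    # if avg exe grows far beyond avg RTT (ops queueing in the client) or retrans
    # climbs exactly when slurmctld logs the socket timeouts, the NFS statesave
    # path is a strong suspect

Correlating a spike there with the timestamps of the "Socket timed out" messages would be more telling than any single number.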