Hi,

I can reproduce the problem by submitting a job array of 700+ tasks.
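
For reference, the reproducer is roughly this (the payload is only illustrative; any array of 700+ tasks seems to do it):

    sbatch --array=1-700 --wrap="sleep 120"
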
The slurmctld log file is also regularly outputting:

[2019-06-25T11:35:31.159] sched: 157 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:35:43.007] sched: 193 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:36:56.517] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:29.620] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:37:45.429] sched: 161 pending RPCs at cycle end, consider configuring max_rpc_cnt
[2019-06-25T11:38:00.472] backfill: 256 pending RPCs at cycle end, consider configuring max_rpc_cnt

The max_rpc_cnt is currently set to its default of zero.
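
If we do end up setting it, my understanding is that it belongs in SchedulerParameters in slurm.conf and can be applied with a reconfigure, roughly like this (the value 150 is only a placeholder, not a recommendation):

    # slurm.conf -- append to any existing SchedulerParameters options, comma-separated
    SchedulerParameters=max_rpc_cnt=150

    # then apply it on the controller
    scontrol reconfigure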

Rgds

Ronan

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Marcelo Garcia
Sent: Tuesday, June 25, 2019 10:35 AM
To: Slurm User Community List
Subject: Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2


Hi

It seems like a problem we discussed a few days ago:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html
But in that thread I think we were using Slurm with workflow managers. It's interesting that you see the problem after adding the second server and with the NFS share. Do you see the problem randomly, or does it always happen on your jobs?

I tried to get an idea of how many RPCs would be OK, but I got no reply:
https://lists.schedmd.com/pipermail/slurm-users/2019-June/003534.html
My take is that there is no single answer to the question; each site is different.
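
One rough way to measure it at a given site is to reset sdiag's counters and sample again after a known interval (resetting needs admin rights), e.g.:

    sdiag --reset    # zero the RPC counters
    sleep 600        # let ten minutes of traffic accumulate
    sdiag            # counts now reflect a 10-minute sample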

Best Regards

mg.

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Buckley, Ronan
Sent: Tuesday, 25 June 2019 11:17
To: 'slurm-users@lists.schedmd.com' <slurm-users@lists.schedmd.com>; slurm-users-boun...@lists.schedmd.com
Subject: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

Hi,

Since configuring a backup slurm controller (including moving the StateSaveLocation from a local disk to an NFS share), we have been seeing these errors in the slurmctld logs on a regular basis:

Socket timed out on send/recv operation

It sometimes occurs when a job array is started, and squeue will display the error:

slurm_load_jobs error: Socket timed out on send/recv operation

We also see the following errors:

slurm_load_jobs error: Zero Bytes were transmitted or received
srun: error: Unable to allocate resources: Zero Bytes were transmitted or received
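
For anyone following along, the failover pieces involved are roughly these slurm.conf parameters (17.02-era names; the hostnames and path are placeholders, not our real values). If I understand correctly, MessageTimeout (default 10 seconds) is the timeout behind the send/recv error above:

    ControlMachine=ctld1
    BackupController=ctld2
    StateSaveLocation=/nfs/slurm/state   # must be reachable by both controllers, hence the NFS move
    # MessageTimeout=10                  # default round-trip message timeout, in seconds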

sdiag output is below. Does it show an abnormal number of RPC calls by the users? Are the REQUEST_JOB_INFO and REQUEST_NODE_INFO counts very high?

Server thread count: 3
Agent queue size:    0

Jobs submitted: 14279
Jobs started:   7709
Jobs completed: 7001
Jobs canceled:  38
Jobs failed:    0

Main schedule statistics (microseconds):
        Last cycle:   788
        Max cycle:    461780
        Total cycles: 3319
        Mean cycle:   7589
        Mean depth cycle:  3
        Cycles per minute: 4
        Last queue length: 13

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
        Total backfilled jobs (since last slurm start): 3204
        Total backfilled jobs (since last stats cycle start): 3160
        Total cycles: 436
        Last cycle when: Mon Jun 24 15:32:31 2019
        Last cycle: 253698
        Max cycle:  12701861
        Mean cycle: 338674
        Last depth cycle: 3
        Last depth cycle (try sched): 3
        Depth Mean: 15
        Depth Mean (try depth): 15
        Last queue length: 13
        Queue length mean: 3

Remote Procedure Call statistics by message type
        REQUEST_PARTITION_INFO                  ( 2009) count:468871 ave_time:2188   total_time:1026211593
        REQUEST_NODE_INFO_SINGLE                ( 2040) count:421773 ave_time:1775   total_time:748837928
        REQUEST_JOB_INFO                        ( 2003) count:46877  ave_time:696    total_time:32627442
        REQUEST_NODE_INFO                       ( 2007) count:43575  ave_time:1269   total_time:55301255
        REQUEST_JOB_STEP_INFO                   ( 2005) count:38703  ave_time:201    total_time:7805655
        MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:29155  ave_time:758    total_time:22118507
        REQUEST_JOB_USER_INFO                   ( 2039) count:22401  ave_time:391    total_time:8763503
        MESSAGE_EPILOG_COMPLETE                 ( 6012) count:7484   ave_time:6164   total_time:46132632
        REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:7064   ave_time:79129  total_time:558971262
        REQUEST_PING                            ( 1008) count:3561   ave_time:141    total_time:502289
        REQUEST_STATS_INFO                      ( 2035) count:3236   ave_time:568    total_time:1838784
        REQUEST_BUILD_INFO                      ( 2001) count:2598   ave_time:7869   total_time:20445066
        REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:581    ave_time:132730 total_time:77116427
        REQUEST_STEP_COMPLETE                   ( 5016) count:408    ave_time:4373   total_time:1784564
        REQUEST_JOB_STEP_CREATE                 ( 5001) count:326    ave_time:14832  total_time:4835389
        REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:302    ave_time:15754  total_time:4757813
        REQUEST_JOB_READY                       ( 4019) count:78     ave_time:1615   total_time:125980
        REQUEST_JOB_INFO_SINGLE                 ( 2021) count:48     ave_time:7851   total_time:376856
        REQUEST_KILL_JOB                        ( 5032) count:38     ave_time:245    total_time:9346
        REQUEST_RESOURCE_ALLOCATION             ( 4001) count:28     ave_time:12730  total_time:356466
        REQUEST_COMPLETE_JOB_ALLOCATION         ( 5017) count:28     ave_time:20504  total_time:574137
        REQUEST_CANCEL_JOB_STEP                 ( 5005) count:7      ave_time:43665  total_time:305661

Remote Procedure Call statistics by user
        xxxxx           (       0) count:979383 ave_time:2500   total_time:2449350389
        xxxxx           (   11160) count:116109 ave_time:695    total_time:80710478
        xxxxx           (   11427) count:1264   ave_time:67572  total_time:85411027
        xxxxx           (   11426) count:149    ave_time:7361   total_time:1096874
        xxxxx           (   12818) count:136    ave_time:11354  total_time:1544190
        xxxxx           (   12475) count:37     ave_time:4985   total_time:184452
        xxxxx           (   12487) count:36     ave_time:30318  total_time:1091483
        xxxxx           (   11147) count:12     ave_time:33489  total_time:401874
        xxxxx           (   11345) count:6      ave_time:584    total_time:3508
        xxxxx           (   12876) count:6      ave_time:483    total_time:2900
        xxxxx           (   11457) count:4      ave_time:345    total_time:1380

Any suggestions/tips would be appreciated.
Rgds

