Hi,
we've been facing the same issue for some time. At the beginning the missing 
socket error happened every 20 minutes, later once per hour, and now it happens 
a few times a day.
The only downside was that the controller was unresponsive for a number of 
seconds each time (up to 60, if I remember correctly).
We tried to debug it in many ways, but we found neither a straightforward 
solution nor a clear source of the problem.
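
To track how the frequency changes over time, something like the sketch below can count the errors per hour. It is only an illustration: it assumes each error is a single line in the slurmctld log in the format quoted in the original message below, and the function name `count_errors_per_hour` is made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches one complete "missing socket" error line as it would appear
# (unwrapped) in the slurmctld log; the format is taken from the quoted
# log lines in this thread.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+\] "
    r"error: slurm_send_node_msg: .*Unexpected missing socket error"
)


def count_errors_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:00' to the error count."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not a missing-socket error line
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
        counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts
```

Passing an open log file to `count_errors_per_hour` and printing the sorted items gives one line per hour, which makes a periodic pattern easy to spot. The location of the log file depends on `SlurmctldLogFile` on your controller.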

Things we've changed since the problem came up:
* per-user RPC rate limiting: 
`SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
* made sure that the VM that Slurm runs on has the "network-latency" profile in 
`tuned`, with the same profile on the worker nodes
* implemented some of the recommendations from 
https://slurm.schedmd.com/high_throughput.html on the controllers
* largely optimized slurmdb by doing some housekeeping and cleaning up inactive 
accounts, associations etc.
* optimized the SSSD configuration (this one I believe had the biggest impact) 
both on the controllers and on the worker nodes
plus plenty of other (probably unrelated) changes.

I'm not really sure if any of the above helped us significantly in that matter.
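
To give a feel for what the rate limit values above mean, here is a minimal token bucket sketch. It assumes the limiter behaves like a plain token bucket that starts full and gains `rl_refill_rate` tokens every `rl_refill_period` seconds. The class and method names are invented for this illustration and this is not Slurm's actual code.

```python
class TokenBucket:
    """Toy per-user token bucket using the rl_* values from our config."""

    def __init__(self, bucket_size=50, refill_period=1, refill_rate=2):
        self.bucket_size = bucket_size      # rl_bucket_size
        self.refill_period = refill_period  # rl_refill_period, in seconds
        self.refill_rate = refill_rate      # rl_refill_rate, tokens per period
        self.tokens = bucket_size           # assumption: bucket starts full
        self.last_refill = 0                # time of the last refill, seconds

    def allow(self, now):
        """Return True if an RPC arriving at time `now` (seconds) is accepted."""
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            # Add tokens for every elapsed period, capped at the bucket size.
            self.tokens = min(self.bucket_size,
                              self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False
```

Under these assumptions a single user can burst up to 50 RPCs at once and is then held to about 2 RPCs per second until the bucket refills.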

Best regards,
Patryk Belzak.

On 24/07/16 03:45, Jason Ellul via slurm-users wrote:
> Hi all,
> 
> I am hoping someone can help with our problem. Every hour after restarting 
> slurmctld the controller becomes unresponsive to commands for 1 sec, 
> reporting errors such as:
> 
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] 
> slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing 
> socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] 
> slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing 
> socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] 
> slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing 
> socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] 
> slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing 
> socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] 
> slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing 
> socket error
> 
> It occurs consistently at around the hour mark, but generally not at other 
> times, unless we run a reconfigure or restart the controller. We don’t see 
> any issues in the slurmdbd.log and the errors are also always msg type 
> RESPONSE. We have tried building a new server on different infrastructure, 
> but the problem has persisted. Yesterday we even tried updating slurm to 
> v24.05.1 in the hope that may provide a fix. During our troubleshooting we 
> have:
> Set:
> 
>   * SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
>   * SlurmctldPort = 6808-6817
> 
> But although the stats in sdiag have improved we still see the errors.
> 
> On our monitoring software we also see a drop in network and disk activity 
> during this 1 second, always at approx. 1 hour after restarting the 
> controller.
> 
> Many Thanks in advance
> 
> Jason
> 
> Jason Ellul
> Head - Research Computing Facility
> Office of Cancer Research
> Peter MacCallum Cancer Centre

> 
> -- 
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
