Hi,

we've been facing the same issue for some time. At first the missing socket error occurred every 20 minutes, later once per hour, and now it happens a few times a day. The only downside was that the controller was unresponsive for a number of seconds each time, up to 60 if I remember correctly. We tried to debug it in many ways, but found no straightforward solution and no clear source of the problem.
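To put numbers on how often the error occurs, a small script along these lines can count the "Unexpected missing socket error" lines per hour in slurmctld.log. This is only a sketch: the regular expression assumes the timestamp and message format shown in the log excerpt quoted further down, and the function name and the log path in the usage comment are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# [2024-07-15T11:45:48.509] error: slurm_send_node_msg: ... failed: Unexpected missing socket error
# Group 1 captures the date and hour, e.g. "2024-07-15T11".
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2}T\d{2}):\d{2}:\d{2}\.\d+\] error: "
    r".*Unexpected missing socket error"
)


def count_missing_socket_errors(lines):
    """Return a Counter mapping 'YYYY-MM-DDTHH' to the number of matching errors."""
    per_hour = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            per_hour[match.group(1)] += 1
    return per_hour


# Usage (the path is an assumption, adjust to your SlurmctldLogFile):
# with open("/var/log/slurm/slurmctld.log") as log:
#     for hour, count in sorted(count_missing_socket_errors(log).items()):
#         print(hour, count)
```

Comparing the hourly buckets before and after a configuration change gives a rough idea of whether that change made a difference.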
Things we've changed since the problem came up:

* enabled per-user RPC rate limiting: `SlurmctldParameters=rl_enable,rl_bucket_size=50,rl_refill_period=1,rl_refill_rate=2,rl_table_size=16384`
* made sure that the VM Slurm runs on uses the "network-latency" profile in `tuned`, and set the same profile on the worker nodes
* implemented some of the recommendations from https://slurm.schedmd.com/high_throughput.html on the controllers
* substantially optimized slurmdb with some housekeeping: cleaning up inactive accounts, associations etc.
* optimized the SSSD configuration on both the controllers and the worker nodes (I believe this one had the biggest impact)

plus plenty of other changes (unrelated, I guess). I'm not really sure whether any of the above helped us significantly in this matter.

Best regards,
Patryk Belzak.

On 24/07/16 03:45, Jason Ellul via slurm-users wrote:
> Hi all,
>
> I am hoping someone can help with our problem. Every hour after restarting
> slurmctld the controller becomes unresponsive to commands for 1 sec,
> reporting errors such as:
>
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
> [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
>
> It occurs consistently at around the hour mark, but generally not at other
> times, unless we run a reconfigure or restart the controller. We don't see
> any issues in the slurmdbd.log and the errors are also always msg type
> RESPONSE. We have tried building a new server on different infrastructure,
> but the problem has persisted. Yesterday we even tried updating slurm to
> v24.05.1 in the hope that may provide a fix. During our troubleshooting we
> have:
>
> Set:
>
>   * SchedulerParameters = max_rpc_cnt=400,sched_min_interval=50000,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
>   * SlurmctldPort = 6808-6817
>
> But although the stats in sdiag have improved we still see the errors.
>
> On our monitoring software we also see a drop in network and disk activity
> during this 1 second, always at approx. 1 hour after restarting the
> controller.
>
> Many Thanks in advance
>
> Jason
>
> Jason Ellul
> Head - Research Computing Facility
> Office of Cancer Research
> Peter MacCallum Cancer Centre
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
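
For readers unfamiliar with the `rl_*` options mentioned above: as far as I understand the slurm.conf documentation, they configure a per-user token bucket. Each RPC consumes one token, `rl_bucket_size` caps the burst, and `rl_refill_rate` tokens are added every `rl_refill_period` seconds. The Python below is only a toy model of that idea under those assumptions, not Slurm's actual implementation; the class and method names are invented for illustration.

```python
class TokenBucket:
    """Toy model of a per-user RPC rate limiter (not Slurm's real code)."""

    def __init__(self, bucket_size=50, refill_period=1.0, refill_rate=2):
        self.bucket_size = bucket_size      # cf. rl_bucket_size
        self.refill_period = refill_period  # cf. rl_refill_period (seconds)
        self.refill_rate = refill_rate      # cf. rl_refill_rate (tokens per period)
        self.tokens = bucket_size           # assumption: the bucket starts full
        self.last_refill = 0.0

    def allow(self, now):
        """Return True if an RPC arriving at time `now` (seconds) is accepted."""
        # Add tokens for every whole refill period that has elapsed.
        periods = int((now - self.last_refill) // self.refill_period)
        if periods > 0:
            self.tokens = min(self.bucket_size,
                              self.tokens + periods * self.refill_rate)
            self.last_refill += periods * self.refill_period
        # Each accepted RPC costs one token.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# With the values from this thread, one user can burst 50 RPCs,
# then sustain roughly 2 RPCs per second.
bucket = TokenBucket(bucket_size=50, refill_period=1.0, refill_rate=2)
accepted_in_burst = sum(bucket.allow(0.0) for _ in range(60))
```

In this model a client that sends 60 requests at once gets 50 through and has to retry the rest, which is the kind of smoothing the rate limit is meant to provide on a busy controller.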