[slurm-users] Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)
Hello,

I am writing to report an issue with the slurmctld process on our RHEL 9 (Rocky Linux) cluster. Twice in the past five days, the slurmctld process has encountered an error that caused the service to stop. The error message displayed was "double free or corruption (out)". This error has caused significant disruption to our jobs, and we are concerned about its recurrence.

We have tried troubleshooting the issue, but we have not been able to identify the root cause of the problem. We would appreciate any assistance or guidance you can provide to help us resolve this issue. Please let us know if you need any additional information or if there are any specific steps we should take to diagnose the problem further.

Thank you for your attention to this matter.

Best regards,

Jul 09 22:12:01 admin slurmctld[711010]: double free or corruption (fasttop)
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Main process exited, code=killed, status=6/ABRT
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Failed with result 'signal'.
Jul 09 22:12:01 admin systemd[1]: slurmctld.service: Consumed 11min 26.451s CPU time.

Jul 14 10:15:01 admin slurmctld[1633720]: double free or corruption (out)
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Main process exited, code=killed, status=6/ABRT
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Failed with result 'signal'.
Jul 14 10:15:02 admin systemd[1]: slurmctld.service: Consumed 7min 27.596s CPU time.

slurmctld -V
slurm 22.05.9

cat /etc/slurm/slurm.conf | grep -v '#'
ClusterName=xxx
SlurmctldHost=admin
SlurmctldParameters=enable_configless
SlurmUser=slurm
AuthType=auth/munge
CryptoType=crypto/munge
SlurmctldPort=6817
StateSaveLocation=/var/spool/slurmctld
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmctldDebug=verbose
DebugFlags=NO_CONF_HASH
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdDebug=verbose
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core,CR_LLN
DefMemPerCPU=1024
MaxMemPerCPU=4096
GresTypes=gpu
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=15
JobCompType=jobcomp/none
TaskPlugin=task/cgroup
LaunchParameters=use_interactive_step
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=admin
AccountingStoragePort=6819
AccountingStorageEnforce=associations
AccountingStorageTRES=gres/gpu
MailProg=/usr/bin/mailx
EnforcePartLimits=YES
MaxArraySize=20
MaxJobCount=50
MpiDefault=none
ReturnToService=2
SwitchType=switch/none
TmpFS=/tmpslurm/
UsePAM=1
InactiveLimit=0
KillWait=30
MessageTimeout=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE,MAX_TRES
PriorityDecayHalfLife=1-0
PriorityWeightFairshare=1
NodeName=xxx NodeHostname=xxx CPUs=4 Sockets=4 RealMemory=3500 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx NodeHostname=xxx CPUs=2 Sockets=2 RealMemory=1700 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx NodeHostname=xxx CPUs=4 Sockets=4 RealMemory=1700 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=xxx NodeHostname=xxx CPUs=4 Sockets=4 RealMemory=3500 TmpDisk=1 CoresPerSocket=1 ThreadsPerCore=1 State=DRAIN
NodeName=r9nc-24-[1-12] NodeHostname=r9nc-24-[1-12] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 CPUs=24 RealMemory=18 State=UNKNOWN
NodeName=r9nc-48-[1-4] NodeHostname=r9nc-48-[1-4] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 CPUs=48 RealMemory=48 State=UNKNOWN
NodeName=r9ng-1080-[1-7] NodeHostname=r9ng-1080-[1-7] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=18 State=UNKNOWN Gres=gpu:1080ti:4
NodeName=r9ng-1080-8 NodeHostname=r9ng-1080-8 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 CPUs=20 RealMemory=176687 State=UNKNOWN Gres=gpu:1080ti:1
PartitionName=24CPUNodes Nodes=r9nc-24-[1-12] State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=7500 DefMemPerCPU=7500 TRESBillingWeights="CPU=1.0,Mem=0.125G" Default=YES
PartitionName=48CPUNodes Nodes=r9nc-48-[1-4] State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=1 DefMemPerCPU=8000 TRESBillingWeights="CPU=1.0,Mem=0.125G"
PartitionName=GPUNodes Nodes=r9ng-1080-[1-7] State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000
PartitionName=GPUNodes1080-dev Nodes=r9ng-1080-8 State=UP MaxTime=UNLIMITED OverSubscribe=NO MaxMemPerCPU=9000 DefMemPerCPU=9000 Hidden=Yes

sinfo
PARTITION    AVAIL TIMELIMIT NODES STATE NODELIST
24CPUNodes*  up    infinite     12 idle  r9nc-24-[1-12]
48CPUNodes   up    infinite      2 idle  r9nc-48-[1-2]
GPUNodes     up    infinite      4 idle  r9ng-1080-[4-7]
GPU
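For reference, one possible way to get more detail on an abort like this is to capture a core dump and a backtrace from the controller. This is only a rough sketch, assuming systemd-coredump and the Slurm debuginfo packages are available on the controller host; package and unit names may differ on your system:

# allow slurmctld to write core dumps (systemd drop-in)
sudo systemctl edit slurmctld      # add under [Service]:  LimitCORE=infinity
sudo systemctl restart slurmctld

# optional: install debug symbols so the backtrace is readable
sudo dnf debuginfo-install slurm

# after the next crash, inspect the dump
coredumpctl list slurmctld
coredumpctl info slurmctld
coredumpctl gdb slurmctld          # then 'bt full' at the gdb prompt

A backtrace showing where the double free happens would also be useful if you end up reporting the issue to SchedMD.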
[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)
On 7/15/24 10:43, William VINCENT via slurm-users wrote:

> I am writing to report an issue with the slurmctld process on our RHEL 9 (Rocky Linux) cluster. Twice in the past five days, the slurmctld process has encountered an error that caused the service to stop. The error message displayed was "double free or corruption (out)". This error has caused significant disruption to our jobs, and we are concerned about its recurrence. We have tried troubleshooting the issue, but we have not been able to identify the root cause of the problem. We would appreciate any assistance or guidance you can provide to help us resolve this issue. Please let us know if you need any additional information or if there are any specific steps we should take to diagnose the problem further.

You're running Slurm 22.05.9 on Rocky Linux 9 (is that Rocky 9.4, or what?). Such an old Slurm version probably hasn't been tested much on EL9 systems.

For security reasons you ought to upgrade to a recent Slurm version; just search for "CVE" in https://github.com/SchedMD/slurm/blob/master/NEWS to find out about security holes in older versions.

You can upgrade by two major releases in a single step, so you can go to 23.11.8. Upgrading Slurm is fairly easy, and I've collected various pieces of advice in the Wiki page https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

Hopefully a newer Slurm version is going to solve your issue.

I hope this helps,
Ole
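As a quick way to see which security fixes an installation is missing, something like the following should work (a sketch; the raw NEWS URL is derived from the repository link above, and the upgrade-order comment reflects the general guidance in the Slurm upgrade documentation):

# list the CVE entries mentioned in the Slurm NEWS file
curl -s https://raw.githubusercontent.com/SchedMD/slurm/master/NEWS | grep -n CVE

# compare against the version currently running on the controller
slurmctld -V

# typical upgrade order, per the upgrade guide:
#   1. slurmdbd   2. slurmctld   3. slurmd on the compute nodes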
[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)
Thank you for your response. I hadn't considered that version 22 could be the problem. I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install .rpm files directly, because our policy is to apply security updates every night via the repositories; unfortunately, in this case that does not work. I think it is because only one person is responsible for maintaining the packages for RHEL.

I have already reported the security issue, but at the moment it does not seem possible to update: https://bugzilla.redhat.com/show_bug.cgi?id=2280545

It appears from another ticket that the compilation fails for version 24: https://bugzilla.redhat.com/show_bug.cgi?id=2259935

If the compilation fails, will the RPM package work on RHEL 9?
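For what it's worth, one way to confirm which repository the installed Slurm packages actually came from (a sketch; the exact package names may differ on a given install):

# installed slurm packages and the repo they were installed from (e.g. @epel)
dnf list installed 'slurm*'

# packager/vendor recorded in the RPM metadata
rpm -qi slurm | grep -E '^(Version|Release|Packager|Vendor)'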
[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)
On 7/15/24 11:35, William V via slurm-users wrote:

> Thank you for your response. I hadn't considered that version 22 could be the problem. I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install .rpm files directly, because our policy is to apply security updates every night via the repositories; unfortunately, in this case that does not work. I think it is because only one person is responsible for maintaining the packages for RHEL.

You should *NOT* use Slurm packages from the EPEL repository!! The Slurm documentation recommends excluding those packages, see https://slurm.schedmd.com/upgrades.html#epel_repository

> I have already reported the security issue, but at the moment it does not seem possible to update: https://bugzilla.redhat.com/show_bug.cgi?id=2280545

Red Hat doesn't provide support for Slurm; if necessary, you should contact SchedMD to obtain Slurm support.

> It appears from another ticket that the compilation fails for version 24: https://bugzilla.redhat.com/show_bug.cgi?id=2259935

I think this ticket only reports problems regarding older Slurm releases?

> If the compilation fails, will the RPM package work on RHEL 9?

You should build your own Slurm RPM packages, and a compilation failure would indicate a bug somewhere!

Just as a test, I've now built RPM packages of the currently supported Slurm releases 23.11.8 and 24.05.1 on a Rocky Linux 9.4 system. The RPMs built without any issues or compilation errors at all! I haven't tested these RPMs on our production cluster, which runs EL8 :-)

I recommend that you consult the Slurm documentation page [1] and my Wiki page for Slurm installation: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/

Remember to install all prerequisite packages before building Slurm, as explained in the Wiki!

Best regards,
Ole

[1] https://slurm.schedmd.com/documentation.html
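For readers following along, the two steps above boil down to roughly the following (a sketch; the tarball version is the one named in this thread, and the prerequisite package list is partial — see the Wiki page and the Slurm documentation for the full set):

# 1. Stop EPEL's slurm packages from being pulled in during nightly updates:
#    add "exclude=slurm*" under the [epel] section of /etc/yum.repos.d/epel.repo

# 2. Build site-local RPMs from the official tarball on a Rocky 9 build host
sudo dnf install -y rpm-build gcc make munge-devel pam-devel readline-devel perl-ExtUtils-MakeMaker
curl -O https://download.schedmd.com/slurm/slurm-23.11.8.tar.bz2
rpmbuild -ta slurm-23.11.8.tar.bz2

# 3. The resulting packages normally land under ~/rpmbuild/RPMS/<arch>/
ls ~/rpmbuild/RPMS/x86_64/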
[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)
Wow, thank you so much for all this information and the installation wiki. I have a lot of work to do to change the infrastructure; I hope it will go smoothly.
[slurm-users] Re: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one
Hi João,

Did you get this problem solved? I have the exact same problem and would be very interested. Help would be greatly appreciated!

Thank you and best regards,
Andi
[slurm-users] Re: Custom Plugin Integration
Hi Daniel,

Thanks for picking up this query. Let me try to briefly describe my problem.

As you rightly guessed, we have some hardware on the backend which would be used to run our jobs. The application which manages the hardware has its own set of resource placement/remapping rules for placing a job. So, for example, if only three hosts h1, h2, h3 (2 cores available each) are available at some point for a 4-core job, then only a few combinations of cores from these hosts can be allowed for the job. There is also a preference order for the placements, decided by our application. It is in this respect that we want our backend application to provide the placement for the job; Slurm would then dispatch the job accordingly, honoring the exact resource distribution that was asked for. When preemption is needed, our backend would likewise decide the placement, which in turn determines which preemptable job candidates to preempt.

So, how should we proceed? We may not have the whole site/cluster to ourselves. There may be other jobs which we don't care about, and they should go through the usual route via the existing select plugin (linear, cons_tres, etc.). Is there scope for a separate partition which would encompass our resources only and trigger our plugin only for our jobs? How do the options a>, b>, c> stand (as described in my first message), now that I have explained our requirement?

A fourth option which comes to my mind is an API interface from Slurm that informs a separate process P (say) about resource availability in real time. P would talk to our backend application, obtain a placement, and then ask Slurm to place our job.

Your concern about ever-changing resources (being allocated before our backend comes up) does not apply, as the hosts are segregated as far as our system is concerned. Our hosts will run only our jobs, and other Slurm jobs would run on different hosts.

I hope I have made myself a little clearer! Any help would be appreciated.

(Note: we already have a working solution with LSF! LSF provides an option for custom scheduler plugins that lets one hook into the decision-making loop during scheduling. This led us to believe Slurm would also have some possibilities.)

Regards,
Bhaskar.
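As a rough illustration of the fourth option described above (an external process P that polls availability and pins placement), something like the following could be built from standard Slurm CLI calls alone. This is a sketch only: the partition name, host names, and job script are hypothetical, and it does not rely on any dedicated SchedMD placement API.

# 1. P polls current availability on the segregated hosts
sinfo --Node --noheader -p ourpartition -o "%N %C %m %t"
#    %C prints allocated/idle/other/total CPUs per node

# 2. P hands that snapshot to the backend app, which returns a placement,
#    e.g. "h1:2, h2:1, h3:1" for a 4-core job

# 3. P submits the job pinned to those hosts; an exact per-host task layout
#    would additionally need something like SLURM_HOSTFILE together with
#    srun --distribution=arbitrary inside the job script
sbatch --partition=ourpartition --nodelist=h1,h2,h3 --nodes=3 --ntasks=4 job.sh

The obvious trade-off of this approach is that placement decisions are made outside the scheduler, so P has to cope with the window between polling and submission.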
[slurm-users] slurmctld hourly: Unexpected missing socket error
Hi all,

I am hoping someone can help with our problem. Every hour after restarting slurmctld, the controller becomes unresponsive to commands for about 1 second, reporting errors such as:

[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934760]] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934875]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934906]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error
[2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[939016]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO) failed: Unexpected missing socket error

It occurs consistently at around the hour mark, but generally not at other times, unless we run a reconfigure or restart the controller. We don't see any issues in the slurmdbd.log, and the errors are always of msg_type RESPONSE. We have tried building a new server on different infrastructure, but the problem has persisted. Yesterday we even tried updating Slurm to v24.05.1 in the hope that it might provide a fix.

During our troubleshooting we have set:

* SchedulerParameters = max_rpc_cnt=400,sched_min_interval=5,sched_max_job_start=300,batch_sched_delay=20,bf_resolution=600,bf_min_prio_reserve=2000,bf_min_age_reserve=600
* SlurmctldPort = 6808-6817

Although the stats in sdiag have improved, we still see the errors. In our monitoring software we also see a drop in network and disk activity during this 1 second, always at approx. 1 hour after restarting the controller.

Many thanks in advance,
Jason

Jason Ellul
Head - Research Computing Facility
Office of Cancer Research
Peter MacCallum Cancer Centre
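One way to narrow down what fires on the controller host exactly at the hour mark is sketched below; nothing here is specific to this site, and the commands assume the controller runs systemd and that the slurm user may have its own crontab.

# anything scheduled hourly on the controller host?
systemctl list-timers --all
ls /etc/cron.hourly/
sudo crontab -l -u slurm 2>/dev/null

# watch the controller's RPC thread/queue counters around the hour mark
watch -n 5 'sdiag | grep -A2 "Server thread"'

# see what slurmctld logged right around the stall
journalctl -u slurmctld --since "10 minutes ago" | tail -n 50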