[slurm-users] Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread William VINCENT via slurm-users
Hello, I am writing to report an issue with the Slurmctld process on our RHEL 9 (Rocky Linux). Twice in the past 5 days, the Slurmctld process has encountered an error that resulted in the service stopping. The error message displayed was "double free or corruption (out)". This error has cau

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread Ole Holm Nielsen via slurm-users
On 7/15/24 10:43, William VINCENT via slurm-users wrote: I am writing to report an issue with the Slurmctld process on our RHEL 9 (Rocky Linux). Twice in the past 5 days, the Slurmctld process has encountered an error that resulted in the service stopping. The error message displayed was "do

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread William V via slurm-users
Thank you for your response, I hadn't considered that version 22 could be the problem. I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install .rpm directly because our policy is to apply security updates every night via the r

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread Ole Holm Nielsen via slurm-users
On 7/15/24 11:35, William V via slurm-users wrote: Thank you for your response, I hadn't considered that version 22 could be the problem. I am aware that we are not up to date, but we use the EPEL repo for our RPM packages. Originally, we did not want to install .rpm directly because our poli

[slurm-users] Re: Slurmctld process error 'double free or corruption' on RHEL 9 (Rocky Linux)

2024-07-15 Thread William V via slurm-users
Wow, thank you so much for all this information and the installation wiki. I have a lot of work to do to change the infrastructure; I hope it will go smoothly. -- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

[slurm-users] Re: _refresh_assoc_mgr_qos_list: no new list given back keeping cached one

2024-07-15 Thread andreas.wiedholz--- via slurm-users
Hi João, did you get this problem solved? I have the exact same problem and would be very interested. Help would be greatly appreciated! Thank you and best regards, Andi

[slurm-users] Re: Custom Plugin Integration

2024-07-15 Thread jubhaskar--- via slurm-users
Hi Daniel, Thanks for picking up this query. Let me try to briefly describe my problem. As you rightly guessed, we have some hardware on the backend which would be used for our jobs to run. The app which manages the h/w has its own set of resource placement/remapping rules to place a job. So, fo

[slurm-users] slurmctld hourly: Unexpected missing socket error

2024-07-15 Thread Jason Ellul via slurm-users
Hi all, I am hoping someone can help with our problem. Every hour after restarting slurmctld the controller becomes unresponsive to commands for 1 sec, reporting errors such as: [2024-07-15T11:45:48.509] error: slurm_send_node_msg: [socket:[934767]] slurm_bufs_sendto(msg_type=RESPONSE_JOB_INFO