Hi,

We're running into an issue where slurmctld core-dumps with the error below. It happens on the backup controller when it has to take over from the primary _for a second time_.

    slurmctld: fatal: bit_cache_init: cannot change size once set

Has anyone seen this error before? If there are any existing discussions or tickets related to it, please let me know. Our Slurm version is 24.11.1.

________________

Steps to reproduce:

1. On a healthy cluster, we make the primary controller unavailable. Since we're running our cluster on cloud VMs, we do this by stopping the primary controller VM.
2. From the logs we can see the backup controller take over and log the message "Running as primary controller".
3. We then start the primary again, making sure the IP addresses and hostnames stay consistent. Once slurmctld on the primary has started and taken back control, we can see the log message "Running as primary controller" on that VM.
4. We then stop the primary controller VM again, causing the backup to try to take control a second time. This time, however, slurmctld on the backup core-dumps, with the following log entries from journalctl -u slurmctld:

    slurmctld: fatal: bit_cache_init: cannot change size once set
    slurmctld.service: Main process exited, code=dumped, status=6/ABRT
    slurmctld.service: Failed with result 'core-dump'.

Thanks!
- Safdar