Hello everyone,

I’ve recently encountered an issue where some nodes in our cluster are put into the drain state, seemingly at random and typically after completing long-running jobs. Below is the output of |sinfo -R|, which shows the reason *“Prolog error”*:

    root@controller-node:~# sinfo -R
    REASON         USER    TIMESTAMP             NODELIST
    Prolog error   slurm   2024-09-24T21:18:05   node[24,31]

When checking the |slurmd.log| files on the nodes, I noticed the following errors:

    [2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step.
    (repeated 90 times)
    [2024-09-24T17:18:22.917] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3313158 to jobacct_gather plugin in the extern_step.
    ...
    [2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514
    [2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec
    [2024-09-24T21:18:05.166] error: slurm_send_node_msg: [(null)] slurm_bufs_sendto(msg_type=RESPONSE_SLURM_RC_MSG) failed: Unexpected missing socket error
    [2024-09-24T21:18:05.166] error: _rpc_launch_tasks: unable to send return code to address:port=<SLURMCTLD_IP>:53514 msg_type=6001: No such file or directory
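
In case it helps anyone compare nodes, the error lines above could be tallied per job step with a small script along these lines. This is only a minimal sketch: the function name |summarize_errors|, the regular expression and the sample list are my own assumptions based on the log format shown above, not anything provided by Slurm.

```python
import re
from collections import Counter

# Matches slurmd.log error lines such as
#   [2024-09-24T17:18:22.386] [217703.extern] error: some message
# The "[step]" tag is optional because some error lines carry none.
# (Assumed format, inferred only from the excerpt in this post.)
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:\[(?P<step>[^\]]+)\]\s+)?error:\s+(?P<msg>.*)$"
)


def summarize_errors(lines):
    """Count error lines per (step tag, first token of the message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an error line, e.g. "launch task ..."
        step = match.group("step") or "-"
        # Text before the first colon, cut at the first space: this is the
        # reporting function name when the message starts with "name: ...".
        key = match.group("msg").split(":", 1)[0].split(" ", 1)[0]
        counts[(step, key)] += 1
    return counts


# Hypothetical sample built from lines quoted in this post.
SAMPLE = [
    "[2024-09-24T17:18:22.386] [217703.extern] error: _handle_add_extern_pid_internal: Job 217703 can't add pid 3311892 to jobacct_gather plugin in the extern_step.",
    "[2024-09-24T21:17:45.162] launch task StepId=217703.0 request from UID:54059 GID:1600 HOST:<SLURMCTLD_IP> PORT:53514",
    "[2024-09-24T21:18:05.166] error: Waiting for JobId=217703 REQUEST_LAUNCH_PROLOG notification failed, giving up after 20 sec",
]

if __name__ == "__main__":
    for (step, key), n in summarize_errors(SAMPLE).items():
        print(step, key, n)
```

Running it over the full |slurmd.log| of each drained node (for example by passing the lines of the opened file) would show whether the |jobacct_gather| errors and the prolog timeout always involve the same job.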

If you know how to solve these errors, please let me know. I would greatly appreciate any guidance or suggestions for further troubleshooting.

Thank you in advance for your assistance.

Best regards,

--
Télécom Paris <https://www.telecom-paris.fr>      
*Nacereddine LADDAOUI*
Research and Development Engineer

19 place Marguerite Perey
CS 20031
91123 Palaiseau Cedex
An IMT school <https://www.imt.fr>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
