Folks,

We have a Slurm cluster (version 18.06-2) with many nodes, and we are frequently running into the "agent message queue gets longer" issue. After reviewing bug reports with similar symptoms (like https://bugs.schedmd.com/show_bug.cgi?id=5147) and studying the code, I've come to the conclusion that there seems to be an actual bug in the codebase. It doesn't appear to be fixed in 19.05.
Since I'm personally new to Slurm, I thought I'd post and see if anyone who knows more than I do can weigh in. Here is our scenario:

1. A lot of jobs (hundreds?) go into COMPLETING state within a short period of time.

2. The number of messages in the agent message queue (retry_list) keeps increasing until it reaches thousands and never recovers (so we reboot the controller).

3. The number of agent threads (as reported by sdiag) is never very big. Sometimes it is just 2 or 4; at most it is around 35. In no case does the number of agent threads reflect the algorithm in the code (which allows up to 256, minus the number of RPC threads, minus another 10 or 12, threads to be run to dispatch messages when there are hundreds of messages to send).

4. The total number of messages actually delivered to the nodes (based on the debug logs) is around 600 per minute, which is not surprising given that only a few threads are sending them.

5. The scheduler wakes up every minute and enqueues TERMINATE_JOB messages for all the jobs in COMPLETING state, to remind the nodes to kill them (or report that they are done). There are hundreds of these, so if the number of messages being enqueued exceeds the number of messages being delivered, the queue just keeps growing.

6. Eventually the queue is mostly TERMINATE_JOB messages, and all other control messages (e.g. start job) are a minority of the queue. So jobs don't start.

Looking at the background (bug database, support requests), this seems to be a well-understood scenario; however, the triggering events all seem to be related to other bugs (e.g. https://bugs.schedmd.com/show_bug.cgi?id=5111), so it seems the agent queue problem itself remains undiagnosed.

With that as introduction, here is what I see happening. In agent.c, we have the main agent loop (with a lot of code elided with ...):

/* Start a thread to manage queued agent requests */
static void *_agent_init(void *arg)
{
    int min_wait;
    bool mail_too;
    struct timespec ts = {0, 0};

    while (true) {
        slurm_mutex_lock(&pending_mutex);
        while (!... && (pending_wait_time == NO_VAL16)) {
            ts.tv_sec = time(NULL) + 2;
            slurm_cond_timedwait(&pending_cond, &pending_mutex, &ts);
        }
        ...
        min_wait = pending_wait_time;
        pending_mail = false;
        pending_wait_time = NO_VAL16;
        slurm_mutex_unlock(&pending_mutex);

        _agent_retry(min_wait, mail_too);
    }
    ...
}

You can think of the above as the "consumer" loop in a "producer-consumer" pattern: we wait until the condition variable (pending_cond) is signaled, take a message from the queue and dispatch it, and go around again. The producer side of the pattern looks like this:

void agent_queue_request(agent_arg_t *agent_arg_ptr)
{
    ...
    list_append(retry_list, (void *)queued_req_ptr);
    ...
    agent_trigger(999, false);
}

and

extern void agent_trigger(int min_wait, bool mail_too)
{
    slurm_mutex_lock(&pending_mutex);
    if ((pending_wait_time == NO_VAL16) ||
        (pending_wait_time > min_wait))
        pending_wait_time = min_wait;
    if (mail_too)
        pending_mail = mail_too;
    slurm_cond_broadcast(&pending_cond);
    slurm_mutex_unlock(&pending_mutex);
}

Here is the problem: the consumer side (the _agent_init() loop) consumes only one message each time around the loop (that is what _agent_retry() does), regardless of how many messages are in the queue, but more than one message can be added to the queue in that time. For example, suppose _agent_init() is waiting in slurm_cond_timedwait() and the scheduler thread enqueues a lot of TERMINATE_JOB messages. Of course, enqueuing those messages signals the pending condition, but that doesn't guarantee when the _agent_init() thread wakes up, so many messages can be added to the queue before slurm_cond_timedwait() returns. Only one message will be dispatched by _agent_retry(), regardless of how many were added by the scheduler.
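To make the timing concrete, here is a minimal standalone sketch of that failure mode. This is not Slurm code: queue_len, scheduler_thread, agent_thread, shutting_down and NO_WAIT are all made-up stand-ins, and _agent_retry() is reduced to "pop one item". The producer signals pending_cond once per burst of 100 "messages"; the consumer handles exactly one message per wakeup, resets pending_wait_time, and goes back to waiting, the same shape as the loops above.

/*
 * toy_agent.c -- standalone illustration of the pattern above.
 * Not Slurm code: all names here are made-up stand-ins.
 * Build: gcc -pthread toy_agent.c -o toy_agent
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NO_WAIT (-1)                  /* plays the role of NO_VAL16 */

static pthread_mutex_t pending_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  pending_cond  = PTHREAD_COND_INITIALIZER;
static int  pending_wait_time = NO_WAIT;
static int  queue_len = 0;            /* stands in for list_count(retry_list) */
static bool shutting_down = false;    /* stands in for the elided shutdown test */

/* Producer: like the scheduler pass that enqueues a burst of
 * TERMINATE_JOB messages and calls agent_trigger() for each one. */
static void *scheduler_thread(void *arg)
{
    (void) arg;
    for (int pass = 0; pass < 5; pass++) {
        pthread_mutex_lock(&pending_mutex);
        queue_len += 100;             /* burst of 100 messages */
        pending_wait_time = 999;      /* what agent_trigger() sets */
        pthread_cond_broadcast(&pending_cond);
        pthread_mutex_unlock(&pending_mutex);
        sleep(1);                     /* time until the next pass */
    }
    pthread_mutex_lock(&pending_mutex);
    shutting_down = true;
    pthread_cond_broadcast(&pending_cond);
    pthread_mutex_unlock(&pending_mutex);
    return NULL;
}

/* Consumer: same shape as _agent_init() -- wake on the condition
 * variable, dispatch ONE message, reset pending_wait_time, wait again. */
static void *agent_thread(void *arg)
{
    struct timespec ts = {0, 0};

    (void) arg;
    while (true) {
        pthread_mutex_lock(&pending_mutex);
        while (!shutting_down && (pending_wait_time == NO_WAIT)) {
            ts.tv_sec = time(NULL) + 2;
            pthread_cond_timedwait(&pending_cond, &pending_mutex, &ts);
        }
        if (shutting_down) {
            pthread_mutex_unlock(&pending_mutex);
            break;
        }
        pending_wait_time = NO_WAIT;  /* the reset discussed below */
        if (queue_len > 0)            /* stand-in for _agent_retry(): */
            queue_len--;              /* one message per wakeup */
        printf("dispatched 1 message, %d still queued\n", queue_len);
        pthread_mutex_unlock(&pending_mutex);
    }
    return NULL;
}

int main(void)
{
    pthread_t prod, cons;

    pthread_create(&cons, NULL, agent_thread, NULL);
    pthread_create(&prod, NULL, scheduler_thread, NULL);
    pthread_join(prod, NULL);
    pthread_join(cons, NULL);
    printf("producer enqueued 500, final queue length: %d\n", queue_len);
    return 0;
}

Run as written, this prints only a handful of "dispatched 1 message" lines and exits with most of the 500 toy messages still queued, which matches what we see with retry_list: throughput is bounded by how often a new trigger arrives, not by how much work is in the queue.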
In _agent_init(), the last thing we do before unlocking the mutex is reset pending_wait_time to NO_VAL16. The only other place pending_wait_time is set to another value is in agent_trigger(). If no other thread enqueues messages while _agent_retry() is running, the value of pending_wait_time will still be NO_VAL16 when we return to the top of the loop, which means we will always enter the slurm_cond_timedwait() and wait for the next signal, even if there are still messages in the queue. But the next signal arrives only when a new message is enqueued by some other thread. That might take a while, but it will probably happen eventually, since the system is constantly passing messages around. However, the queue will never really empty out: it sends roughly one message each time another message is enqueued.

TL;DR: The message queue keeps growing because the consumer side of the producer-consumer pattern is not guaranteed to consume all the messages that were enqueued before it returns to waiting.

P.S. I'm aware that _agent_retry() is a rather complicated piece of code, and it is possible that some messages in the queue should not be dispatched yet, but that is a separate issue.

P.P.S. What's rather telling is this comment in controller.c:2135:

    /* Process any pending agent work */
    agent_trigger(RPC_RETRY_INTERVAL, true);

That suggests that at one time, someone thought that agent_trigger() was supposed to clear out the queue. Was that true 12 years ago but not true now?

Thanks for your attention. If you think it is fruitful I'll submit a PR, but I would welcome hearing if you think I've got something wrong.

Cheers,

Conrad Herrmann
Sr. Staff Engineer
Zoox