We have a problem trying to get more control over how the node() step decides which node to allocate an executor on. Specifically, we have a pool of nodes with a specific label, all of which are capable of executing a given task, but with a strong preference to run each task on the same node that ran it before. (Note that these tasks are simply different pieces of code within a single pipeline, running in parallel.) This is what Jenkins does normally, at job granularity, but as JENKINS-36547 <https://issues.jenkins-ci.org/browse/JENKINS-36547> says, all tasks scheduled from any given pipeline are given the same hash, which means the load balancer has no idea which tasks should be assigned to which node. In our situation, only a single pipeline ever assigns jobs to this pool of nodes.
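To make the setup concrete, the pipeline is roughly shaped like this (scripted Pipeline; the label name and shard count here are made up for illustration):

// All of these branches come from a single pipeline run, so per JENKINS-36547
// they all get the same hash and the load balancer has no per-task affinity.
def branches = [:]
for (int i = 0; i < 10; i++) {
    def idx = i // capture the loop variable for the closure
    branches["shard-${idx}"] = {
        node('my-node-pool') {
            // This work runs two to ten times faster when it lands on the
            // same node it ran on in the previous build.
            echo "shard ${idx} running on ${env.NODE_NAME}"
        }
    }
}
parallel branches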
So far we have worked around the issue by assigning a different label to each and every node in the pool, but that has a problem of its own: if any node in the pool goes down for any reason, its task will not be reassigned to another node, and the whole pipeline will hang or time out. We have worked around *that* by assigning each task to "my-node-pool-# || my-node-pool-fallback", where my-node-pool-fallback is a label covering a few standby nodes, so that if one of the primary nodes goes down the pipeline as a whole can still complete. It will be slower (these tasks can take two to ten times longer when they don't run on the same node they ran on last time), but it will at least complete.

Unfortunately, the label expression doesn't actually mean "first try to schedule on the first node in the OR, then use the second one if the first is unavailable." In practice, some tasks usually get scheduled on a fallback node even though their "assigned" node is still available. As a result, almost every run of this pipeline ends up taking the worst-case time: it is likely that *some* task will wander off its assigned node onto a fallback, which leads to the fallback nodes being over-scheduled while other nodes sit idle.

The question is: what are our options? One hack we've considered is gaming the scheduler with sleep()s: first occupy all the fallback nodes with tasks that do nothing but sleep(), then schedule all our real tasks (which will now go to their assigned machines whenever possible, because the fallbacks are busy sleeping), and finally let the sleeps complete so that any tasks which couldn't execute on their assigned machines run on the fallbacks. (There is a rough sketch of this in the P.S. below.) A better solution would probably be a LoadBalancer plugin that codifies this somehow: schedule tasks only on their assigned label at first, and fall back only after 30 seconds or a minute of waiting.

Is anyone out there dealing with similar issues, or does anyone know of a solution I have overlooked?

Thanks,
John Calsbeek
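P.S. For concreteness, here is roughly the sleep() hack we have in mind, as scripted Pipeline. This is only a sketch: the label names ("my-node-pool-N", "my-node-pool-fallback"), the shard count, the fallback executor count, and the delay values are made up for illustration.

def FALLBACK_EXECUTORS = 3 // assumed number of executors across the fallback nodes
def SHARDS = 10            // assumed number of parallel tasks
def branches = [:]

// Phase 1: tie up every fallback executor with a do-nothing sleep, so the real
// tasks below see the fallbacks as busy and get queued onto their assigned nodes.
for (int i = 0; i < FALLBACK_EXECUTORS; i++) {
    branches["hold-fallback-${i}"] = {
        node('my-node-pool-fallback') {
            sleep time: 60, unit: 'SECONDS' // long enough for the real tasks to claim their nodes
        }
    }
}

// Phase 2: the real tasks, each asking for its assigned node OR the fallback
// label. While the sleeps are running they can only land on their assigned
// nodes; once the sleeps finish, any task that couldn't get its node falls
// through to a fallback.
for (int i = 0; i < SHARDS; i++) {
    def idx = i // capture the loop variable for the closure
    branches["shard-${idx}"] = {
        // Short delay (an untuned guess) so the hold-fallback branches grab
        // the fallback executors before these tasks enter the queue.
        sleep time: 5, unit: 'SECONDS'
        node("my-node-pool-${idx} || my-node-pool-fallback") {
            echo "shard ${idx} running on ${env.NODE_NAME}"
            // ... the actual work for this shard goes here ...
        }
    }
}

parallel branches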