We have a problem trying to get more control over how the node() step
decides which node to allocate an executor on. Specifically, we have a pool
of nodes sharing a single label, all of which are capable of executing a
given task, but with a strong preference to run each task on the same node
that ran it before. (Note that these tasks are simply different pieces of
code within a single pipeline, running in parallel.) This is what Jenkins
does normally, at job granularity, but as
JENKINS-36547 <https://issues.jenkins-ci.org/browse/JENKINS-36547> says,
all tasks scheduled from any given pipeline are given the same hash, which
means the load balancer has no idea which tasks should be assigned to which
node. In our situation, only a single pipeline ever assigns jobs to this
pool of nodes.
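
For concreteness, here is a stripped-down sketch of what the pipeline does
today (the labels, shard names, and run-task.sh are made up for
illustration):

def branches = [:]
for (int i = 0; i < 8; i++) {
    def shard = i  // capture the loop variable for the closure
    branches["shard-${shard}"] = {
        // Every branch asks for the same shared label, so the load
        // balancer has no way of knowing that shard N should go back to
        // the node that ran shard N last time.
        node('my-node-pool') {
            sh "./run-task.sh ${shard}"
        }
    }
}
parallel branches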

So far we have worked around the issue by assigning a different label to
each and every node in the pool in question, but this introduces a new
problem: if any node in that pool goes down for any reason, its task will
not be reassigned to another node, and the whole pipeline will hang or time
out.
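
In other words, each branch is now pinned to its own single-node label,
roughly like this (again a sketch with made-up names):

branches["shard-${shard}"] = {
    // Each "my-node-pool-<n>" label is applied to exactly one node, so the
    // task always lands on the same machine. If that machine is offline,
    // this node() step simply blocks.
    node("my-node-pool-${shard}") {
        sh "./run-task.sh ${shard}"
    }
}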

We have worked around *that* by assigning each task to "my-node-pool-# ||
my-node-pool-fallback", where my-node-pool-fallback is a label applied to a
few standby nodes, so that if one of the primary nodes goes down the
pipeline as a whole can still complete. It will be slower (these tasks can
take two to ten times longer when not running on the node they ran on last
time), but it will at least complete.
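
Concretely, each branch now requests a label expression along these lines
(sketch; the real labels differ):

branches["shard-${shard}"] = {
    // Either the dedicated node or any standby node satisfies this
    // expression; Jenkins treats both sides of the || as equally good.
    node("my-node-pool-${shard} || my-node-pool-fallback") {
        sh "./run-task.sh ${shard}"
    }
}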

Unfortunately, the label expression doesn't actually mean "first try to
schedule on the first label in the OR, then use the second one if the first
is not available." Instead, there will usually be some tasks that schedule
on a fallback node even though the node they are "assigned" to is still
available. As a result, almost every run of this pipeline ends up taking
the worst-case time: it is likely that *some* task will wander away from
its assigned node to run on a fallback, which leaves the fallback nodes
over-scheduled while other nodes sit idle.

The question is: what are our options? One hack we've considered is trying
to game the scheduler with sleep()s: first occupy all the fallback nodes
with a task that does nothing but sleep(), then schedule all our real tasks
(which will now go to their assigned machines whenever possible, because
the fallback nodes are busy sleeping), and finally let the sleeps complete
so that any tasks which couldn't run on their assigned machines execute on
the fallbacks; see the sketch below. A better solution would probably be a
LoadBalancer plugin that codifies this behavior: schedule tasks on their
assigned label first, and fall back only after 30 seconds or a minute.
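
To make the sleep() hack concrete, here is roughly what we had in mind
(an untested sketch; NUM_FALLBACK_NODES, NUM_SHARDS, the labels, and the
sleep durations are all placeholders):

def branches = [:]

// Step 1: occupy every fallback executor with a do-nothing task so the
// real tasks below cannot be scheduled there yet.
for (int i = 0; i < NUM_FALLBACK_NODES; i++) {
    branches["occupy-fallback-${i}"] = {
        node('my-node-pool-fallback') {
            sleep time: 60, unit: 'SECONDS'
        }
    }
}

// Step 2: launch the real tasks. Each one waits briefly before asking for
// a node, giving the occupy tasks time to claim the fallback executors, so
// that the only free node matching its expression is its assigned machine.
for (int i = 0; i < NUM_SHARDS; i++) {
    def shard = i
    branches["shard-${shard}"] = {
        sleep time: 15, unit: 'SECONDS'  // crude ordering; not guaranteed
        node("my-node-pool-${shard} || my-node-pool-fallback") {
            sh "./run-task.sh ${shard}"
        }
    }
}

// Step 3: once the sleeps expire, any task whose assigned node is down can
// finally grab a fallback executor instead of hanging forever.
parallel branches

Even then the ordering is only probabilistic: nothing guarantees that the
occupy tasks win the race for the fallback executors, which is part of why
a LoadBalancer-level solution seems cleaner.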

Is anyone out there dealing with similar issues, or does anyone know of a
solution that I have overlooked?

Thanks,
John Calsbeek
