On Friday, October 28, 2016 at 9:14:53 PM UTC-7, Michael Lasevich wrote:
>
> Is there a way to reduce the need for tasks to run on the same slave? I 
> suspect the issue is having data from the last run - if that is the case, 
> is there any shared storage solution that may reduce the time difference? 
> If you can reduce the need for binding tasks to specific nodes, you bypass 
> your entire headache.
>

Shared storage is a potential option, yes, but the tasks in question are 
currently not very fault-tolerant when it comes to network hitches.
 

> A minor point is that you may consider that since a node can have multiple 
> labels, your nodes can have individual labels AND a shared label - meaning 
> your fallback can be shared among all the existing nodes.
>

Yes, we do this in some cases already. But the core issue remains: a task 
that happens to schedule on a fallback node is still not running on its 
preferred node.
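For reference, the shared-fallback idea can also be expressed directly in the 
node() step with a label expression (node and label names here are 
illustrative, and runTask() stands in for whatever the task actually does):

```groovy
// Accept either the preferred node or anything carrying the shared label.
// Note this is a plain OR: Jenkins will pick any matching idle executor,
// so it does not by itself prefer 'node-a' over the fallback pool -- which
// is exactly the problem described above.
node('node-a || fallback-pool') {
    runTask()   // hypothetical task step
}
```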
 

> But more to the point, if your main issue is that you are worried that a 
> node may be unavailable, you may consider some automatic node allocation. I 
> am not sure if there are other examples, but for example the AWS node 
> allocation can automatically allocate a new node if no threads are 
> available for a label. That may be a decent backup strategy. If you are not 
> using AWS, you can check whether another node provisioning plugin fits, or 
> failing that, look at how they do it and write your own plugin.
>

Assuming that we have a fixed amount of computing resources, does this have 
any advantage over writing a LoadBalancer plugin?
 

> But maybe I am overthinking it. In the end, if your primary concern is 
> that a node may be down - remember that a pipeline is Groovy code, and that 
> code has access to the Jenkins API/internals. You can write some code that 
> checks the state of the slaves and selects a label to use before you even 
> get to the node() statement. Sure, that will not fix the issue of a node 
> going down in the middle of a job, but it may catch the job before it 
> assigns a task to a dead node.
>

Ah, that's an interesting idea. Something I forgot to mention in the 
original post: if there were a node() step that allocated with a timeout, 
that would also be a building block we could use to fix this problem. (If 
attempting to allocate a specific node times out, schedule on a fallback 
instead. A plain timeout() doesn't work, because it would apply the timeout 
to the task as well, not merely to the attempt to allocate the node.) We 
could indeed query the status of nodes directly. I have a niggling doubt 
about whether it is possible to do this without a race condition (what if 
the node goes down between querying its status and scheduling on it?), but 
it's definitely worth investigating.
 

> Alternatively, you can simply write another job, in lieu of a plugin, that 
> will scan all your tasks and nodes and if it detects a node down and a task 
> waiting for it, assign the label to another node from the "standby" pool
>

This is an idea we had considered, yeah, although I was thinking of it as a 
first step in the pipeline before scheduling, which made me nervous about 
race conditions. But if, as you suggest, it were a frequently run job that 
is always attempting to set up node allocations… that could definitely 
work. Good suggestion, thanks!
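For the record, a rough sketch of what such a watchdog job might do, assuming 
it runs as a system Groovy script with admin rights; the node names, the 
'standby' pool label, and the pinned-node-to-task-label mapping are all made 
up, and exact API details may vary by Jenkins version:

```groovy
import jenkins.model.Jenkins

// For each (pinned node, task label) pair: if the pinned node is offline,
// attach the task's label to an online node from the standby pool so that
// queued builds waiting on that label can proceed there.
def pins = ['node-a': 'task-foo']   // illustrative mapping
def jenkins = Jenkins.instance

pins.each { nodeName, taskLabel ->
    def pinned = jenkins.getNode(nodeName)
    if (pinned?.toComputer()?.isOffline()) {
        def standby = jenkins.nodes.find { n ->
            n.labelString.contains('standby') && n.toComputer()?.isOnline()
        }
        if (standby && !standby.labelString.contains(taskLabel)) {
            standby.setLabelString(standby.labelString + ' ' + taskLabel)
            jenkins.save()          // persist the label change
        }
    }
}
```

Run on a short timer, this sidesteps the in-pipeline race: the relabeling 
happens continuously rather than once at the moment of scheduling.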

Thanks,
John Calsbeek

-- 
You received this message because you are subscribed to the Google Groups 
"Jenkins Users" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/jenkinsci-users/778037e3-b16e-42d9-942b-60e0e1e47bb6%40googlegroups.com.