[jira] [Commented] (AIRFLOW-3285) lazy marking of upstream_failed task state

Kevin McHale (JIRA) Wed, 07 Nov 2018 12:24:02 -0800


    [ 
https://issues.apache.org/jira/browse/AIRFLOW-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16678729#comment-16678729
 ]


Kevin McHale commented on AIRFLOW-3285:
---------------------------------------

Hi [~ashb], thanks for your response, but I don't follow how a combination 
trigger rule like that would solve this problem for us.  Perhaps my 
description above does not make clear that we are not trying to alter any 
actual triggering behavior.  Rather, we are trying to alter a side effect of 
the trigger rule evaluation: namely, the marking of the task as being in the 
{{upstream_failed}} state.

The code notes in a 
[comment|https://github.com/apache/incubator-airflow/blob/master/airflow/ti_deps/deps/trigger_rule_dep.py#L136-L139]
 that the trigger rules would be better off as separately implemented classes 
(which I agree with).  If that were done, it would be much easier for us to get 
the logic that we want by defining a new trigger rule class (not a combination 
rule, but a rule that has different {{upstream_failed}} evaluation properties). 
 I could perhaps take on this TODO and restructure the trigger_rule code, if I 
had a commitment that such a change would be accepted upstream.
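
As a rough sketch of that idea (this is not Airflow's actual code: the class 
names, method names, and the UpstreamCounts summary are all hypothetical), an 
eager rule and a lazy rule could share their triggering logic and differ only 
in when they report the {{upstream_failed}} state:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpstreamCounts:
    """Hypothetical summary of the states of a task's direct upstream tasks."""
    total: int
    successes: int = 0
    skipped: int = 0
    failed: int = 0
    upstream_failed: int = 0

    @property
    def done(self) -> int:
        # Number of upstream tasks that have reached a finished state.
        return self.successes + self.skipped + self.failed + self.upstream_failed


class AllSuccessRule:
    """Eager variant: a single failed upstream task is enough to mark the task."""

    def is_met(self, counts: UpstreamCounts) -> bool:
        # Triggering behavior: run only when every upstream task succeeded.
        return counts.successes == counts.total

    def state_to_set(self, counts: UpstreamCounts) -> Optional[str]:
        # Side effect of evaluation: which state, if any, to force on the task.
        if counts.failed or counts.upstream_failed:
            return "upstream_failed"
        return None


class LazyAllSuccessRule(AllSuccessRule):
    """Same triggering behavior, but marking waits for all upstream tasks."""

    def state_to_set(self, counts: UpstreamCounts) -> Optional[str]:
        if counts.done < counts.total:
            # Something upstream is still queued or running: do not mark yet.
            return None
        return super().state_to_set(counts)


# Task C from the example in the issue: A has failed, B is still running.
b_running = UpstreamCounts(total=2, failed=1)
# Later: B has finished successfully.
b_finished = UpstreamCounts(total=2, failed=1, successes=1)

eager_now = AllSuccessRule().state_to_set(b_running)
lazy_now = LazyAllSuccessRule().state_to_set(b_running)
lazy_later = LazyAllSuccessRule().state_to_set(b_finished)
```

With the eager rule, C is marked while B is still running; with the lazy rule, 
C is left alone until B finishes and is only then marked. The is_met check is 
inherited unchanged, so triggering itself is not altered.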

> lazy marking of upstream_failed task state
> ------------------------------------------
>
>                 Key: AIRFLOW-3285
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-3285
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Kevin McHale
>            Priority: Minor
>
> Airflow aggressively applies the {{upstream_failed}} task state: as soon as a 
> task fails, all of its downstream dependencies get marked.  This sometimes 
> creates problems for us at Etsy.
> In particular, we use a pattern for our Hadoop Airflow DAGs along these lines:
>  # the DAG creates a Hadoop cluster in GCP/Dataproc
>  # the DAG executes its tasks on the cluster
>  # the DAG deletes the cluster once all tasks are done
> There are some cases in which the tasks immediately upstream of the 
> cluster-delete step get marked as {{upstream_failed}}, triggering the 
> cluster-delete step, even while other tasks continue to execute without 
> problems on the cluster.  The cluster-delete step of course kills all of the 
> running tasks, requiring all of them to be re-run once the problem with the 
> failed task is mitigated.
> As an example, a DAG that looks like this can exhibit the problem:
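> {code:python}
> # Create the cluster, run jobs A and B on it, then C, then delete the cluster.
> Cluster = ClusterCreateOperator(...)
> A = Job1Operator(...)
> Cluster >> A
> B = Job2Operator(...)
> Cluster >> B
> C = Job3Operator(...)
> A >> C
> B >> C
> # all_done: runs once C is in any finished state, including upstream_failed
> ClusterDelete = DeleteClusterOperator(trigger_rule="all_done", ...)
> C >> ClusterDelete
> {code}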
> In a DAG like this, suppose task A fails while task B is running.  Task C 
> will immediately be marked as {{upstream_failed}}, which will cause 
> ClusterDelete to run while task B is still running, which will cause task B 
> to also fail.
> Our solution to this problem has been to implement something like [this 
> diff|https://github.com/mchalek/incubator-airflow/commit/585349018656cd9b2e3e3e113db6412345485dde],
>  which lazily applies the {{upstream_failed}} state only to tasks for which 
> all upstream tasks have already completed.
> The consequence in terms of the example above is that task C will not be 
> marked {{upstream_failed}} in response to task A failing until task B 
> completes, ensuring that the cluster is not deleted while any upstream tasks 
> are running.
> We have not seen any adverse effects from this on our Airflow instances, so 
> we run all of them with this lazy-marking feature enabled.  However, we 
> recognize that a change in behavior like this may be something that existing 
> users will want to opt in to, so we included a config flag in the diff that 
> defaults to the original behavior.
> We would appreciate your consideration of incorporating this diff, or 
> something like it, to allow us to configure this behavior in unmodified, 
> upstream Airflow.
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
