Re: The "no_status" state

Brent Bovenzi Thu, 19 Oct 2023 08:05:30 -0700

As Jarek said, some of these dependencies might take a lot of work to
surface correctly. But I am happy to improve the grid and graph to show
more information, such as integrating rendered_templates and more details
into the Grid view. Would you mind opening a GitHub issue for some of those
smaller tasks so I don't forget to do them?


I am also playing with some ways to better show datasets and other external
dependencies in the grid/graph view.

On Thu, Oct 19, 2023 at 10:48 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> I think it will be tricky to surface to the user all the reasons why a task
> is not run, but surfacing them is indeed a good idea. Currently this is
> only done by this FAQ entry, which lists possible reasons:
>
> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#why-is-task-not-getting-scheduled
>
> I believe this is not a complete list, given the number of features
> implemented since this FAQ was written.
>
> The open question, I think (and I agree with Jens' comment that this should
> be at the level of a small "AIP"), is which of those reasons we are able to
> detect deterministically. A bit of a problem here (as Jens also mentioned)
> is that in many cases the task in the DB is simply skipped by the scheduler
> for some of the reasons explained in the FAQ (and some not explained).
> Sometimes the task is not scheduled simply because the scheduler has not
> yet had a chance to look at it, for performance reasons. That's why I
> believe we do not really need a new status, but rather more automated
> analysis in the "more details" tab, run when the user specifically asks
> for it. That could give the user possible reasons for this particular
> task. It would be much better to do this at the "individual" task level,
> when a user asks "why is this particular task not scheduled", because then
> you could query the DB and figure it out. Recording and determining the
> information upfront might not be possible for performance reasons, simply
> because the scheduler never really looks at all possible tasks (that would
> be prohibitively expensive). Instead it effectively finds a subset of
> "good candidates to schedule", which is a much smaller set to run queries
> for.
>
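
A minimal sketch of the on-demand, per-task analysis described above. Everything here is hypothetical: `TaskSnapshot` and `possible_reasons` are illustrative names, not Airflow classes or APIs, and the upstream check assumes every upstream task must succeed. The point is only that the analysis runs for one task when a user asks, and returns candidate reasons rather than a definitive answer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskSnapshot:
    """Hypothetical point-in-time view of one task instance (not an Airflow class)."""

    upstream_states: dict[str, str] = field(default_factory=dict)
    depends_on_past: bool = False
    previous_run_state: str | None = None  # same task in the preceding DAG run
    pool_open_slots: int = 1


def possible_reasons(snap: TaskSnapshot) -> list[str]:
    """Return candidate explanations; the list may be stale or incomplete."""
    reasons = []
    # Assumption: every upstream task has to succeed before this one can run.
    unfinished = sorted(t for t, s in snap.upstream_states.items() if s != "success")
    if unfinished:
        reasons.append("upstream not successful: " + ", ".join(unfinished))
    if snap.depends_on_past and snap.previous_run_state != "success":
        reasons.append(f"depends_on_past: previous run is {snap.previous_run_state}")
    if snap.pool_open_slots <= 0:
        reasons.append("pool has no open slots")
    # Nothing found does not mean nothing is wrong: the scheduler may simply
    # not have looked at this task yet.
    return reasons or [
        "no blocking reason found; the scheduler may not have reached this task yet"
    ]
```

Because the snapshot is taken at request time, the answer can already be out of date when it is shown, which is the "racy" aspect discussed in the next paragraph.
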
> Some of that could be determined deterministically today, for example
> "upstream tasks are still running". Some of it might be a little "racy"
> though: because the system is continuously running, whatever caused the
> task to not be scheduled in the previous pass of the scheduler might no
> longer be valid (but there might still be other reasons). I think the
> difficult ones might require additional information recorded by the
> scheduler (for example, the scheduler recording the fact that it completed
> the last pass with dag runs still remaining to look at, or the fact that
> the number of tasks seen in the last pass reached the global concurrency
> limits). But some of this might not even be possible for the scheduler to
> determine without major query changes. For example, the scheduler runs its
> query taking pool sizes into account: the pool query simply selects "pool
> size" eligible tasks, so you have no idea whether more tasks were excluded
> from the result (nor which tasks they were). This is where looking at
> individual tasks and working "backwards", guessing why, might be needed.
> But possibly it could be helped by some extra information stored by the
> scheduler.
>
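
To illustrate the pool-query point above, here is a small sketch using an in-memory SQLite table. The table `ti`, its columns, and `open_slots` are stand-ins invented for this example and are not Airflow's schema or its actual scheduler query. It shows that a slot-limited query returns only the selected tasks, and that one extra count can at least reveal that some tasks were left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in table, not Airflow's schema.
conn.execute("CREATE TABLE ti (task_id TEXT, pool TEXT, priority INTEGER)")
conn.executemany(
    "INSERT INTO ti VALUES (?, ?, ?)",
    [("a", "default", 3), ("b", "default", 2), ("c", "default", 1)],
)

open_slots = 2  # hypothetical number of free slots in the pool

# A slot-limited query returns only the selected tasks; the result alone
# cannot say whether anything else was eligible.
picked = [
    row[0]
    for row in conn.execute(
        "SELECT task_id FROM ti WHERE pool = ? "
        "ORDER BY priority DESC, task_id LIMIT ?",
        ("default", open_slots),
    )
]

# Extra bookkeeping: one additional COUNT shows that tasks were left out,
# though still not which ones.
eligible = conn.execute(
    "SELECT COUNT(*) FROM ti WHERE pool = ?", ("default",)
).fetchone()[0]
left_out = eligible - len(picked)
```

The extra count costs another query per pool, which is the kind of trade-off between scheduler performance and explainability raised in this thread.
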
> I think we will not have a complete and fully accurate picture, but I think
> iteratively we could get this better and better.
>
> J
>
>
> On Mon, Oct 16, 2023 at 11:55 PM Oliveira, Niko
> <oniko...@amazon.com.invalid>
> wrote:
>
> > I really like this idea as well! One of _the most common_ questions I
> > get from people managing an Airflow env is "Why is my task stuck in state
> > X". Anything we can do to make that more discoverable and user friendly,
> > especially in the UI instead of (or in addition to) logs, would be
> > fantastic!
> >
> > Thanks to Jens for having a think and pointing out a lot of the
> > implications. I agree a quick AIP might be nice for this one.
> >
> > Cheers,
> > Niko
> >
> > ________________________________
> > From: Scheffler Jens (XC-DX/ETV5) <jens.scheff...@de.bosch.com.INVALID>
> > Sent: Thursday, September 28, 2023 10:36:00 PM
> > To: dev@airflow.apache.org
> > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state
> >
> >
> > Hi Ryan,
> >
> > I really like the idea of exposing some more scheduler details. More
> > transparency in scheduling, also in the UI, would (1) help the user see
> > and understand what is going on and (2) reduce the need to crawl through
> > logs and raise support tickets if a status looks “strange”. I often see
> > this as a problem as well. It also sometimes generates a bit of
> > “mistrust” in the scheduler's stability.
> >
> > In terms of scheduler “overhead”, I assume that as long as we are not
> > doing a “full scan” just to ensure that each and every task is always
> > up to date (today the scheduler stops processing after enough tasks have
> > been processed in a loop or when scheduling limits are reached), this is
> > OK for me, and on the code side it does not seem to be much overhead.
> > On the other hand, I am a bit afraid that very frequent updates would
> > need to happen on the DB, as another state would need to be written, so
> > more DB round trips would be needed. This might hit performance for large
> > DAGs or cases where DAGs are scheduled. So at the least it would need to
> > write the state to the DB only if it changed, to keep the performance
> > impact minimal.
> >
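
A minimal sketch of the "write only if changed" filter mentioned above. `StateStore` is a hypothetical stand-in for the metadata DB that just counts writes; it is not an Airflow class, and "not_ready" is only the state name proposed in this thread.

```python
class StateStore:
    """Hypothetical stand-in for the metadata DB; counts write round trips."""

    def __init__(self):
        self.states = {}
        self.writes = 0

    def set_state(self, ti_key, new_state):
        """Persist new_state only when it differs from the stored state."""
        if self.states.get(ti_key) == new_state:
            return False  # unchanged: skip the DB round trip
        self.states[ti_key] = new_state
        self.writes += 1
        return True


store = StateStore()
# Three scheduler passes reach the same conclusion about the same task;
# only the first pass results in a write.
for _ in range(3):
    store.set_state(("my_dag", "my_task", "run_1"), "not_ready")
```

In a real database the comparison would more likely be part of the UPDATE's WHERE clause, so that no separate read is needed; that variant is not shown here.
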
> > In terms of naming, I still think “no status” is good to indicate that
> > the scheduler did not digest anything; maybe the task was never looked at
> > because the scheduler is actually stuck or too busy to get there. I
> > would propose that if the scheduler passes over a task and decides that
> > it is not ready to schedule, there be an additional state called e.g.
> > “not_ready” in the state model between “none” and “scheduled”.
> >
> > Finally, on the other hand, I am not sure whether adding another state
> > to the model will help 100% in the use case you described. You might
> > still need to scratch your head for a while when looking at the UI and
> > seeing that a DAG is “stuck”, until you recall all the options you have
> > configured. Exposing “why is it stuck” in a user-friendly manner might
> > be another level of complexity in this case.
> >
> > As changing the state model might touch a lot of code and a longer
> > discussion might be needed, would this need an AIP to be raised? There
> > might be many more (external, provider??) dependencies affected by
> > adjusting the state model.
> >
> > Best regards
> >
> > Jens Scheffler
> >
> > Deterministik open Loop (XC-DX/ETV5)
> > Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen |
> > GERMANY | www.bosch.com<http://www.bosch.com>
> > Tel. +49 711 811-91508 | Mobil +49 160 90417410 |
> > jens.scheff...@de.bosch.com<mailto:jens.scheff...@de.bosch.com>
> >
> > 
> > From: Ryan Hatter <ryan.hat...@astronomer.io.INVALID>
> > Sent: Thursday, 28 September 2023 23:59
> > To: dev@airflow.apache.org
> > Subject: The "no_status" state
> >
> > Over the last couple of weeks I've come across a rather tricky problem a
> > few times. One DAG run gets "stuck" in the queued state, while subsequent
> > DAG runs are stuck running (screenshot below). One of these issues was
> > caused by `max_active_runs` being met when a task instance from a
> > previous DAG run was cleared, and one of the tasks had
> > `depends_on_past=True`. This caused the DAG run to be stuck in queued in
> > perpetuity, until it was realized that the task that wasn't getting
> > scheduled needed the failed task in the preceding DAG run to be re-run.
> > That in turn kept the later DAG runs stuck in running, which caused quite
> > a bit of confusion and stress.
> >
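
A toy model of one reading of the situation described above, in plain Python rather than Airflow code. The names `MAX_ACTIVE_RUNS`, `runs`, `run_can_start` and `task_can_start` are made up for illustration. The cleared run sits in queued because the active-run limit is already met, and the running runs wait on the preceding run's task because of `depends_on_past=True`, so nothing can move.

```python
MAX_ACTIVE_RUNS = 2  # hypothetical limit for this illustration

# run_id -> (dag_run_state, state of the depends_on_past task in that run)
# Run 1 had its task cleared and went back to queued; runs 2 and 3 are running.
runs = {1: ("queued", None), 2: ("running", None), 3: ("running", None)}


def run_can_start(run_id):
    """A queued run may start only while fewer than MAX_ACTIVE_RUNS are running."""
    active = sum(1 for state, _ in runs.values() if state == "running")
    return runs[run_id][0] == "queued" and active < MAX_ACTIVE_RUNS


def task_can_start(run_id):
    """With depends_on_past, the task needs the preceding run's task to succeed."""
    previous = runs.get(run_id - 1)
    in_running_run = runs[run_id][0] == "running"
    return in_running_run and (previous is None or previous[1] == "success")


# Nothing can make progress: run 1 cannot start, and no task is runnable.
stuck = not run_can_start(1) and not any(task_can_start(r) for r in runs)
```

In this model `stuck` evaluates to `True`, and none of the individual states says why, which is the gap the proposal below is about.
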
> > Given that Airflow is pretty burnt out on task instance states and
> > colors, I propose replacing "no_status" with "dependencies_not_met" and
> > surfacing dependencies in the grid view instead of forcing users to
> > already know where to look (i.e. the "more details" task instance
> > details). Now that I've typed it out, I'm not sure there is a reason to
> > keep the "more details" button rather than laying out all of a task
> > instance's details in the grid view, similar to how the graph and code
> > views are now included in the grid view.
> >
> > Anyway, I wanted to solicit feedback before I open an issue / start work
> > on this.
> >
> > [cid:ii_ln3phzoe0]
> >
>
