As Jarek said, some of these dependencies might take a lot of work to surface correctly. But I am happy to improve the grid and graph to show more information, like integrating rendered_templates and more details into the Grid view. Would you mind opening a GitHub issue for some of those smaller tasks so I don't forget to do them?
I am also experimenting with ways to better show datasets and other external dependencies in the grid/graph views.

On Thu, Oct 19, 2023 at 10:48 AM Jarek Potiuk <ja...@potiuk.com> wrote:

> I think it will be tricky to surface to the user all the reasons why a task is not run, but surfacing them is indeed a good idea. Currently this is only done by this FAQ response, which shows possible reasons:
>
> https://airflow.apache.org/docs/apache-airflow/stable/faq.html#why-is-task-not-getting-scheduled
>
> I believe this is not a complete list, given the number of features implemented since this FAQ was written.
>
> The open question, I think (and I agree with Jens's comment that this should be a small "AIP"-level change), is which of those reasons we are able to detect deterministically. A bit of a problem here is (as Jens also mentioned) that in many cases the task in the DB is simply skipped by the scheduler for one of the reasons explained in the FAQ (and some that are not explained). Sometimes the task will simply not be scheduled because the scheduler has not yet had a chance to look at it, for performance reasons. That's why I believe we do not really need a new status, but rather more automated analysis in the "more details" tab, when the user specifically asks for it. That could give the user possible reasons for this particular task. It would be much better to do this at the "individual" task level, when the user asks "why is this particular task not scheduled", because then you could query the DB and figure it out. Recording and determining the information upfront might not be possible for performance reasons, simply because the scheduler never really looks at all possible tasks (that would be prohibitively expensive). Instead, it effectively finds a subset of "good candidates to schedule", which is a much smaller set to run queries for.
>
> Some of that could be deterministically determined today.
> For example, "upstream tasks are still running". Some of that might be a little "racy", though: because the system is continuously running, whatever caused the task not to be scheduled in the previous scheduler pass might no longer be valid (but there might still be other reasons). I think the difficult ones might require additional information recorded by the scheduler (for example, the scheduler recording the fact that it completed the last pass with DAG runs still remaining to look at, or the fact that the number of tasks seen in the last pass reached the global concurrency limits). But some of this might not even be possible for the scheduler to determine without some major query changes. For example, the scheduler runs the query including pool size: the pool query simply selects "pool size" eligible tasks, and you have no idea whether more tasks were excluded from the result (nor which tasks they were). This is where looking at individual tasks and working "backwards", guessing why, might be needed. But possibly it could be helped with some extra information stored by the scheduler.
>
> I think we will not have a complete and fully accurate picture, but I think iteratively we could make this better and better.
>
> J
>
> On Mon, Oct 16, 2023 at 11:55 PM Oliveira, Niko <oniko...@amazon.com.invalid> wrote:
>
> > I really like this idea as well! One of _the most common_ questions I get from people managing an Airflow env is "Why is my task stuck in state X". Anything we can do to make that more discoverable and user friendly, especially in the UI instead of (or in addition to) logs, would be fantastic!
> >
> > Thanks to Jens for having a think and pointing out a lot of the implications. I agree a quick AIP might be nice for this one.
> >
> > Cheers,
> > Niko
> >
> > ________________________________
> > From: Scheffler Jens (XC-DX/ETV5) <jens.scheff...@de.bosch.com.INVALID>
> > Sent: Thursday, September 28, 2023 10:36:00 PM
> > To: dev@airflow.apache.org
> > Subject: RE: [EXTERNAL] [COURRIEL EXTERNE] The "no_status" state
> >
> > Hi Ryan,
> >
> > I really like the idea of exposing some more scheduler details. More transparency in scheduling, also in the UI, would help the user by (1) letting them see and understand what is going on and (2) reducing the need to crawl through logs and raise support tickets if a status is “strange”. I often see this as a problem too. It also sometimes generates a bit of “mistrust” in the scheduler's stability.
> >
> > From the point of view of scheduler “overhead”, I assume that as long as we are not doing a “full scan” just to ensure that each and every task is always up to date (today the scheduler stops processing after enough tasks have been processed in a loop or when scheduling limits are reached), this is OK for me, and on the code side it does not seem to be much overhead.
> > On the other hand, I have a bit of fear that very many frequent updates would need to happen on the DB, as another state would need to be written, so more DB round trips would be needed. This might hurt performance for large DAGs or cases where DAGs are scheduled. So at the least it would need to filter, writing the state to the DB only if it changed, to keep the performance impact minimal.
> >
> > From the point of view of naming, I still think “no status” is good to indicate that the scheduler did not digest anything; maybe the task was never looked at because the scheduler is actually stuck or too busy to get there. I would propose that, if the scheduler passes over a task and decides that it is not ready to schedule, there be an additional state called e.g. “not_ready” in the state model between “none” and “scheduled”.
> >
> > Finally, on the other hand, I am not sure whether adding another state to the model will 100% help in the use case you described. You might still need to scratch your head for a while when looking at the UI and seeing that a DAG is “stuck”, until you recall all the options you have configured. Exposing a “why is it stuck” in a user-friendly manner might be another level of complexity in this case.
> >
> > As the state model might touch a lot of code and a longer discussion might be needed, would it be necessary to raise an AIP for this? There might be a lot more (external, provider??) dependencies affected by adjusting the state model?
> >
> > Best regards
> >
> > Jens Scheffler
> >
> > Deterministik open Loop (XC-DX/ETV5)
> > Robert Bosch GmbH | Hessbruehlstraße 21 | 70565 Stuttgart-Vaihingen | GERMANY | www.bosch.com
> >
> > From: Ryan Hatter <ryan.hat...@astronomer.io.INVALID>
> > Sent: Thursday, 28.
> > September 2023 23:59
> > To: dev@airflow.apache.org
> > Subject: The "no_status" state
> >
> > Over the last couple of weeks I've come across a rather tricky problem a few times. One DAG run gets "stuck" in the queued state, while subsequent DAG runs are stuck running (screenshot below). One of these issues was caused by `max_active_runs` being met when a task instance from a previous DAG run was cleared, and one of the tasks had `depends_on_past=True`. This caused the DAG run to be stuck in queued in perpetuity, until it was realized that the task that wasn't getting scheduled needed the failed task in the preceding DAG run to be re-run, which in turn caused the stuck running DAG runs to be stuck in running. This caused quite a bit of confusion and stress.
> >
> > Given that Airflow is pretty burnt out on task instance states and colors, I propose replacing "no_status" with "dependencies_not_met" and surfacing dependencies in the grid view instead of forcing users to already know where to look (i.e. "more details" in the task instance details). Now that I've typed it out, I'm not sure there is a reason for the "more details" button rather than just laying out all of a task instance's details in the grid view, similar to how the graph and code views are now included in the grid view.
> >
> > Anyway, I wanted to solicit feedback before I open an issue / start work on this.
> >
> > [screenshot omitted]
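
For illustration only, here is a minimal sketch of the per-task "working backwards" analysis discussed in this thread: given a snapshot of one unscheduled task, list the possible reasons it is not scheduled. It is plain Python and does not use any Airflow API. `TaskSnapshot`, its fields, `possible_reasons`, and `NO_REASON` are hypothetical names invented for this example; a real implementation would have to query the metadata DB, and the answer could still be racy or incomplete, as noted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical fallback message; not an Airflow string.
NO_REASON = "no blocking reason found; the scheduler may not have looked at this task yet"


@dataclass
class TaskSnapshot:
    """Assumed, simplified view of one unscheduled task instance.

    These fields are NOT Airflow model attributes; they stand in for
    whatever a per-task DB query would return.
    """
    upstream_states: List[str] = field(default_factory=list)
    depends_on_past: bool = False
    # State of the same task in the preceding DAG run; None means no preceding run.
    previous_run_state: Optional[str] = None
    active_dag_runs: int = 0
    max_active_runs: int = 16
    pool_open_slots: int = 1


def possible_reasons(task: TaskSnapshot) -> List[str]:
    """Return candidate reasons why the task is not scheduled."""
    reasons = []
    # Upstream tasks that have not finished yet block this task.
    if any(s in ("running", "queued", "scheduled") for s in task.upstream_states):
        reasons.append("upstream tasks are still running")
    # With depends_on_past, the same task in the preceding run must have finished OK.
    if task.depends_on_past and task.previous_run_state not in (None, "success", "skipped"):
        reasons.append(
            "depends_on_past: task in preceding DAG run is " + task.previous_run_state
        )
    # The DAG already has as many active runs as it is allowed.
    if task.active_dag_runs >= task.max_active_runs:
        reasons.append("max_active_runs reached")
    # The task's pool has no free slot.
    if task.pool_open_slots <= 0:
        reasons.append("pool has no open slots")
    return reasons or [NO_REASON]


if __name__ == "__main__":
    # Roughly the situation from the original message: depends_on_past=True,
    # a failed task in the preceding run, and max_active_runs already met.
    stuck = TaskSnapshot(
        upstream_states=["success"],
        depends_on_past=True,
        previous_run_state="failed",
        active_dag_runs=3,
        max_active_runs=3,
    )
    for reason in possible_reasons(stuck):
        print(reason)
```

The checks run only when someone asks about a specific task, which matches the suggestion to do this analysis on demand in the task details rather than recording it for every task on each scheduler pass.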