Responses inline.

On Wed, Oct 16, 2019 at 6:07 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Thanks for the feedback and discussion everyone - it's nice to know people
> feel strongly about this. Let's make sure we build this right!
>
> As a reminder: the main goals of this AIP:
>
> - To speed up bootup/worker recycle time of the webserver
>
>   If you have a large number of DAGs, or highly dynamic DAGs, it is possible
> that starting your webserver needs a Gunicorn worker timeout of 5 or 10
> minutes. (It is worse at start-up because each of the Gunicorn workers parses
> the DAGs itself, in exactly the same order and at the same time as the other
> workers, leading to disk contention and CPU bursts.)
>
>   And just yesterday I helped two people diagnose a problem that was
> caused by slow NFS: `time airflow list_dags` took 3m35s to go over 43
> dags.
>
>   At Astronomer and Google (Cloud Composer) we've seen this slow loading
> many times in our customers' installs.
>
>   Not to mention that this makes the webserver snappier.
>
> - To make the webserver almost "stateless" (not the best term for it:
> Diskless perhaps? Dagbag-less?).
>
>   This makes actions on the webserver quicker. We had a number of PRs in
> the 1.10 series that made views and API endpoints in the webserver quicker
> by making them only load one DAG from disk. This takes it further and means
> that (as of the PR) only the Rendered tab needs to go to the code on disk.
> As a reminder, the "trigger_dag" endpoint would previously load all DAGs;
> others still do. Multiple people have complained about this on Slack.
>
>   (The only remaining view that has to hit the code on disk is the
> Rendered template tab, as right now a DAG can define custom macros or
> filters, and we don't want to have to deal with serializing code. The only
> "real" way of doing it is CloudPickle. No thank you)
>
>
> To address the points:
>
>
> On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl> wrote:
> > - Are we going to extend the existing data model, to allow the RDBMS to
> > optimize queries on fields that we use a lot?
>
> Yes, as needed. Not really needed for this PR yet (it doesn't change any
> query patterns in the webserver, so they keep using the existing tables).
>
> > - How are we going to do state evolution when we extend the JSON model
> >
>
> The top-level object we're storing has a __version field (for example
> `{"__version": 1, "dag": { ... } }`), so we can detect older versions and
> either run a db "migration" on install, or upgrade at load time.
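>
> Roughly, the load-time path could look something like this (a sketch only;
> the names here are illustrative, not the actual code in the PR):
>
>     CURRENT_VERSION = 1
>
>     def _upgrade_v0_to_v1(blob: dict) -> dict:
>         # Hypothetical migration step: wrap a bare dag dict in the
>         # versioned envelope used from v1 onwards.
>         return {"__version": 1, "dag": blob}
>
>     UPGRADERS = {0: _upgrade_v0_to_v1}
>
>     def load_serialized_dag(blob: dict) -> dict:
>         """Return the 'dag' payload, upgrading older records in memory."""
>         version = blob.get("__version", 0)
>         if version > CURRENT_VERSION:
>             raise ValueError("Serialized DAG is newer than this Airflow")
>         while version < CURRENT_VERSION:
>             blob = UPGRADERS[version](blob)
>             version = blob["__version"]
>         return blob["dag"]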
>
>
>
> On 15 Oct 2019, at 20:04, Dan Davydov <ddavy...@twitter.com.invalid>
> wrote:
> > Having both a SimpleDagBag representation and the JSON representation
> > doesn't make sense to me at the moment: "Quoting from Airflow code, it
> > is “a simplified representation of a DAG that contains all attributes
> > required for instantiating and scheduling its associated tasks.”. It does
> > not contain enough information required by the webserver." Why not
> > create a representation that can be used by both? This is going to be a
> > big headache to both understand and work with in the codebase since it
> > will be another definition that we need to keep in sync.
>
> Honestly: because that makes the change bigger, harder to review, and
> harder to backport. If this is a serious blocker to you agreeing, we can
> update it to use the same structure as part of this PR, but I would rather
> it was a separate change.
>
I think this is the kind of core change we should make from the get-go; my
feeling is that we will be paying for this decision otherwise, and the cost
will surpass the immediate value. Ideally we would fully understand the
long-term plan before deciding on this serialization format, but I think at
least having a single serialized representation for now would be a decent
compromise for me.

This is still a blocker to me agreeing.


>
> > Not sure if fileloc/fileloc_hash is the right solution; the long-term
> > solution I am imagining has clients responsible for uploading DAGs rather
> > than retrieving them from the filesystem, so fileloc/fileloc_hash wouldn't
> > even exist (dag_id would be used for indexing here).
>
> Then we can remove it later, when that code exists. For now, so much of the
> rest of the code base assumes DAGs come from files on disk.
>
>
>
> > Versioning isn't really addressed either (e.g. if a DAG topology changes
> > with some update you want to be able to show both the old and new ones, or
> > at least have a way to deal with them). There is an argument that this is
> > acceptable since it isn't addressed now either, but I'm worried that
> > large schema changes should think through the long-term plan a bit more.
>
> Correct, and this is by design, to make the change smaller and easier to
> review. Adding versioning using the existing serialization format is not
> that much work once the groundwork is here and the webserver is not
> relying on DAGs from disk. There's also a UI question of how to handle this
> on the Tree view that I don't have an immediate answer for. But yes,
> versioning of dag run history was on our minds when working on this PR.
>
As long as all of the changes are in the camp of "easy to migrate/undo in
the future" then these are not blockers for me. I'm not 100% convinced that
they are, but you have more context so I'll defer to your judgment.


> > I feel like getting this wrong is going to make it very hard to migrate
> > things in the future, and make the codebase worse (another representation
> > of DAGs that we need to support/understand/keep parity for). If I'm wrong
> > about this then I would be more willing to +1 this change. This doc is a
> > 1-2 pager and I feel like it is not thorough or deep enough and doesn't
> > give me enough confidence that the work in the PR is going to make it
> > easier to complete the future milestones instead of harder.
>
> "But what if we get it wrong" is not something we can act on. So long as
> you are happy with the choice of JSON as the serialization then since we've
> got a version in the record we can change this relatively easily in the
> future - we'll know when we're dealing with an older record (we won't have
> to guess) and can write a format migrator (either to upgrade in memory, or
> to upgrade the record in place. And as a db "migration" or as a upgrade
> command)
>
I'm happy with JSON. My main point about "what if we get it wrong" is that
I feel like more planning work could have been put into this project versus
just a couple of rough high-level target "milestones". This significantly
increases the long-term risk of the project in my eyes.


>
> We've based our serialization on Uber's Piper work
> <https://eng.uber.com/managing-data-workflows-at-scale/> (we had a call
> with Alex at Uber before starting any of this work), so we've piggybacked
> off their hard work and months of running something very close to this
> format in production.


>
This is great to hear!


> Also let's not forget: right now we aren't storing versions, and we still
> have DAGs on disk. If we do get it massively wrong (which I don't think we
> have) we can just drop this entire table and start again. This is all an
> internal implementation detail that is not exposed via an API. What better
> way to see if this is right than by running it at scale in production
> across 1000s of customers (Astronomer + Cloud Composer)?
>
If we have thought through future milestones and made sure that we can
easily switch to different formats that could be required for them, then I'm
not a blocker on this front.


> Can you be more specific about what you are worried about here? It's a bit
> hard to put general "but what if we get it wrong" fears to rest.
>
>
>
> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com.invalid>
> wrote:
> > We don't need to have the future plan implemented
> > completely but it would be nice to see more detailed notes about how this
> > will play out in the future.
>
> Fair point. Our rough long-term plan:
>
> - Add versioning to the UI, likely by storing a new version linked from
> the DagRun (but only when it changes, not every time, to avoid DB bloat -
> multiple dag_run rows will point at the same serialized blob; a rough
> sketch of what this could look like follows this list).
>
> - Rework the scheduler to use the serialized version, not the Simple*
> version, assuming we can do it and keep the scheduler performing well
> (i.e. that ser/deser time isn't significant for the scheduler loop).
>
>
> - Massively rework how the scheduler creates TaskInstances (right now this
> is done in the DAG parsing process, not the main scheduler) and dag runs. We
> have to keep an eye on scheduler performance as we do this. If we can pull
> this back into the main scheduler then this leads us towards the next
> points:
>
> - Split out the updating of serialized dag/parsing of DAGs from the
> scheduler loop so it can be a separate component/subprocess (which is on by
> default). This gets us towards being able to submit DAG definitions via an
> API if we wanted. (There's a lot to work out to get that working, mostly
> around running the DAG).
>
> - Once the scheduler loop/logic is not tied to parsing the DAGs, this
> also heads us towards being able to run multiple schedulers. (This isn't a
> requirement for an HA scheduler, but it makes them easier to "scale out".)
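>
> One possible shape for the "version only when it changes" data model
> (purely illustrative - none of this is in the current PR, and the table and
> column names are made up for the example):
>
>     # Hypothetical SQLAlchemy sketch: dag_run rows reference the serialized
>     # blob by content hash, so many runs can share a single version row.
>     from sqlalchemy import Column, String, Text
>     from sqlalchemy.ext.declarative import declarative_base
>
>     Base = declarative_base()
>
>     class SerializedDagVersion(Base):
>         __tablename__ = "serialized_dag_version"
>         dag_id = Column(String(250), primary_key=True)
>         dag_hash = Column(String(32), primary_key=True)  # hash of the JSON blob
>         data = Column(Text, nullable=False)               # the JSON document
>
>     class DagRunSketch(Base):
>         __tablename__ = "dag_run_sketch"
>         run_id = Column(String(250), primary_key=True)
>         dag_id = Column(String(250), nullable=False)
>         # Hash of the serialized DAG that was current when this run was
>         # created; multiple runs can point at the same blob.
>         dag_hash = Column(String(32), nullable=True)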
>
> Hope this answers people's questions, and thanks for the feedback.
>
> Alex and Dan: Does this give enough information for you to change your
> vote?
>
I think as long as we combine the serialized DAG representations and have
some confidence that the data format will be easy to switch to potential
future formats we might want (kind of hard to do without a more detailed
future plan...), then I would happily change my vote.


> -ash
>
> > On 15 Oct 2019, at 21:14, Driesprong, Fokko <fo...@driesprong.frl>
> > wrote:
> >
> > Big +1 from my side, looking forward to making this happen.
> >
> > Two sides that aren't completely clear to me:
> >
> >   - Are we going to extend the existing data model, to allow the RDBMS to
> >   optimize queries on fields that we use a lot?
> >   - How are we going to do state evolution when we extend the JSON model
> >
> > I have good confidence that we'll solve this along the way.
> >
> > Cheers, Fokko
> >
> > Op di 15 okt. 2019 om 21:29 schreef Dan Davydov
> > <ddavy...@twitter.com.invalid>:
> >
> >> I have been following it from the beginning as well. I understand there
> >> would be short-term wins for some users (I don't think a huge amount of
> >> users?), but I still feel like we are being a bit short-sighted here and
> >> that we are creating more work for ourselves and potentially our users
> in
> >> the future. I also feel like there will be side effects to users as
> well,
> >> many of which don't care about the webserver scalability, such as bugs
> >> caused by the addition of the new webserver representation. I think
> without
> >> a design that is much larger in scope I wouldn't feel comfortable moving
> >> forward with this AIP.
> >>
> >> On Tue, Oct 15, 2019 at 3:21 PM Jarek Potiuk <jarek.pot...@polidea.com>
> >> wrote:
> >>
> >>> Hello Dan, Alex,
> >>>
> >>> I believe all the points you make are super-valid ones. But maybe you
> are
> >>> missing the full context a bit.
> >>>
> >>> I followed the original discussion
> >>> <https://lists.apache.org/thread.html/a2d426f93c0f4e5f0347627308638b59ca4072fd022a42af1163e34a@%3Cdev.airflow.apache.org%3E>
> >>> from the very beginning and took part in the initial discussions when
> >> this
> >>> topic was raised. From the discussion it is quite clear to me that this
> >>> is mostly a "tactical" approach to implement something that is
> backportable
> >> to
> >>> 1.10 and rather quick to implement. This is targeted to make users more
> >>> happy with their 1.10 version without the timing uncertainty and effort
> >> of
> >>> migration to 2.0. It solves the major pain point of stability of the UI
> >> in
> >>> case there are complex DAGs for which parsing crashes the webserver.
> >>> Like in "being nice to your users".
> >>>
> >>> There will be a separate effort to make pretty much all of the things
> you
> >>> mentioned in 2.0 in a non-backportable way as it requires far too many
> >>> changes in the way how Airflow works internally.
> >>>
> >>> Maybe it needs some more explanation + long term plan that follows in
> the
> >>> AIP itself to explain it to those who have not followed the initial
> >>> discussion, but I think it's fully justified change.
> >>>
> >>> J.
> >>>
> >>> On Tue, Oct 15, 2019 at 9:10 PM Alex Guziel <alex.guz...@airbnb.com
> >>> .invalid>
> >>> wrote:
> >>>
> >>>> -1 (binding)
> >>>> Good points made by Dan. We don't need to have the future plan
> >>> implemented
> >>>> completely but it would be nice to see more detailed notes about how
> >> this
> >>>> will play out in the future. We shouldn't walk into a system that
> >> causes
> >>>> more pain in the future. (I can't say for sure that it does, but I
> >> can't
> >>>> say that it doesn't either). I don't think the proposal is necessarily
> >>>> wrong or bad, but I think we need some more detailed planning around
> >>> future
> >>>> milestones.
> >>>>
> >>>> On Tue, Oct 15, 2019 at 12:04 PM Dan Davydov <ddavy...@twitter.com.invalid>
> >>>> wrote:
> >>>>
> >>>>> -1 (binding), this may sound a bit FUD-y but I don't feel this has
> >> been
> >>>>> thought through enough...
> >>>>>
> >>>>> Having both a SimpleDagBag representation and the JSON representation
> >>>>> doesn't make sense to me at the moment: *"**Quoting from Airflow
> >> code,
> >>> it
> >>>>> is “a simplified representation of a DAG that contains all attributes
> >>>>> required for instantiating and scheduling its associated tasks.”. It
> >>> does
> >>>>> not contain enough information required by the webserver.". *Why not
> >>>> create
> >>>>> a representation that can be used by both? This is going to be a big
> >>>>> headache to both understand and work with in the codebase since it
> >> will
> >>>> be
> >>>>> another definition that we need to keep in sync.
> >>>>>
> >>>>> Not sure if fileloc/fileloc_hash is the right solution, the longterm
> >>>>> solution I am imagining has clients responsible for uploading DAGs
> >>> rather
> >>>>> than retrieving them from the filesystem so fileloc/fileloc_hash
> >>> wouldn't
> >>>>> even exist (dag_id would be used for indexing here).
> >>>>>
> >>>>> Versioning isn't really addressed either (e.g. if a DAG topology
> >>> changes
> >>>>> with some update you want to be able to show both the old and new
> >> ones,
> >>>> or
> >>>>> at least have a way to deal with them), there is an argument that
> >> this
> >>> is
> >>>>> acceptable since it isn't currently addressed now, but I'm worried
> >> that
> >>>>> large schema changes should think through the long term plan a bit
> >>> more.
> >>>>>
> >>>>> I feel like getting this wrong is going to make it very hard to
> >> migrate
> >>>>> things in the future, and make the codebase worse (another
> >>> representation
> >>>>> of DAGs that we need to support/understand/keep parity for). If I'm
> >>> wrong
> >>>>> about this then I would be more willing to +1 this change. This doc
> >> is
> >>> a
> >>>>> 1-2 pager and I feel like it is not thorough or deep enough and
> >> doesn't
> >>>>> give me enough confidence that the work in the PR is going to make it
> >>>>> easier to complete the future milestones instead of harder.
> >>>>>
> >>>>> On Tue, Oct 15, 2019 at 11:26 AM Kamil Breguła <kamil.breg...@polidea.com>
> >>>>> wrote:
> >>>>>
> >>>>>> +1 (binding)
> >>>>>>
> >>>>>> On Tue, Oct 15, 2019 at 2:57 AM Kaxil Naik <kaxiln...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>>> Hello, Airflow community,
> >>>>>>>
> >>>>>>> This email calls for a vote to add the DAG Serialization feature
> >> at
> >>>>>>> https://github.com/apache/airflow/pull/5743.
> >>>>>>>
> >>>>>>> *AIP*:
> >>>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-24+DAG+Persistence+in+DB+using+JSON+for+Airflow+Webserver+and+%28optional%29+Scheduler
> >>>>>>>
> >>>>>>> *Previous Mailing List discussion*:
> >>>>>>> https://lists.apache.org/thread.html/65d282368e0a7c19815badb8b1c6c8d72b0975ce94f601e13af44f74@%3Cdev.airflow.apache.org%3E
> >>>>>>>
> >>>>>>> *Authors*: Kaxil Naik, Zhou Fang, Ash-Berlin Taylor
> >>>>>>>
> >>>>>>> *Summary*:
> >>>>>>>
> >>>>>>>   - DAGs are serialized using JSON format and stored in a
> >>>>>>>   SerializedDag table
> >>>>>>>   - The Webserver now, instead of having to parse the DAG file
> >>>>>>>   again, reads the serialized DAGs in JSON, de-serializes them,
> >>>>>>>   creates the DagBag and uses it to show in the UI.
> >>>>>>>   - Instead of loading an entire DagBag when the WebServer starts,
> >>>>>>>   we only load each DAG on demand from the Serialized Dag table.
> >>>>>>>   This helps reduce Webserver startup time and memory. The
> >>>>>>>   reduction is notable when you have a large number of DAGs.
> >>>>>>>   - A JSON Schema has been defined and we validate the serialized
> >>>>>>>   dag before writing it to the database
> >>>>>>>
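> >>>>>>> Roughly, the write path is serialize -> validate -> store. A very
> >>>>>>> simplified sketch (the schema below is a toy one for illustration;
> >>>>>>> the real schema lives in the PR):
> >>>>>>>
> >>>>>>>     import json
> >>>>>>>     import jsonschema  # third-party "jsonschema" package
> >>>>>>>
> >>>>>>>     # Toy schema, for illustration only; the real one is much fuller.
> >>>>>>>     DAG_SCHEMA = {
> >>>>>>>         "type": "object",
> >>>>>>>         "required": ["__version", "dag"],
> >>>>>>>         "properties": {
> >>>>>>>             "__version": {"type": "integer"},
> >>>>>>>             "dag": {
> >>>>>>>                 "type": "object",
> >>>>>>>                 "required": ["dag_id", "tasks"],
> >>>>>>>             },
> >>>>>>>         },
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     def serialize_and_validate(dag_dict: dict) -> str:
> >>>>>>>         record = {"__version": 1, "dag": dag_dict}
> >>>>>>>         jsonschema.validate(instance=record, schema=DAG_SCHEMA)
> >>>>>>>         # The resulting string is what gets written to the
> >>>>>>>         # SerializedDag table row.
> >>>>>>>         return json.dumps(record)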
> >>>>>>>
> >>>>>>> A PR (https://github.com/apache/airflow/pull/5743) is ready for
> >>>>>>> review from the committers and community.
> >>>>>>>
> >>>>>>> We also have a WIP PR (https://github.com/apache/airflow/pull/5992)
> >>>>>>> to backport this feature to the 1.10.* branch.
> >>>>>>>
> >>>>>>> A big thank you to Zhou and Ash for their continuous help in
> >>>> improving
> >>>>>>> this feature/PR.
> >>>>>>>
> >>>>>>> This email is formally calling for a vote to accept the AIP and PR.
> >>>>>>> Please note that we will update the PR / feature to fix bugs if we
> >>>>>>> find any.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Kaxil
> >>>>>>>
> >>>
> >>> --
> >>>
> >>> Jarek Potiuk
> >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> >>>
> >>> M: +48 660 796 129 <+48660796129>
> >>>
> >>
>
>
