Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

Jorge Cardoso Leitão Tue, 16 Mar 2021 10:22:21 -0700

Hi,

I agree.


I skimmed through the code and it looks like a solid addition to me, so, again,
thank you so much for donating it to Arrow.

I very much look forward to learning more about how to build a scheduler that
can adapt to both in-node and across-node execution :-)
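
As a rough illustration of that idea (a sketch only, not Ballista's or
DataFusion's actual code; every class and function name below is hypothetical,
and Python is used just for brevity), the scheduling loop can stay the same
while a pluggable backend decides whether tasks run on local threads or are
handed to other nodes:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Tuple


class InNodeBackend:
    """Hypothetical backend: runs tasks on a local thread pool (across cores)."""

    def __init__(self, workers: int = 4) -> None:
        self.workers = workers

    def run(self, task: Callable[[Any], Any], inputs: List[Any]) -> List[Any]:
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves input order in its results.
            return list(pool.map(task, inputs))


class AcrossNodeBackend:
    """Hypothetical stand-in for remote executors.

    Tasks are assigned round-robin to named nodes. A real system would ship
    the task over the network; here the task runs in-process and only the
    assignment is recorded, so the sketch stays runnable.
    """

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = nodes
        self.log: List[Tuple[str, int]] = []  # (node, task index) per task

    def run(self, task: Callable[[Any], Any], inputs: List[Any]) -> List[Any]:
        results = []
        for i, item in enumerate(inputs):
            self.log.append((self.nodes[i % len(self.nodes)], i))
            results.append(task(item))
        return results


def run_query(backend: Any, partitions: List[List[int]]) -> int:
    """Two-stage sum: the scheduler logic does not care which backend it gets."""
    # Stage 1: one independent task per partition (partial aggregate).
    partials = backend.run(sum, partitions)
    # Stage 2: a single task that combines the partial results.
    return backend.run(sum, [partials])[0]


data = [[1, 2, 3], [4, 5], [6]]
local_total = run_query(InNodeBackend(workers=2), data)
remote = AcrossNodeBackend(["node-a", "node-b"])
remote_total = run_query(remote, data)
```

Both calls go through the same `run_query`; only the backend differs, which is
the property I would hope a shared scheduler could offer.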

Best,
Jorge


On Tue, Mar 16, 2021 at 3:21 PM Andy Grove <andygrov...@gmail.com> wrote:

> Thank you for all the responses so far. Based on this thread and the
> conversations happening in the Ballista project, I would say that the
> feedback is mostly positive and supportive of this donation, so I have
> started work on a PR [1] and will start a VOTE email thread once the PR is
> ready for review. Assuming that the vote passes, we will need to go through
> the IP clearance process.
>
> To summarize the feedback so far (at least, my interpretation of it):
>
>    - There is consensus that it makes sense for DataFusion and Ballista to
>      live in the same repository so that we can keep them tightly coupled and
>      have queries scale across cores in DataFusion and across nodes in
>      Ballista using the same scheduler.
>    - There is a desire to have a separate release process for DataFusion and
>      Ballista that is not aligned with core Arrow releases, and it appears
>      that there is no reason we cannot achieve this with Ballista living in
>      the Arrow repo.
>    - There is a desire to have DataFusion and Ballista eventually live in a
>      new repo, separate from core Arrow. There is no objection to still being
>      under Arrow governance. If this donation goes through, I expect we will
>      be discussing this point again at some point in the future, when
>      Ballista is more mature.
>    - There is some concern that this will add an additional burden to current
>      Arrow maintainers. I think we can reduce this burden by not requiring
>      that Ballista depend on the HEAD version of DataFusion/Arrow i.e. we
>      start doing some dependency management within the Rust project.
>    - There is some concern that being part of Arrow will make Ballista less
>      attractive to contribute to due to the perceived "bureaucracy" of the
>      ASF process, such as requiring JIRA tickets to be filed, and the
>      (current) infrequent release cycles. I think this concern can be reduced
>      over time. There were similar concerns when DataFusion was donated and
>      that project seems to be thriving.
>
> Thanks,
>
> Andy.
>
> [1] https://github.com/apache/arrow/pull/9723
>
> On Thu, Mar 11, 2021 at 1:39 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > Hi Jack,
> >
> > Thanks for the input, and there are some interesting ideas there.
> >
> > If we were looking to break this into separate donations though I would
> > actually consider 2+3 to be the first piece to incorporate into DataFusion
> > because it would provide much better scalability compared to the current
> > model where we eagerly try and execute the entire query tree concurrently.
> >
> > I do think having Ballista in the same repo would make it easier to look
> > at pushing certain pieces down into the DataFusion crate rather than trying
> > to coordinate this across two projects where only one of them is under
> > Arrow governance.
> >
> > Thanks,
> >
> > Andy.
> >
> > On Thu, Mar 11, 2021 at 12:47 PM Jack Chan <j4ck....@gmail.com> wrote:
> >
> >> Hey Andy
> >>
> >> I want to discuss the areas of Ballista code that you proposed above to
> >> move to Arrow. These are:
> >> 1. serde code for translating between protobuf and
> >> Arrow/DataFusion/Ballista data structures
> >> 2. Distributed query planner
> >> 3. Scheduler process that coordinates query execution across available
> >> executors
> >> 4. Executor process that implements Flight protocol and executes query
> >> partitions and serializes results in Arrow IPC format
> >>
> >> So, 1+4 would make DataFusion an application server that can communicate
> >> through IPC. This is a good thing and easy to maintain. And 2+3 is the
> >> distributed computing part that is orthogonal to what DataFusion is doing.
> >> This is the more architectural and strategic part. Would it make sense to
> >> separate the discussion into two? i.e. we can move 1+4 into DataFusion in
> >> the short term, and discuss 2+3 further over the longer term. (This would
> >> create some extra work in Ballista. The only thing I am aware of is the
> >> need to refactor the executor to not have a hard dependency on the
> >> scheduler.)
> >>
> >>
> >> Jack
> >>
> >> On Thu, Mar 11, 2021 at 9:49 AM Andy Grove <andygrov...@gmail.com> wrote:
> >>
> >> > Thanks, Micah.
> >> >
> >> > Regarding integration testing, we currently have an integration test
> >> > script in the repo that spins up multiple processes in docker compose and
> >> > runs through a series of queries on a data set that can be generated
> >> > locally. I invested in some modest hardware (a refurbed 12 core proliant
> >> > rack server with 64 GB RAM) to be able to run these tests via CI (using
> >> > BuildKite) but have not got this set up yet. I am hopeful that with
> >> > Ballista in Apache Arrow it will be easier to find companies willing to
> >> > contribute a more scalable solution than this. In the short term, I can at
> >> > least run these tests nightly from master and catch regressions quickly.
> >> >
> >> > I agree with your views on tooling / workflow and I am going to step up
> >> > and start working with the Rust community to really dig into this and put
> >> > together some concrete proposals. The conversation does keep coming up,
> >> > and not just here on the mailing list. I am hearing many of the same
> >> > concerns from current Ballista contributors so there are valid concerns
> >> > here that we need to address, and I believe that we can address them over
> >> > time with some incremental improvements, but let's not get into that
> >> > discussion again here. I will follow up hopefully next week with something
> >> > on this.
> >> >
> >> > On Thu, Mar 11, 2021 at 9:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >
> >> > > I think having Ballista in Arrow sounds like a good idea in the short
> >> > > term.  It sounds like there is enough developer pain that bringing it
> >> > > here makes sense (providing existing Ballista contributors are happy with
> >> > > the change and current Rust maintainers are open to the work involved).
> >> > >
> >> > > One longer term concern is CI.  Setting up a good system for distributed
> >> > > testing requires a lot of investment and compute resources, but I think
> >> > > we can figure it out when the time comes.  In the short term it seems a
> >> > > mono-repo reduces the engineering effort to get a sane CI system working.
> >> > >
> >> > > As a point of reference Flink, Beam and Spark all seem to use mono-repos
> >> > > (their goals are likely a little different than Arrow's though).
> >> > >
> >> > > -Micah
> >> > >
> >> > > P.S.  I do think the tooling/workflow conversation should be discussed
> >> > > more but I think having a more concrete proposal that first starts from
> >> > > requirements and nice to haves and then gets to a proposed solution is
> >> > > important (i.e. pointing out pain points and problems is useful, but I
> >> > > think it ignores some of the current value the existing process
> >> > > provides).
> >> > >
> >> > > On Wed, Mar 10, 2021 at 5:13 PM Andy Grove <andygrov...@gmail.com> wrote:
> >> > >
> >> > > > Thanks for the feedback so far on this proposal. I really
> appreciate
> >> > > > everyone taking the time to put so much thought (and passion!)
> into
> >> > this.
> >> > > >
> >> > > > So far, I don't think anyone is opposed to the idea of donating
> >> > Ballista
> >> > > > but there are clearly concerns about an increased burden on
> current
> >> > > > maintainers.
> >> > > >
> >> > > > We also have re-started discussions around tooling and release
> >> > processes,
> >> > > > but it seems that there is no objection to Rust / DataFusion /
> >> Ballista
> >> > > > having more control over the release process but we have to put in
> >> the
> >> > > work
> >> > > > to make that happen. I am certainly motivated to help with this
> but
> >> I
> >> > > think
> >> > > > that is a separate conversation to donating Ballista.
> >> > > >
> >> > > > To reduce the burden on existing maintainers, we could consider
> >> > initially
> >> > > > adding Ballista in such a way that it doesn't slow down momentum
> on
> >> > > Arrow &
> >> > > > DataFusion by adding it as a separate Rust subproject that is not
> >> part
> >> > of
> >> > > > the Rust workspace, and have it depend on pinned commits
> initially.
> >> > This
> >> > > > would be a lightweight way of incubating the project within the
> >> > mono-repo
> >> > > > and at some point, we can add it to the main workspace. This would
> >> be
> >> > no
> >> > > > worse than the current situation, and it would be better because
> it
> >> is
> >> > at
> >> > > > least under Arrow governance.
> >> > > >
> >> > > > I would like to talk a bit more specifically about the donation at
> >> this
> >> > > > point now that there is some feedback.
> >> > > >
> >> > > > What I propose we donate from Ballista is:
> >> > > >
> >> > > >    - The ballista.proto file that defines an encoding for logical and
> >> > > >      physical query plans as well as cluster meta-data (this protobuf
> >> > > >      file could eventually be split into separate files for each area)
> >> > > >    - The Rust source code, which consists of these main areas:
> >> > > >       - serde code for translating between protobuf and
> >> > > >         Arrow/DataFusion/Ballista data structures
> >> > > >       - Distributed query planner
> >> > > >       - Scheduler process that coordinates query execution across
> >> > > >         available executors
> >> > > >       - Executor process that implements Flight protocol and executes
> >> > > >         query partitions and serializes results in Arrow IPC format
> >> > > >
> >> > > > I am proposing that we specifically exclude the following parts of the
> >> > > > Ballista repo from the donation:
> >> > > >
> >> > > >    - The work-in-progress JDBC driver which is not currently functional
> >> > > >    - The Spark benchmark code that I have been using for comparing
> >> > > >      performance
> >> > > >    - The Python bindings, which as far as I know are pretty much a fork
> >> > > >      of Jorge's datafusion-python project.
> >> > > >
> >> > > > I think it is also worth mentioning that Ballista is currently
> only
> >> ~8k
> >> > > > lines of code, which is pretty small in contrast to the >100k
> lines
> >> of
> >> > > code
> >> > > > in the Arrow Rust project currently.
> >> > > >
> >> > > > Let's keep the conversation going and see what other feedback
> there
> >> is
> >> > > > regarding the merits of donating Ballista, or not.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Andy.
> >> > > >
> >> > > > On Wed, Mar 10, 2021 at 3:13 PM Jorge Cardoso Leitão <
> >> > > > jorgecarlei...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > Wes, thanks a lot for your reply. Let me try to answer:
> >> > > > >
> >> > > > > 1. If the purpose of Ballista is to support multiple language
> >> > > > > > executors, what does segregating it from the other PL's (where
> >> > > > > > executors are being developed, too) serve to facilitate this
> >> goal?
> >> > > > > >
> >> > > > >
> >> > > > > It facilitates that goal because the stronger the coupling is, the more
> >> > entropic
> >> > > > the
> >> > > > > setup is, and the more energy is required to develop and
> maintain
> >> it.
> >> > > > > In this particular case, I imagine that each executor would
> >> depend on
> >> > > > > specific versions of each implementation, just like any other
> >> > dependent
> >> > > > > that is not
> >> > > > > maintained by Apache Arrow does.
> >> > > > >
> >> > > > > Or is the idea that every dependent should be on the mono-repo?
> >> If we
> >> > > > need
> >> > > > > to control our dependents like that, that usually indicates that
> >> we
> >> > > can't
> >> > > > > guarantee a stable API (which IMO is the root cause).
> >> > > > >
> >> > > > > 2. Use of the monorepo does not require a synchronized release
> >> cycle,
> >> > > > > > just as Rust does not require it now either. The only reason
> >> there
> >> > > > > > have not been independent Rust releases is because someone has
> >> not
> >> > > > > > volunteered to do it. Likewise, if DataFusion and Ballista are
> >> in
> >> > the
> >> > > > > > same git repository, they don't have to release at the same
> >> time as
> >> > > > > > the core arrow / parquet crates.
> >> > > > > >
> >> > > > >
> >> > > > > I thought that Rust needed to be synchronized with the major
> >> release
> >> > of
> >> > > > the
> >> > > > > repo. Is this no longer the case?
> >> > > > >
> >> > > > > 3. On an incremental basis, I do not believe the increased
> >> complexity
> >> > > > > is significant. A multi-repository setup can be actively worse
> >> when
> >> > > > > development work involves both repositories at the same time.
> This
> >> > can
> >> > > > > be mitigated by pinning the arrow / parquet crates as you point
> >> out,
> >> > > > > but that creates other issues.
> >> > > > >
> >> > > > > Could you enumerate parts from DataFusion or Ballista that would
> >> > > require
> >> > > > > work on Arrow at the same time? I proposed that division because
> >> I am
> >> > > > > reasonably confident they will not need to be developed at the same
> >> time.
> >> > I
> >> > > am
> >> > > > > confident of this because a) the APIs used by DataFusion are
> >> written
> >> > to
> >> > > > > minimize public surfaces, so that arrow can mutate without
> >> affecting
> >> > > > those
> >> > > > > APIs; b) I designed and implemented most of the DataFusion code
> >> > around
> >> > > > > built-in functions, aggregate functions, UDFs and UDAF.
> >> > > > >
> >> > > > > But maybe we can validate this here: Andy, during the development
> >> > > > > of Ballista, for which the largest changes to the Arrow repo were
> >> > > > > needed, did you have to change anything in the arrow crate or
> >> > > > > parquet crate, or was everything done in DataFusion? If yes to any,
> >> > > > > was there a significant burden in doing so?
> >> > > > >
> >> > > > > 4. Even without Jira, there is still the expectation for
> >> contributors
> >> > > > > > to communicate in a way that is compatible with the Apache
> Way.
> >> So
> >> > > > > > even without Jira, PMCs have an obligation to establish an
> >> > > alternative
> >> > > > > > structure to have consistently open dialogue / planning about
> >> what
> >> > > > > > people are working on or planning to work on in the future. If
> >> > > > > > contributors are extensively discussing / planning privately,
> >> these
> >> > > > > > discussions must be moved into the open, whether with design
> >> > > documents
> >> > > > > > or issues or e-mail discussions. This was discussed ad nauseam
> >> in
> >> > the
> >> > > > > > other thread so I won't rehash those arguments.
> >> > > > > >
> >> > > > >
> >> > > > > I fully agree, even though I think it is a bit difficult to
> >> > > > operationalize.
> >> > > > > Thus, let's try like this: would you consider, under the
> >> definition
> >> > > used
> >> > > > > above, discussions happening on github PRs and issues, such as
> >> what
> >> > > > airflow
> >> > > > > does <https://github.com/apache/airflow/issues> , as open?
> >> > > > >
> >> > > > > Aside from these issues, the biggest lost opportunity I see if
> >> > > > > > DF/Ballista is "cast away", as it were, is that it becomes
> >> unattractive
> >> > > > > > for the rest of us to build anything on top of these platforms
> >> > > > > > (because at that point we have a circular dependency, which is
> >> the
> >> > > > > > hellscape we escaped from with Parquet C++). I used the
> >> > > > > > datafusion-python project as an example — if that were in the
> >> Arrow
> >> > > > > > project I might consider using it in various ways or
> contribute
> >> to
> >> > > it,
> >> > > > > > but as an external project it's less interesting to me as
> >> something
> >> > > to
> >> > > > > > build on.
> >> > > > > >
> >> > > > >
> >> > > > > My feelings about transferring datafusion-python to arrow are
> >> shared
> >> > > > above:
> >> > > > > I see the idea of picking something that is well encapsulated
> and
> >> > > > > decoupled from the rest and blending it into something large and
> >> less
> >> > > > > decoupled as an entropy-generating activity, which requires more
> >> > energy
> >> > > > to
> >> > > > > maintain. Operationally, the way I would merge a project like
> >> > > > > datafusion-python into Apache would be by transferring ownership
> >> of
> >> > the
> >> > > > > repo on github, transfer ownership of the pypi project, and
> create
> >> > some
> >> > > > > secrets on github to keep twine working. Just like I mentioned
> for
> >> > > > > Ballista. If people lose interest in the project, then
> >> deprecating it
> >> > > > would
> >> > > > > be trivial (archive the repo). If people gain interest in it,
> >> growth
> >> > is
> >> > > > > also trivial (there is already a house in place and the goals
> are
> >> > well
> >> > > > > defined). The interfaces are the API contracts declared as
> pinned
> >> > > > > dependencies (in Cargo.toml / setup.py).
> >> > > > >
> >> > > > > Best,
> >> > > > > Jorge
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Mar 10, 2021 at 7:50 PM Wes McKinney <
> wesmck...@gmail.com
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > hi Jorge,
> >> > > > > >
> >> > > > > > I have some thoughts / questions on your arguments against use
> >> of
> >> > the
> >> > > > > > monorepo:
> >> > > > > >
> >> > > > > > 1. If the purpose of Ballista is to support multiple language
> >> > > > > > executors, what does segregating it from the other PL's (where
> >> > > > > > executors are being developed, too) serve to facilitate this
> >> goal?
> >> > > > > >
> >> > > > > > 2. Use of the monorepo does not require a synchronized release
> >> > cycle,
> >> > > > > > just as Rust does not require it now either. The only reason
> >> there
> >> > > > > > have not been independent Rust releases is because someone has
> >> not
> >> > > > > > volunteered to do it. Likewise, if DataFusion and Ballista are
> >> in
> >> > the
> >> > > > > > same git repository, they don't have to release at the same
> >> time as
> >> > > > > > the core arrow / parquet crates.
> >> > > > > >
> >> > > > > > 3. On an incremental basis, I do not believe the increased
> >> > complexity
> >> > > > > > is significant. A multi-repository setup can be actively worse
> >> when
> >> > > > > > development work involves both repositories at the same time.
> >> This
> >> > > can
> >> > > > > > be mitigated by pinning the arrow / parquet crates as you
> point
> >> > out,
> >> > > > > > but that creates other issues.
> >> > > > > >
> >> > > > > > 4. Even without Jira, there is still the expectation for
> >> > contributors
> >> > > > > > to communicate in a way that is compatible with the Apache
> Way.
> >> So
> >> > > > > > even without Jira, PMCs have an obligation to establish an
> >> > > alternative
> >> > > > > > structure to have consistently open dialogue / planning about
> >> what
> >> > > > > > people are working on or planning to work on in the future. If
> >> > > > > > contributors are extensively discussing / planning privately,
> >> these
> >> > > > > > discussions must be moved into the open, whether with design
> >> > > documents
> >> > > > > > or issues or e-mail discussions. This was discussed ad nauseam
> >> in
> >> > the
> >> > > > > > other thread so I won't rehash those arguments.
> >> > > > > >
> >> > > > > > Aside from these issues, the biggest lost opportunity I see if
> >> > > > > > DF/Ballista is "cast away", as it were, is that it becomes
> >> unattractive
> >> > > > > > for the rest of us to build anything on top of these platforms
> >> > > > > > (because at that point we have a circular dependency, which is
> >> the
> >> > > > > > hellscape we escaped from with Parquet C++). I used the
> >> > > > > > datafusion-python project as an example — if that were in the
> >> Arrow
> >> > > > > > project I might consider using it in various ways or
> contribute
> >> to
> >> > > it,
> >> > > > > > but as an external project it's less interesting to me as
> >> something
> >> > > to
> >> > > > > > build on.
> >> > > > > >
> >> > > > > > On Wed, Mar 10, 2021 at 12:13 PM Jorge Cardoso Leitão
> >> > > > > > <jorgecarlei...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > Hi,
> >> > > > > > >
> >> > > > > > > First of all, I want to thank you very much for your work on
> >> > > Ballista
> >> > > > > and
> >> > > > > > > for doing it in an open source environment. It is something
> >> that
> >> > > > should
> >> > > > > > be
> >> > > > > > > emphasised and celebrated.
> >> > > > > > >
> >> > > > > > > Secondly, wrt considering donating it to the Apache
> >> Foundation
> >> > > and
> >> > > > > > > Apache project in particular, I would say that we should be
> >> > honored
> >> > > > by
> >> > > > > > such
> >> > > > > > > consideration. In this context, my immediate reaction is:
> how
> >> can
> >> > > we
> >> > > > > best
> >> > > > > > > support Ballista's community?
> >> > > > > > >
> >> > > > > > > My initial thoughts in this direction are:
> >> > > > > > >
> >> > > > > > > * create a new git repo for DataFusion and Ballista to
> reside
> >> on
> >> > > > (e.g.
> >> > > > > > > arrow/ballista)
> >> > > > > > > * do not require the release cycle and versioning to be
> >> aligned
> >> > > with
> >> > > > > > > arrow's release cycle
> >> > > > > > > * do not require the usage of JIRA
> >> > > > > > > * pin the dependency of DataFusion on Arrow and parquet
> crate
> >> > (e.g.
> >> > > > to
> >> > > > > a
> >> > > > > > > specific commit)
> >> > > > > > >
> >> > > > > > > I feel that this setup would keep Ballista under the
> >> Foundation
> >> > and
> >> > > > > > Apache
> >> > > > > > > Arrow's umbrella and aligned with its goals, while at the
> same
> >> > time
> >> > > > put
> >> > > > > > the
> >> > > > > > > least amount of burden on its community, both in terms of
> >> > keeping a
> >> > > > > > strict
> >> > > > > > > release schedule, tooling and CI.
> >> > > > > > >
> >> > > > > > > The rationale for the above is that whenever something is
> >> > released
> >> > > on
> >> > > > > > > DataFusion (which hosts most of the physical ops), people
> will
> >> > also
> >> > > > > want
> >> > > > > > it
> >> > > > > > > quickly available on Ballista. Thus, having the two release
> >> > cycles
> >> > > > more
> >> > > > > > > closely related and independent of the arrow
> implementation's
> >> > cycle
> >> > > > is
> >> > > > > > > good. DataFusion does not have integration tests against
> other
> >> > > arrow
> >> > > > > > > implementations, and thus the integration tests are not
> >> relevant.
> >> > > > > > >
> >> > > > > > > There are 4 main reasons I would not recommend placing it in
> >> the
> >> > > > > > mono-repo:
> >> > > > > > >
> >> > > > > > > 1. It would not add much
> >> > > > > > > 2. It would place Ballista on the same release schedule and
> >> git
> >> > > > system
> >> > > > > as
> >> > > > > > > the rest of Arrow's implementation, which may not suit
> >> Ballista's
> >> > > own
> >> > > > > > > development pace (in either direction)
> >> > > > > > > 3. It further increases the complexity of the current repo
> >> > > > > > > 4. It would force its community to use JIRA, merge process,
> >> > > > components,
> >> > > > > > > etc, which may not be what its community wishes for
> >> > > > > > >
> >> > > > > > > The main risk I see is that because arrow's release cycle is
> >> slow
> >> > > and
> >> > > > > > major
> >> > > > > > > releases only, DataFusion risks missing arrow features from
> >> time
> >> > to
> >> > > > > time.
> >> > > > > > > We can mitigate this with cargo and pins to commit hashes.
> IMO
> >> > this
> >> > > > > risk
> >> > > > > > > exists in any dependency relationship and is usually a sign
> >> that
> >> > > > there
> >> > > > > is
> >> > > > > > > an API contract and thus a trust relationship involved,
> which
> >> is
> >> > a
> >> > > > good
> >> > > > > > > thing.
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Jorge
> >> > > > > > >
> >> > > > > > > On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <
> >> andygrov...@gmail.com
> >> > >
> >> > > > > wrote:
> >> > > > > > >
> >> > > > > > > > As many of you know, the reason that I got involved in
> Arrow
> >> > back
> >> > > > in
> >> > > > > > 2018
> >> > > > > > > > was that I wanted to build a distributed compute platform
> in
> >> > > Rust,
> >> > > > > with
> >> > > > > > > > capabilities similar to Apache Spark. This led to the
> >> creation
> >> > of
> >> > > > the
> >> > > > > > > > DataFusion query engine, which is an in-memory query
> engine
> >> and
> >> > > is
> >> > > > > now
> >> > > > > > part
> >> > > > > > > > of the Arrow repo.
> >> > > > > > > >
> >> > > > > > > > Over the past couple of years, I have been working outside
> >> of
> >> > > Arrow
> >> > > > > on
> >> > > > > > a
> >> > > > > > > > project named “Ballista” [1] to continue the journey of
> >> trying
> >> > to
> >> > > > > > build a
> >> > > > > > > > distributed version. Due to the pandemic, I have had time
> >> over
> >> > > the
> >> > > > > > winter
> >> > > > > > > > to put more effort into this project and have managed to
> >> build
> >> > a
> >> > > > > small
> >> > > > > > > > community around it over the past few months and the
> project
> >> > has
> >> > > > now
> >> > > > > > > > reached a point where the basic architecture has been
> proven
> >> > and
> >> > > it
> >> > > > > is
> >> > > > > > now
> >> > > > > > > > getting a lot of attention (more than 2k stars on GitHub
> >> just
> >> > > > > > recently) and
> >> > > > > > > > I think that it would now make sense to donate some or all
> >> of
> >> > the
> >> > > > > > project
> >> > > > > > > > to Apache Arrow and continue its growth here.
> >> > > > > > > >
> >> > > > > > > > For an overview of the project, please see the talk I
> >> recently
> >> > > gave
> >> > > > > at
> >> > > > > > the
> >> > > > > > > > New York Open Statistical Programming Meetup [2].
> >> > > > > > > >
> >> > > > > > > > Some of the benefits that I see in donating the project to
> >> > Arrow
> >> > > > are:
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >    -
> >> > > > > > > >
> >> > > > > > > >    DataFusion also needs a scheduler and it would probably
> >> make
> >> > > > sense
> >> > > > > > to
> >> > > > > > > >    push some parts of the Ballista scheduler down a level
> in
> >> > the
> >> > > > > stack
> >> > > > > > so
> >> > > > > > > > that
> >> > > > > > > >    the same approach is used to scale across cores in
> >> > DataFusion
> >> > > > and
> >> > > > > to
> >> > > > > > > > scale
> >> > > > > > > >    across nodes in Ballista.
> >> > > > > > > >    -
> >> > > > > > > >
> >> > > > > > > >    Ballista provides preliminary support for spill-to-disk
> >> > > > > > functionality
> >> > > > > > > >    (in Arrow IPC format) which could also benefit
> DataFusion
> >> > and
> >> > > > > > provide
> >> > > > > > > >    better scalability through out-of-core processing.
> >> > > > > > > >    -
> >> > > > > > > >
> >> > > > > > > >    Although the Ballista scheduler is implemented in Rust,
> >> it
> >> > is
> >> > > > > > possible
> >> > > > > > > >    to implement executors in other languages due to the
> use
> >> of
> >> > > > > Flight,
> >> > > > > > > > gRPC,
> >> > > > > > > >    and protobuf, so this may be of interest to other
> >> language
> >> > > > > > > > implementations
> >> > > > > > > >    of Arrow as well.
> >> > > > > > > >    -
> >> > > > > > > >
> >> > > > > > > >    There is already some overlap between Arrow and
> Ballista
> >> > > > > > contributors.
> >> > > > > > > >    -
> >> > > > > > > >
> >> > > > > > > >    Ballista unit tests will be part of Arrow CI which
> means
> >> > that
> >> > > > any
> >> > > > > > > >    changes to Arrow or DataFusion APIs that Ballista
> >> depends on
> >> > > > will
> >> > > > > > also
> >> > > > > > > >    require that the corresponding Ballista code is updated
> >> as
> >> > > part
> >> > > > of
> >> > > > > > the
> >> > > > > > > > same
> >> > > > > > > >    PR.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > My main goal with this email thread is to gauge interest
> in
> >> > > > donating
> >> > > > > > this
> >> > > > > > > > code. If there is interest in doing so then we can have a
> >> more
> >> > > > > detailed
> >> > > > > > > > follow-up conversation on exactly what would be donated
> and
> >> > where
> >> > > > it
> >> > > > > > would
> >> > > > > > > > go.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > I have also filed a GitHub issue in Ballista to get
> feedback
> >> > from
> >> > > > > > current
> >> > > > > > > > contributors [3].
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > I'm looking forward to hearing opinions on this!
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > >
> >> > > > > > > > Andy.
> >> > > > > > > >
> >> > > > > > > > [1] https://github.com/ballista-compute/ballista
> >> > > > > > > >
> >> > > > > > > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >> > > > > > > >
> >> > > > > > > > [3]
> https://github.com/ballista-compute/ballista/issues/646
> >> > > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
