Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

Jorge Cardoso Leitão Wed, 10 Mar 2021 10:13:45 -0800

Hi,

First of all, I want to thank you very much for your work on Ballista and
for doing it in an open source environment. It is something that should be
emphasised and celebrated.

Secondly, with respect to donating it to the Apache Software Foundation,
and to the Apache Arrow project in particular, I would say that we should
be honored by such consideration. In this context, my immediate reaction
is: how can we best support Ballista's community?

My initial thoughts in this direction are:

* create a new git repo for DataFusion and Ballista to reside in (e.g.
arrow/ballista)
* do not require the release cycle and versioning to be aligned with
arrow's release cycle
* do not require the usage of JIRA
* pin DataFusion's dependency on the Arrow and parquet crates (e.g. to a
specific commit)

I feel that this setup would keep Ballista under the Foundation and Apache
Arrow's umbrella and aligned with its goals, while at the same time putting
the least amount of burden on its community in terms of keeping a strict
release schedule, tooling and CI.

The rationale for the above is that whenever something is released on
DataFusion (which hosts most of the physical ops), people will also want it
quickly available on Ballista. Thus, having the two release cycles more
closely related and independent of the arrow implementation's cycle is
good. DataFusion does not have integration tests against other arrow
implementations, and thus the integration tests are not relevant.

There are 4 main reasons I would not recommend placing it in the mono-repo:

1. It would not add much
2. It would place Ballista on the same release schedule and git system as
the rest of Arrow's implementations, which may not suit Ballista's own
development pace (in either direction)
3. It further increases the complexity of the current repo
4. It would force its community to adopt Arrow's JIRA, merge process,
components, etc., which may not be what its community wishes for

The main risk I see is that, because arrow's release cycle is slow and
consists of major releases only, DataFusion risks missing arrow features
from time to time. We can mitigate this in cargo by pinning the
dependencies to commit hashes. IMO this risk exists in any dependency
relationship and is usually a sign that there is an API contract and thus a
trust relationship involved, which is a good thing.
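
As a rough sketch of what such a pin could look like in DataFusion's
Cargo.toml, assuming the crates keep being served from the main arrow git
repository (the repository URL and the commit hash below are placeholders
for illustration, not actual values):

```toml
[dependencies]
# Placeholder URL and rev: replace with the real repository and the
# full commit hash that DataFusion has been tested against.
arrow = { git = "https://github.com/apache/arrow", rev = "<commit-hash>" }
parquet = { git = "https://github.com/apache/arrow", rev = "<commit-hash>" }
```

Using the same rev for both crates keeps them consistent with each other,
and moving to newer arrow features becomes an explicit change of that hash.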

Best,
Jorge

On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <[email protected]> wrote:

> As many of you know, the reason that I got involved in Arrow back in 2018
> was that I wanted to build a distributed compute platform in Rust, with
> capabilities similar to Apache Spark. This led to the creation of the
> DataFusion query engine, which is an in-memory query engine and is now part
> of the Arrow repo.
>
> Over the past couple of years, I have been working outside of Arrow on a
> project named “Ballista” [1] to continue the journey of trying to build a
> distributed version. Due to the pandemic, I have had time over the winter
> to put more effort into this project and have managed to build a small
> community around it over the past few months and the project has now
> reached a point where the basic architecture has been proven and it is now
> getting a lot of attention (more than 2k stars on GitHub just recently) and
> I think that it would now make sense to donate some or all of the project
> to Apache Arrow and continue its growth here.
>
> For an overview of the project, please see the talk I recently gave at the
> New York Open Statistical Programming Meetup [2].
>
> Some of the benefits that I see in donating the project to Arrow are:
>
>
>    - DataFusion also needs a scheduler and it would probably make sense to
>      push some parts of the Ballista scheduler down a level in the stack so
>      that the same approach is used to scale across cores in DataFusion and
>      to scale across nodes in Ballista.
>
>    - Ballista provides preliminary support for spill-to-disk functionality
>      (in Arrow IPC format) which could also benefit DataFusion and provide
>      better scalability through out-of-core processing.
>
>    - Although the Ballista scheduler is implemented in Rust, it is possible
>      to implement executors in other languages due to the use of Flight,
>      gRPC, and protobuf, so this may be of interest to other language
>      implementations of Arrow as well.
>
>    - There is already some overlap between Arrow and Ballista contributors.
>
>    - Ballista unit tests will be part of Arrow CI which means that any
>      changes to Arrow or DataFusion APIs that Ballista depends on will also
>      require that the corresponding Ballista code is updated as part of the
>      same PR.
>
>
> My main goal with this email thread is to gauge interest in donating this
> code. If there is interest in doing so then we can have a more detailed
> follow-up conversation on exactly what would be donated and where it would
> go.
>
>
> I have also filed a GitHub issue in Ballista to get feedback from current
> contributors [3].
>
>
> I'm looking forward to hearing opinions on this!
>
>
> Thanks,
>
> Andy.
>
> [1] https://github.com/ballista-compute/ballista
>
> [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
>
> [3] https://github.com/ballista-compute/ballista/issues/646
>
