As many of you know, the reason that I got involved in Arrow back in 2018
was that I wanted to build a distributed compute platform in Rust, with
capabilities similar to Apache Spark. This led to the creation of the
DataFusion query engine, which is an in-memory query engine and is now part
of the Arrow repo.

Over the past couple of years, I have been working outside of Arrow on a
project named “Ballista” [1] to continue the journey of trying to build a
distributed version. Due to the pandemic, I have had time over the winter
to put more effort into this project, and I have managed to build a small
community around it over the past few months. The project has now reached
a point where the basic architecture has been proven, and it is getting a
lot of attention (it recently passed 2k stars on GitHub). I think that it
would now make sense to donate some or all of the project to Apache Arrow
and continue its growth here.

For an overview of the project, please see the talk I recently gave at the
New York Open Statistical Programming Meetup [2].

Some of the benefits that I see in donating the project to Arrow are:


   - DataFusion also needs a scheduler and it would probably make sense to
     push some parts of the Ballista scheduler down a level in the stack so
     that the same approach is used to scale across cores in DataFusion and
     to scale across nodes in Ballista.

   - Ballista provides preliminary support for spill-to-disk functionality
     (in Arrow IPC format) which could also benefit DataFusion and provide
     better scalability through out-of-core processing.

   - Although the Ballista scheduler is implemented in Rust, it is possible
     to implement executors in other languages due to the use of Flight,
     gRPC, and protobuf, so this may be of interest to other language
     implementations of Arrow as well.

   - There is already some overlap between Arrow and Ballista contributors.

   - Ballista unit tests will be part of Arrow CI which means that any
     changes to Arrow or DataFusion APIs that Ballista depends on will also
     require that the corresponding Ballista code is updated as part of the
     same PR.


My main goal with this email thread is to gauge interest in donating this
code. If there is interest in doing so then we can have a more detailed
follow-up conversation on exactly what would be donated and where it would
go.


I have also filed a GitHub issue in Ballista to get feedback from current
contributors [3].


I'm looking forward to hearing opinions on this!


Thanks,

Andy.

[1] https://github.com/ballista-compute/ballista

[2] https://www.youtube.com/watch?v=ZZHQaOap9pQ

[3] https://github.com/ballista-compute/ballista/issues/646
