Hi,

I agree. I skimmed through the code and it looks like a solid addition to
me, so, again, thank you so much for donating it to Arrow. I very much look
forward to learning more about how to build a scheduler that can adapt to
both in-node and across-node executions :-)

Best,
Jorge

On Tue, Mar 16, 2021 at 3:21 PM Andy Grove <andygrov...@gmail.com> wrote:

> Thank you for all the responses so far. Based on this thread and the
> conversations happening in the Ballista project, I would say that the
> feedback is mostly positive and supportive of this donation, so I have
> started work on a PR [1] and will start a VOTE email thread once the PR is
> ready for review. Assuming that the vote passes, we will need to go through
> the IP clearance process.
>
> To summarize the feedback so far (at least, my interpretation of it):
>
> - There is consensus that it makes sense for DataFusion and Ballista to
>   live in the same repository so that we can keep them tightly coupled and
>   have queries scale across cores in DataFusion and across nodes in
>   Ballista using the same scheduler.
> - There is a desire to have a separate release process for DataFusion and
>   Ballista that is not aligned with core Arrow releases, and it appears
>   that there is no reason we cannot achieve this with Ballista living in
>   the Arrow repo.
> - There is a desire to have DataFusion and Ballista eventually live in a
>   new repo, separate from core Arrow. There is no objection to still being
>   under Arrow governance. If this donation goes through, I expect we will
>   be discussing this point again at some point in the future, when Ballista
>   is more mature.
> - There is some concern that this will add an additional burden to current
>   Arrow maintainers. I think we can reduce this burden by not requiring
>   that Ballista depend on the HEAD version of DataFusion/Arrow i.e. we
>   start doing some dependency management within the Rust project.
> - There is some concern that being part of Arrow will make Ballista less
>   attractive to contribute to due to the perceived "bureaucracy" of the ASF
>   process, such as requiring JIRA tickets to be filed, and the (current)
>   infrequent release cycles. I think this concern can be reduced over time.
>   There were similar concerns when DataFusion was donated and that project
>   seems to be thriving.
>
> Thanks,
>
> Andy.
>
> [1] https://github.com/apache/arrow/pull/9723
>
> On Thu, Mar 11, 2021 at 1:39 PM Andy Grove <andygrov...@gmail.com> wrote:
>
> > Hi Jack,
> >
> > Thanks for the input, and there are some interesting ideas there.
> >
> > If we were looking to break this into separate donations though I would
> > actually consider 2+3 to be the first piece to incorporate into
> > DataFusion because it would provide much better scalability compared to
> > the current model where we eagerly try and execute the entire query tree
> > concurrently.
> >
> > I do think having Ballista in the same repo would make it easier to look
> > at pushing certain pieces down into the DataFusion crate rather than
> > trying to coordinate this across two projects where only one of them is
> > under Arrow governance.
> >
> > Thanks,
> >
> > Andy.
> >
> > On Thu, Mar 11, 2021 at 12:47 PM Jack Chan <j4ck....@gmail.com> wrote:
> >
> >> Hey Andy
> >>
> >> I want to discuss the areas of Ballista code that you proposed above to
> >> move to Arrow. These are:
> >> 1. serde code for translating between protobuf and
> >>    Arrow/DataFusion/Ballista data structures
> >> 2. Distributed query planner
> >> 3. Scheduler process that coordinates query execution across available
> >>    executors
> >> 4. Executor process that implements Flight protocol and executes query
> >>    partitions and serializes results in Arrow IPC format
> >>
> >> So, 1+4 would make DataFusion an application server that can communicate
> >> through IPC. This is a good thing and easy to maintain.
> >> And, 2+3 is the distributed computing part that is orthogonal to what
> >> DataFusion is doing. This is the more architectural and strategic part.
> >> Would it make sense to separate the discussion into two? i.e. we can
> >> move 1+4 into DataFusion short-term, and discuss more about 2+3 in
> >> longer-term. (This would create some extra work in Ballista. And the
> >> only thing I am aware of is to refactor the executor to not have a hard
> >> dependency on scheduler.)
> >>
> >>
> >> Jack
> >>
> >> On Thu, Mar 11, 2021 at 9:49 AM Andy Grove <andygrov...@gmail.com> wrote:
> >>
> >> > Thanks, Micah.
> >> >
> >> > Regarding integration testing, we currently have an integration test
> >> > script in the repo that spins up multiple processes in docker compose
> >> > and runs through a series of queries on a data set that can be
> >> > generated locally. I invested in some modest hardware (a refurbed 12
> >> > core proliant rack server with 64 GB RAM) to be able to run these
> >> > tests via CI (using BuildKite) but have not got this set up yet. I am
> >> > hopeful that with Ballista in Apache Arrow it will be easier to find
> >> > companies willing to contribute a more scalable solution than this.
> >> > In the short term, I can at least run these tests nightly from master
> >> > and catch regressions quickly.
> >> >
> >> > I agree with your views on tooling / workflow and I am going to step
> >> > up and start working with the Rust community to really dig into this
> >> > and put together some concrete proposals. The conversation does keep
> >> > coming up, and not just here on the mailing list. I am hearing many of
> >> > the same concerns from current Ballista contributors so there are
> >> > valid concerns here that we need to address, and I believe that we can
> >> > address them over time with some incremental improvements, but let's
> >> > not get into that discussion again here.
> >> > I will follow up hopefully next week with something on this.
> >> >
> >> > On Thu, Mar 11, 2021 at 9:49 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >
> >> > > I think having Ballista in Arrow sounds like a good idea in the
> >> > > short term. It sounds like there is enough developer pain, that
> >> > > bringing it here makes sense (providing existing Ballista
> >> > > contributors are happy with the change and current Rust maintainers
> >> > > are open to the work involved).
> >> > >
> >> > > One longer term concern is CI. Setting up a good system for
> >> > > distributed testing requires a lot of investment and compute
> >> > > resources, but I think we can figure it out when it comes time. In
> >> > > the short term it seems a mono-repo reduces the engineering effort
> >> > > to get a sane CI system working.
> >> > >
> >> > > As a point of reference Flink, Beam and Spark all seem to use
> >> > > mono-repos (their goals are likely a little different than Arrow's
> >> > > though).
> >> > >
> >> > > -Micah
> >> > >
> >> > > P.S. I do think the tooling/workflow conversation should be
> >> > > discussed more but I think having a more concrete proposal that
> >> > > first starts from requirements and nice to haves and then gets to a
> >> > > proposed solution is important (i.e. pointing out pain points and
> >> > > problems is useful, but I think it ignores some of the current value
> >> > > the existing process provides).
> >> > >
> >> > > On Wed, Mar 10, 2021 at 5:13 PM Andy Grove <andygrov...@gmail.com> wrote:
> >> > >
> >> > > > Thanks for the feedback so far on this proposal. I really
> >> > > > appreciate everyone taking the time to put so much thought (and
> >> > > > passion!) into this.
> >> > > >
> >> > > > So far, I don't think anyone is opposed to the idea of donating
> >> > > > Ballista but there are clearly concerns about an increased burden
> >> > > > on current maintainers.
> >> > > >
> >> > > > We also have re-started discussions around tooling and release
> >> > > > processes, but it seems that there is no objection to Rust /
> >> > > > DataFusion / Ballista having more control over the release process
> >> > > > but we have to put in the work to make that happen. I am certainly
> >> > > > motivated to help with this but I think that is a separate
> >> > > > conversation to donating Ballista.
> >> > > >
> >> > > > To reduce the burden on existing maintainers, we could consider
> >> > > > initially adding Ballista in such a way that it doesn't slow down
> >> > > > momentum on Arrow & DataFusion by adding it as a separate Rust
> >> > > > subproject that is not part of the Rust workspace, and have it
> >> > > > depend on pinned commits initially. This would be a lightweight
> >> > > > way of incubating the project within the mono-repo and at some
> >> > > > point, we can add it to the main workspace. This would be no worse
> >> > > > than the current situation, and it would be better because it is
> >> > > > at least under Arrow governance.
> >> > > >
> >> > > > I would like to talk a bit more specifically about the donation at
> >> > > > this point now that there is some feedback.
> >> > > >
> >> > > > What I propose we donate from Ballista is:
> >> > > >
> >> > > > - The ballista.proto file that defines an encoding for logical and
> >> > > >   physical query plans as well as cluster meta-data (this protobuf
> >> > > >   file could eventually be split into separate files for each area)
> >> > > > - The Rust source code, which consists of these main areas:
> >> > > >   - serde code for translating between protobuf and
> >> > > >     Arrow/DataFusion/Ballista data structures
> >> > > >   - Distributed query planner
> >> > > >   - Scheduler process that coordinates query execution across
> >> > > >     available executors
> >> > > >   - Executor process that implements Flight protocol and executes
> >> > > >     query partitions and serializes results in Arrow IPC format
> >> > > >
> >> > > > I am proposing that we specifically exclude the following parts of
> >> > > > the Ballista repo from the donation:
> >> > > >
> >> > > > - The work-in-progress JDBC driver which is not currently
> >> > > >   functional
> >> > > > - The Spark benchmark code that I have been using for comparing
> >> > > >   performance
> >> > > > - The Python bindings, which as far as I know are pretty much a
> >> > > >   fork of Jorge's datafusion-python project.
> >> > > >
> >> > > > I think it is also worth mentioning that Ballista is currently
> >> > > > only ~8k lines of code, which is pretty small in contrast to the
> >> > > > >100k lines of code in the Arrow Rust project currently.
> >> > > >
> >> > > > Let's keep the conversation going and see what other feedback
> >> > > > there is regarding the merits of donating Ballista, or not.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Andy.
> >> > > >
> >> > > > On Wed, Mar 10, 2021 at 3:13 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi,
> >> > > > >
> >> > > > > Wes, thanks a lot for your reply. Let me try to answer:
> >> > > > >
> >> > > > > > 1. If the purpose of Ballista is to support multiple language
> >> > > > > > executors, what does segregating it from the other PL's (where
> >> > > > > > executors are being developed, too) serve to facilitate this
> >> > > > > > goal?
> >> > > > >
> >> > > > > It facilitates because the stronger the coupling is, the more
> >> > > > > entropic the setup is, and the more energy is required to
> >> > > > > develop and maintain it. In this particular case, I imagine that
> >> > > > > each executor would depend on specific versions of each
> >> > > > > implementation, just like any other dependent that is not
> >> > > > > maintained by Apache Arrow does.
> >> > > > >
> >> > > > > Or is the idea that every dependent should be on the mono-repo?
> >> > > > > If we need to control our dependents like that, that usually
> >> > > > > indicates that we can't guarantee a stable API (which IMO is the
> >> > > > > root cause).
> >> > > > >
> >> > > > > > 2. Use of the monorepo does not require a synchronized release
> >> > > > > > cycle, just as Rust does not require it now either. The only
> >> > > > > > reason there have not been independent Rust releases is
> >> > > > > > because someone has not volunteered to do it. Likewise, if
> >> > > > > > DataFusion and Ballista are in the same git repository, they
> >> > > > > > don't have to release at the same time as the core arrow /
> >> > > > > > parquet crates.
> >> > > > >
> >> > > > > I thought that Rust needed to be synchronized with the major
> >> > > > > release of the repo. Isn't this the case anymore?
> >> > > > >
> >> > > > > > 3. On an incremental basis, I do not believe the increased
> >> > > > > > complexity is significant. A multi-repository setup can be
> >> > > > > > actively worse when development work involves both
> >> > > > > > repositories at the same time. This can be mitigated by
> >> > > > > > pinning the arrow / parquet crates as you point out, but that
> >> > > > > > creates other issues.
> >> > > > >
> >> > > > > Could you enumerate parts from DataFusion or Ballista that would
> >> > > > > require work on Arrow at the same time? I proposed that division
> >> > > > > because I am reasonably confident they will not need to be
> >> > > > > developed at the same time. I am confident of this because a)
> >> > > > > the APIs used by DataFusion are written to minimize public
> >> > > > > surfaces, so that arrow can mutate without affecting those APIs;
> >> > > > > b) I designed and implemented most of the DataFusion code around
> >> > > > > built-in functions, aggregate functions, UDFs and UDAF.
> >> > > > >
> >> > > > > But maybe we can validate this here: Andy, during the
> >> > > > > development of Ballista, on which the largest changes on Arrow
> >> > > > > repo were needed, did you have to change anything on the Arrow
> >> > > > > crate or parquet crate, or was everything done on DataFusion? If
> >> > > > > yes to any, was there a significant burden in doing so?
> >> > > > >
> >> > > > > > 4. Even without Jira, there is still the expectation for
> >> > > > > > contributors to communicate in a way that is compatible with
> >> > > > > > the Apache Way. So even without Jira, PMCs have an obligation
> >> > > > > > to establish an alternative structure to have consistently
> >> > > > > > open dialogue / planning about what people are working on or
> >> > > > > > planning to work on in the future.
> >> > > > > > If contributors are extensively discussing / planning
> >> > > > > > privately, these discussions must be moved into the open,
> >> > > > > > whether with design documents or issues or e-mail discussions.
> >> > > > > > This was discussed ad nauseam in the other thread so I won't
> >> > > > > > rehash those arguments.
> >> > > > >
> >> > > > > I fully agree, even though I think it is a bit difficult to
> >> > > > > operationalize. Thus, let's try like this: would you consider,
> >> > > > > under the definition used above, discussions happening on github
> >> > > > > PRs and issues, such as what airflow does
> >> > > > > <https://github.com/apache/airflow/issues>, as open?
> >> > > > >
> >> > > > > > Aside from these issues, the biggest lost opportunity I see if
> >> > > > > > DF/Ballista is "cast away" as it were, is that it becomes
> >> > > > > > unattractive for the rest of us to build anything on top of
> >> > > > > > these platforms (because at that point we have a circular
> >> > > > > > dependency, which is the hellscape we escaped from with
> >> > > > > > Parquet C++). I used the datafusion-python project as an
> >> > > > > > example: if that were in the Arrow project I might consider
> >> > > > > > using it in various ways or contribute to it, but as an
> >> > > > > > external project it's less interesting to me as something to
> >> > > > > > build on.
> >> > > > >
> >> > > > > My feelings about transferring datafusion-python to arrow are
> >> > > > > shared above: I find the idea of picking something that is well
> >> > > > > encapsulated and decoupled from the rest and blending it into
> >> > > > > something large and less decoupled as an entropy-generating
> >> > > > > activity, which requires more energy to maintain.
> >> > > > > Operationally, the way I would merge a project like
> >> > > > > datafusion-python into Apache would be by transferring ownership
> >> > > > > of the repo on github, transfer ownership of the pypi project,
> >> > > > > and create some secrets on github to keep twine working. Just
> >> > > > > like I mentioned for Ballista. If people lose interest in the
> >> > > > > project, then deprecating it would be trivial (archive the
> >> > > > > repo). If people gain interest in it, growth is also trivial
> >> > > > > (there is already a house in place and the goals are well
> >> > > > > defined). The interfaces are the API contracts declared as
> >> > > > > pinned dependencies (in Cargo.toml / setup.py).
> >> > > > >
> >> > > > > Best,
> >> > > > > Jorge
> >> > > > >
> >> > > > > On Wed, Mar 10, 2021 at 7:50 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > > >
> >> > > > > > hi Jorge,
> >> > > > > >
> >> > > > > > I have some thoughts / questions on your arguments against use
> >> > > > > > of the monorepo:
> >> > > > > >
> >> > > > > > 1. If the purpose of Ballista is to support multiple language
> >> > > > > > executors, what does segregating it from the other PL's (where
> >> > > > > > executors are being developed, too) serve to facilitate this
> >> > > > > > goal?
> >> > > > > >
> >> > > > > > 2. Use of the monorepo does not require a synchronized release
> >> > > > > > cycle, just as Rust does not require it now either. The only
> >> > > > > > reason there have not been independent Rust releases is
> >> > > > > > because someone has not volunteered to do it. Likewise, if
> >> > > > > > DataFusion and Ballista are in the same git repository, they
> >> > > > > > don't have to release at the same time as the core arrow /
> >> > > > > > parquet crates.
> >> > > > > >
> >> > > > > > 3. On an incremental basis, I do not believe the increased
> >> > > > > > complexity is significant. A multi-repository setup can be
> >> > > > > > actively worse when development work involves both
> >> > > > > > repositories at the same time. This can be mitigated by
> >> > > > > > pinning the arrow / parquet crates as you point out, but that
> >> > > > > > creates other issues.
> >> > > > > >
> >> > > > > > 4. Even without Jira, there is still the expectation for
> >> > > > > > contributors to communicate in a way that is compatible with
> >> > > > > > the Apache Way. So even without Jira, PMCs have an obligation
> >> > > > > > to establish an alternative structure to have consistently
> >> > > > > > open dialogue / planning about what people are working on or
> >> > > > > > planning to work on in the future. If contributors are
> >> > > > > > extensively discussing / planning privately, these discussions
> >> > > > > > must be moved into the open, whether with design documents or
> >> > > > > > issues or e-mail discussions. This was discussed ad nauseam in
> >> > > > > > the other thread so I won't rehash those arguments.
> >> > > > > >
> >> > > > > > Aside from these issues, the biggest lost opportunity I see if
> >> > > > > > DF/Ballista is "cast away" as it were, is that it becomes
> >> > > > > > unattractive for the rest of us to build anything on top of
> >> > > > > > these platforms (because at that point we have a circular
> >> > > > > > dependency, which is the hellscape we escaped from with
> >> > > > > > Parquet C++). I used the datafusion-python project as an
> >> > > > > > example: if that were in the Arrow project I might consider
> >> > > > > > using it in various ways or contribute to it, but as an
> >> > > > > > external project it's less interesting to me as something to
> >> > > > > > build on.
> >> > > > > >
> >> > > > > > On Wed, Mar 10, 2021 at 12:13 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Hi,
> >> > > > > > >
> >> > > > > > > First of all, I want to thank you very much for your work on
> >> > > > > > > Ballista and for doing it in an open source environment. It
> >> > > > > > > is something that should be emphasised and celebrated.
> >> > > > > > >
> >> > > > > > > Secondly, with respect to considering donating it to the
> >> > > > > > > Apache Foundation and Apache project in particular, I would
> >> > > > > > > say that we should be honored by such consideration. In this
> >> > > > > > > context, my immediate reaction is: how can we best support
> >> > > > > > > Ballista's community?
> >> > > > > > >
> >> > > > > > > My initial thoughts in this direction are:
> >> > > > > > >
> >> > > > > > > * create a new git repo for DataFusion and Ballista to
> >> > > > > > >   reside on (e.g. arrow/ballista)
> >> > > > > > > * do not require the release cycle and versioning to be
> >> > > > > > >   aligned with arrow's release cycle
> >> > > > > > > * do not require the usage of JIRA
> >> > > > > > > * pin the dependency of Datafusion on Arrow and parquet
> >> > > > > > >   crate (e.g. to a specific commit)
> >> > > > > > >
> >> > > > > > > I feel that this setup would keep Ballista under the
> >> > > > > > > Foundation and Apache Arrow's umbrella and aligned with its
> >> > > > > > > goals, while at the same time put the least amount of burden
> >> > > > > > > on its community, both in terms of keeping a strict release
> >> > > > > > > schedule, tooling and CI.
> >> > > > > > >
> >> > > > > > > The rationale for the above is that whenever something is
> >> > > > > > > released on DataFusion (which hosts most of the physical
> >> > > > > > > ops), people will also want it quickly available on
> >> > > > > > > Ballista. Thus, having the two release cycles more closely
> >> > > > > > > related and independent of the arrow implementation's cycle
> >> > > > > > > is good. DataFusion does not have integration tests against
> >> > > > > > > other arrow implementations, and thus the integration tests
> >> > > > > > > are not relevant.
> >> > > > > > >
> >> > > > > > > There are 4 main reasons I would not recommend placing it in
> >> > > > > > > the mono-repo:
> >> > > > > > >
> >> > > > > > > 1. It would not add much
> >> > > > > > > 2. It would place Ballista on the same release schedule and
> >> > > > > > >    git system as the rest of Arrow's implementation, which
> >> > > > > > >    may not suit Ballista's own development pace (in either
> >> > > > > > >    direction)
> >> > > > > > > 3. It further increases the complexity of the current repo
> >> > > > > > > 4. It would force its community to use JIRA, merge process,
> >> > > > > > >    components, etc, which may not be what its community
> >> > > > > > >    wishes for
> >> > > > > > >
> >> > > > > > > The main risk I see is that because arrow's release cycle is
> >> > > > > > > slow and major releases only, DataFusion risks missing arrow
> >> > > > > > > features from time to time. We can mitigate this with cargo
> >> > > > > > > and pins to commit hashes. IMO this risk exists in any
> >> > > > > > > dependency relationship and is usually a sign that there is
> >> > > > > > > an API contract and thus a trust relationship involved,
> >> > > > > > > which is a good thing.
> >> > > > > > >
> >> > > > > > > Best,
> >> > > > > > > Jorge
> >> > > > > > >
> >> > > > > > > On Tue, Mar 9, 2021 at 6:31 PM Andy Grove <andygrov...@gmail.com> wrote:
> >> > > > > > >
> >> > > > > > > > As many of you know, the reason that I got involved in
> >> > > > > > > > Arrow back in 2018 was that I wanted to build a
> >> > > > > > > > distributed compute platform in Rust, with capabilities
> >> > > > > > > > similar to Apache Spark. This led to the creation of the
> >> > > > > > > > DataFusion query engine, which is an in-memory query
> >> > > > > > > > engine and is now part of the Arrow repo.
> >> > > > > > > >
> >> > > > > > > > Over the past couple of years, I have been working outside
> >> > > > > > > > of Arrow on a project named “Ballista” [1] to continue the
> >> > > > > > > > journey of trying to build a distributed version. Due to
> >> > > > > > > > the pandemic, I have had time over the winter to put more
> >> > > > > > > > effort into this project and have managed to build a small
> >> > > > > > > > community around it over the past few months and the
> >> > > > > > > > project has now reached a point where the basic
> >> > > > > > > > architecture has been proven and it is now getting a lot
> >> > > > > > > > of attention (more than 2k stars on GitHub just recently)
> >> > > > > > > > and I think that it would now make sense to donate some or
> >> > > > > > > > all of the project to Apache Arrow and continue its growth
> >> > > > > > > > here.
> >> > > > > > > >
> >> > > > > > > > For an overview of the project, please see the talk I
> >> > > > > > > > recently gave at the New York Open Statistical Programming
> >> > > > > > > > Meetup [2].
> >> > > > > > > >
> >> > > > > > > > Some of the benefits that I see in donating the project to
> >> > > > > > > > Arrow are:
> >> > > > > > > >
> >> > > > > > > > - DataFusion also needs a scheduler and it would probably
> >> > > > > > > >   make sense to push some parts of the Ballista scheduler
> >> > > > > > > >   down a level in the stack so that the same approach is
> >> > > > > > > >   used to scale across cores in DataFusion and to scale
> >> > > > > > > >   across nodes in Ballista.
> >> > > > > > > > - Ballista provides preliminary support for spill-to-disk
> >> > > > > > > >   functionality (in Arrow IPC format) which could also
> >> > > > > > > >   benefit DataFusion and provide better scalability
> >> > > > > > > >   through out-of-core processing.
> >> > > > > > > > - Although the Ballista scheduler is implemented in Rust,
> >> > > > > > > >   it is possible to implement executors in other languages
> >> > > > > > > >   due to the use of Flight, gRPC, and protobuf, so this
> >> > > > > > > >   may be of interest to other language implementations of
> >> > > > > > > >   Arrow as well.
> >> > > > > > > > - There is already some overlap between Arrow and Ballista
> >> > > > > > > >   contributors.
> >> > > > > > > > - Ballista unit tests will be part of Arrow CI which means
> >> > > > > > > >   that any changes to Arrow or DataFusion APIs that
> >> > > > > > > >   Ballista depends on will also require that the
> >> > > > > > > >   corresponding Ballista code is updated as part of the
> >> > > > > > > >   same PR.
> >> > > > > > > >
> >> > > > > > > > My main goal with this email thread is to gauge interest
> >> > > > > > > > in donating this code. If there is interest in doing so
> >> > > > > > > > then we can have a more detailed follow-up conversation on
> >> > > > > > > > exactly what would be donated and where it would go.
> >> > > > > > > >
> >> > > > > > > > I have also filed a GitHub issue in Ballista to get
> >> > > > > > > > feedback from current contributors [3].
> >> > > > > > > >
> >> > > > > > > > I'm looking forward to hearing opinions on this!
> >> > > > > > > >
> >> > > > > > > > Thanks,
> >> > > > > > > >
> >> > > > > > > > Andy.
> >> > > > > > > >
> >> > > > > > > > [1] https://github.com/ballista-compute/ballista
> >> > > > > > > >
> >> > > > > > > > [2] https://www.youtube.com/watch?v=ZZHQaOap9pQ
> >> > > > > > > >
> >> > > > > > > > [3] https://github.com/ballista-compute/ballista/issues/646