Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-16 Thread Jorge Cardoso Leitão
Hi, I agree. I skimmed through the code and it looks like a solid addition to me, so, again, thank you so much for donating it to Arrow. I very much look forward to learning more about how to build a scheduler that can adapt to both in-node and across-node execution :-) Best, Jorge On Tue, Mar 16, 2021

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-16 Thread Andy Grove
Thank you for all the responses so far. Based on this thread and the conversations happening in the Ballista project, I would say that the feedback is mostly positive and supportive of this donation, so I have started work on a PR [1] and will start a VOTE email thread once the PR is ready for revi

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-11 Thread Andy Grove
Hi Jack, Thanks for the input, and there are some interesting ideas there. If we were looking to break this into separate donations though I would actually consider 2+3 to be the first piece to incorporate into DataFusion because it would provide much better scalability compared to the current mo

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-11 Thread Jack Chan
Hey Andy, I want to discuss the areas of Ballista code that you proposed above to move to Arrow. These are: 1. serde code for translating between protobuf and Arrow/DataFusion/Ballista data structures 2. Distributed query planner 3. Scheduler process that coordinates query execution across availabl

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-11 Thread Andy Grove
Thanks, Micah. Regarding integration testing, we currently have an integration test script in the repo that spins up multiple processes in docker compose and runs through a series of queries on a data set that can be generated locally. I invested in some modest hardware (a refurbed 12 core prolian

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-11 Thread Micah Kornfield
I think having Ballista in Arrow sounds like a good idea in the short term. It sounds like there is enough developer pain, that bringing it here makes sense (providing existing Ballista contributors are happy with the change and current Rust maintainers are open to the work involved). One longer

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andy Grove
Thanks for the feedback so far on this proposal. I really appreciate everyone taking the time to put so much thought (and passion!) into this. So far, I don't think anyone is opposed to the idea of donating Ballista but there are clearly concerns about an increased burden on current maintainers.

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Jorge Cardoso Leitão
Hi, Wes, thanks a lot for your reply. Let me try to answer: > 1. If the purpose of Ballista is to support multiple language executors, what does segregating it from the other PL's (where executors are being developed, too) serve to facilitate this goal? It facilitates because the stronger th

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
Hi Jorge, I have some thoughts / questions on your arguments against use of the monorepo: 1. If the purpose of Ballista is to support multiple language executors, what does segregating it from the other PL's (where executors are being developed, too) serve to facilitate this goal? 2. Use of the

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andrew Lamb
> I think that the problem of "there are too many PRs in the review queue that are not relevant to me" has straightforward solutions For sure -- I welcome any and all technical assistance in improving efficiency. > Andrew - do you have more specific concerns that I am missing here? I think bur

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Jorge Cardoso Leitão
Hi, First of all, I want to thank you very much for your work on Ballista and for doing it in an open source environment. It is something that should be emphasised and celebrated. Secondly, with respect to considering donating it to the Apache Foundation and Apache project in particular, I would say that

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
I think that the problem of "there are too many PRs in the review queue that are not relevant to me" has straightforward solutions (like what Spark did https://spark-prs.appspot.com — if someone wants to fork this and make it work for Arrow that would be awesome, I would be willing to help if not o

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andy Grove
Wes - thanks for the clarification around possibilities for having multiple repositories within Arrow governance. I agree that having separate repos increases burdens around integration testing and dependency/release management, and having a monorepo makes those things much simpler. I think it is

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andrew Lamb
Thanks Wes -- I agree. I think moving datafusion out of the main arrow repo only makes sense when the interfaces it depends on (in arrow and parquet) have stabilized as that will minimize the mess / pain you describe. Andrew On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney wrote: > To give you a

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
To give you an example of what I’m talking about. Jorge has been building this project https://github.com/jorgecarleitao/datafusion-python I think it would actually be preferable to build projects like this in the monorepo because of the challenges and opportunities that arise in long term projec

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Wes McKinney
There is no problem with having multiple code-containing repositories in Apache Arrow, and the project can produce different release artifacts (for example, Parquet has Parquet-format and Parquet-mr and these release separately). I don’t think it’s a good idea to fragment the project governance / s

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andy Grove
Thanks, Andrew. I agree with your points and I do see the argument for DataFusion/Ballista being in their own repo. When I first donated DataFusion there was a discussion about the fact that it could be moved back out later on once it was more mature. I will go see if I can find that conversation.

Re: [DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-10 Thread Andrew Lamb
My thoughts are: 1. The scheduler and spill-to-disk/out-of-core operations sound very good to bring into DataFusion, and many people would benefit. 2. I think the Arrow GitHub project, and the unified workflow process in particular, is reaching its limits. Adding another cool but non-trivial project

[DISCUSS] [Rust] Donate Ballista to Apache Arrow

2021-03-09 Thread Andy Grove
As many of you know, the reason that I got involved in Arrow back in 2018 was that I wanted to build a distributed compute platform in Rust, with capabilities similar to Apache Spark. This led to the creation of the DataFusion query engine, which is an in-memory query engine and is now part of the