Hi,
I agree.
I skimmed through the code and it looks like a solid addition to me, so, again,
thank you so much for donating it to Arrow.
I very much look forward to learning more about how to build a scheduler that
can adapt to both in-node and across-node execution :-)
Best,
Jorge
On Tue, Mar 16, 2021
Thank you for all the responses so far. Based on this thread and the
conversations happening in the Ballista project, I would say that the
feedback is mostly positive and supportive of this donation, so I have
started work on a PR [1] and will start a VOTE email thread once the PR is
ready for review.
Hi Jack,
Thanks for the input, and there are some interesting ideas there.
If we were looking to break this into separate donations, though, I would
actually consider 2+3 to be the first piece to incorporate into DataFusion
because it would provide much better scalability compared to the current
mo
Hey Andy
I want to discuss the areas of Ballista code that you proposed above to
move to Arrow. These are:
1. serde code for translating between protobuf and
Arrow/DataFusion/Ballista data structures
2. Distributed query planner
3. Scheduler process that coordinates query execution across available executors
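To make the first area above concrete, a minimal sketch of that kind of serde
translation layer might look like the following. ProjectionNodeProto and
LogicalPlanNode are hypothetical stand-ins here, not the actual Ballista
protobuf or DataFusion types.

use std::convert::TryFrom;

// Hypothetical stand-in for a protobuf-generated message (not the real Ballista .proto).
struct ProjectionNodeProto {
    column_names: Vec<String>,
}

// Hypothetical stand-in for an in-memory plan node (not the real DataFusion type).
#[derive(Debug)]
enum LogicalPlanNode {
    Projection { columns: Vec<String> },
}

impl TryFrom<ProjectionNodeProto> for LogicalPlanNode {
    type Error = String;

    // Translate the wire representation into the in-memory representation,
    // rejecting messages that cannot form a valid plan node.
    fn try_from(proto: ProjectionNodeProto) -> Result<Self, Self::Error> {
        if proto.column_names.is_empty() {
            return Err("projection must reference at least one column".to_string());
        }
        Ok(LogicalPlanNode::Projection {
            columns: proto.column_names,
        })
    }
}

fn main() {
    let proto = ProjectionNodeProto {
        column_names: vec!["id".to_string(), "total".to_string()],
    };
    let plan = LogicalPlanNode::try_from(proto).expect("valid projection");
    println!("{:?}", plan);
}

The point of fallible TryFrom-style conversions is to keep the protobuf layer
at the process boundary, so the planner and scheduler can work purely with the
in-memory types.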
Thanks, Micah.
Regarding integration testing, we currently have an integration test script
in the repo that spins up multiple processes in docker compose and runs
through a series of queries on a data set that can be generated locally. I
invested in some modest hardware (a refurbished 12-core ProLiant
I think having Ballista in Arrow sounds like a good idea in the short
term. It sounds like there is enough developer pain that bringing it here
makes sense (provided existing Ballista contributors are happy with the
change and current Rust maintainers are open to the work involved).
One longer
Thanks for the feedback so far on this proposal. I really appreciate
everyone taking the time to put so much thought (and passion!) into this.
So far, I don't think anyone is opposed to the idea of donating Ballista
but there are clearly concerns about an increased burden on current
maintainers.
Hi,
Wes, thanks a lot for your reply. Let me try to answer:
> 1. If the purpose of Ballista is to support multiple language
> executors, what does segregating it from the other PL's (where
> executors are being developed, too) serve to facilitate this goal?
It facilitates because the stronger th
hi Jorge,
I have some thoughts / questions on your arguments against use of the monorepo:
1. If the purpose of Ballista is to support multiple language
executors, what does segregating it from the other PL's (where
executors are being developed, too) serve to facilitate this goal?
2. Use of the
> I think that the problem of "there are too many PRs in the review
> queue that are not relevant to me" has straightforward solutions
For sure -- I welcome any and all technical assistance in improving
efficiency.
> Andrew - do you have more specific concerns that I am missing here?
I think bur
Hi,
First of all, I want to thank you very much for your work on Ballista and
for doing it in an open source environment. It is something that should be
emphasised and celebrated.
Secondly, with regard to considering donating it to the Apache Foundation and
this Apache project in particular, I would say that
I think that the problem of "there are too many PRs in the review
queue that are not relevant to me" has straightforward solutions (like
what Spark did https://spark-prs.appspot.com — if someone wants to
fork this and make it work for Arrow that would be awesome, I would be
willing to help if not o
Wes - thanks for the clarification around possibilities for having multiple
repositories within Arrow governance. I agree that having separate repos
increases burdens around integration testing and dependency/release
management, and that having a monorepo makes those things much simpler.
I think it is
Thanks Wes -- I agree. I think moving DataFusion out of the main Arrow repo
only makes sense when the interfaces it depends on (in arrow and parquet)
have stabilized, as that will minimize the mess / pain you describe.
Andrew
On Wed, Mar 10, 2021 at 10:09 AM Wes McKinney wrote:
> To give you a
To give you an example of what I'm talking about: Jorge has been building
this project
https://github.com/jorgecarleitao/datafusion-python
I think it would actually be preferable to build projects like this in the
monorepo because of the challenges and opportunities that arise in long
term projec
There is no problem with having multiple code-containing repositories in
Apache Arrow, and the project can produce different release artifacts (for
example, Parquet has Parquet-format and Parquet-mr, and these are released
separately). I don’t think it’s a good idea to fragment the project
governance / s
Thanks, Andrew.
I agree with your points and I do see the argument for DataFusion/Ballista
being in their own repo. When I first donated DataFusion there was a
discussion about the fact that it could be moved back out later on once it
was more mature. I will go see if I can find that conversation.
My thoughts are:
1. The scheduler and spill-to-disk/out-of-core operations sound very good
to bring into DataFusion, and many people would benefit.
2. I think the Arrow GitHub project, and the unified workflow process in
particular, is reaching its limits. Adding another cool but non-trivial
project
As many of you know, the reason that I got involved in Arrow back in 2018
was that I wanted to build a distributed compute platform in Rust, with
capabilities similar to Apache Spark. This led to the creation of the
DataFusion query engine, an in-memory engine that is now part
of the