Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Andrew Lamb
Any higher-level physical execution plan most likely needs a way to represent expressions. Thus, focusing initially on a standard for expressions might be a good way to add value but keep the scope of the effort reasonable. On Thu, Mar 18, 2021 at 11:49 AM Micah Kornfield wrote: > I think there mi

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Jorge Cardoso Leitão
Hi, The main benefit I see for a standard for queries would not be on a serialization format, but on its semantics. IMO one of the main reasons for a lack of a standard of queries at the protobuf level is that human-readability vastly outweighs serialization - queries are at very most a megabyte

[Rust][DataFusion] Proposal to add SHOW TABLES and SHOW COLUMNS + information_schema support

2021-03-18 Thread Andrew Lamb
In order to allow easier data exploration and integration with DataFusion-based systems, I propose adding SHOW TABLES and SHOW COLUMNS + partial information_schema support to DataFusion. There is a proposal [2], linked to ARROW-12020 [1], should you be interested in commenting. [1] https://issues.

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Micah Kornfield
Hmm, I noticed this "The IPC file format doesn't support dictionary replacements or deltas." I was under the impression we aimed to support dictionary deltas in the file format. If not we should remove "Delta dictionaries are applied in the order they appear in the file footer." from the specifica

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Nate Bauernfeind
If dictionary replacements were supported, then the IPC file format couldn't guarantee random access reads. Personally, I would like to support a stream-based file format that is a series of the Flight protobufs. In my extension of arrow flight, by stuffing our state-based data into the app_metada

[Gandiva] Active maintainers?

2021-03-18 Thread Micah Kornfield
Is anybody actively looking at PRs for Gandiva? There seems to be a queue building (18 or so open). The committers that seemed to be active in the past don't seem to be responding to pings through GitHub. Thanks, Micah

Re: [Gandiva] Active maintainers?

2021-03-18 Thread Vivekanand Vellanki
Micah, My team is looking at the PRs. We are giving feedback. We are also in touch with the committers (Ravindra and Praveen) and will get them merged. Thanks Vivek On Fri, Mar 19, 2021 at 9:29 AM Micah Kornfield wrote: > Is anybody actively looking at PRs for Gandiva? There seems to be queu

[JIRA Permissions] Assigning myself to ARROW-11901

2021-03-18 Thread Benjamin Wilhelm
Hi all, I would like to contribute to Arrow by working on the performance issues with the newly introduced LZ4 compression in Java (JIRA: https://issues.apache.org/jira/browse/ARROW-11901). Can someone make me a "Contributor" in JIRA so I can assign myself? Thank you! Benjamin Wilhelm

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-18 Thread Benjamin Wilhelm
> > > 1) contribute the missing support ourselves > I actually think we might need to proceed with this option. I agree. I am willing to help with this and explore and try different approaches. I would start looking into the JNI approach. Contributing back to lz4-java or adding this to Arrow. Be

Re: [JIRA Permissions] Assigning myself to ARROW-11901

2021-03-18 Thread Antoine Pitrou
Hi Benjamin, This should be done. Regards Antoine. On 18/03/2021 at 10:58, Benjamin Wilhelm wrote: Hi all, I would like to contribute to Arrow by working on the performance issues with the newly introduced LZ4 compression in Java (JIRA: https://issues.apache.org/jira/browse/ARROW-11901)

Re: [Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

2021-03-18 Thread Andrew Lamb
The system you describe sounds quite cool. I don't know what is going on in the Java world -- as you say, I think there is work afoot for technologies similar in use case to DataFusion in C++ (though I suspect the implementation will be fairly different). On Wed, Mar 17, 2021 at 5:37 PM bobtins wrot

[DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread paddy horan
Hi All, I do not have a computer science background so I may not be asking this in the correct way or using the correct terminology but I wonder if we can achieve some level of standardization when describing computation over Arrow data. At the moment on the Rust side DataFusion clearly has a w

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Andy Grove
Hi Paddy, Thanks for raising this. Ballista defines computations using protobuf [1] to describe logical and physical query plans, which consist of operators and expressions. It is actually based on the Gandiva protobuf [2] for describing expressions. I see a lot of value in standardizing some of

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Wes McKinney
I completely agree with developing a common “query protocol” or “physical execution plan” IR + serialization scheme inside Apache Arrow. It may take some time to stabilize so we should try to avoid being hasty in closing it to change until more time has elapsed to allow requirements to percolate.

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Jed Brown
I'm interested in providing some path to make this extensible. To pick an example, suppose the user wants to compute the first k principal components. We've talked [1] about the possibility of incorporating richer communication semantics in Ballista (a la MPI sub-communicators) and numerical alg

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Brian Hulette
I agree this would be a great development. It would also be useful for leveraging compute engines from JS via wasm. I've thought about something like this in the context of multi-language relational workloads in Apache Beam, mostly just leading me to wonder if something like it already exists. But

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Neal Richardson
Somewhat related issue: https://issues.apache.org/jira/browse/ARROW-10406 On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield wrote: > BTW, this nuance always felt a little strange to me, but would have > required adding additional information to the file format, to disambiguate > when exactly a di

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Jacob Quinn
Ah, interesting. So to make sure I understand correctly, the C++ write implementation will scan all "batches" and unify all dictionary values before writing out the schema + dictionary messages? But only when writing the file format? In the streaming case, it would still write replacement/delta dic

Re: No replacement dictionaries supported in pyarrow?

2021-03-18 Thread Antoine Pitrou
It's a bit more configurable, but basically yes. See the IPC write options: https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73 Regards Antoine. On 18/03/2021 at 16:37, Jacob Quinn wrote: Ah, interesting. So to make sure I understand correctly, the C++ write imple

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Micah Kornfield
I think there might be discussion on two levels of computation, physical query execution plans, and potentially something "lower level"? When this has come up in the past, I was a little skeptical of constraining every SDK to use the same description, so I agree with Wes's point about keeping any

Re: [Rust][DataFusion] Query Engine Design / DataFusion Implementation talk

2021-03-18 Thread Micah Kornfield
For Java/JVM there is also a discussion on user@ about dataframe libraries. On Thu, Mar 18, 2021 at 5:47 AM Andrew Lamb wrote: > The system you describe sounds quite cool. I don't know what is going on > the Java world -- as you say I think there is work a foot for technologies > similar in use

Re: [DISCUSS] Revisiting LZ4 Compression for Arrow Buffers

2021-03-18 Thread Micah Kornfield
> > I would start looking into the JNI approach. Contributing back > to lz4-java or adding this to Arrow. A first step might be to compare the performance of the JNI approach vs Airlift. The airlift library only uses Java and claims to be potentially faster. A JNI approach has the downside of r

Re: [DISCUSS] How to describe computation on Arrow data?

2021-03-18 Thread Andy Grove
There was a Google design doc from back in 2019 [1] where we discussed this (or something similar at least). I also remember reading about Weld's IR. It would be good to learn from their work. [1] https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing On