Any higher-level physical execution plan most likely needs a way to
represent expressions. Thus, focusing initially on a standard for
expressions might be a good way to add value while keeping the scope of the
effort reasonable.
On Thu, Mar 18, 2021 at 11:49 AM Micah Kornfield wrote:
> I think there mi
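To make the idea of an expression standard concrete, here is a minimal, hypothetical C++ sketch of a language-neutral expression tree (the type and field names below are invented for illustration and are not taken from Arrow, Gandiva, or Ballista):

#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Expression;

// The three node kinds most expression IRs need: a column reference,
// a literal value, and a function call over sub-expressions.
struct ColumnRef { std::string name; };
struct Literal   { std::variant<int64_t, double, std::string> value; };
struct Call      { std::string function;
                   std::vector<std::shared_ptr<Expression>> args; };

struct Expression {
  std::variant<ColumnRef, Literal, Call> node;
};

// e.g. the expression "a + 3" would be represented as the tree
// Call{"add", {ColumnRef{"a"}, Literal{int64_t{3}}}}.

A physical plan standard would then layer operators (scan, project, filter, aggregate, ...) on top of trees like this; the choice of serialization format (protobuf, flatbuffers, JSON) is a separate, smaller decision.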
Hi,
The main benefit I see for a standard for queries would not be the
serialization format but its semantics.
IMO one of the main reasons for the lack of a standard for queries at the
protobuf level is that human-readability vastly outweighs serialization -
queries are at most a megabyte
In order to allow easier data exploration and integration with
DataFusion-based systems, I propose adding SHOW TABLES and SHOW COLUMNS plus
partial information_schema support to DataFusion.
There is a proposal [2], linked to ARROW-12020 [1], should you be interested
in commenting.
[1] https://issues.
Hmm, I noticed this: "The IPC file format doesn't support dictionary
replacements or deltas." I was under the impression we aimed to support
dictionary deltas in the file format. If not, we should remove "Delta
dictionaries are applied in the order they appear in the file footer." from
the specification.
If dictionary replacements were supported, then the IPC file format
couldn't guarantee random access reads.
Personally, I would like to support a stream-based file format that is a
series of the Flight protobufs. In my extension of Arrow Flight, by
stuffing our state-based data into the app_metadata
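For context on the random-access point: the Arrow C++ file reader lets a caller load any record batch by index without scanning the batches before it, which only works if each column's dictionary is fully described by the footer rather than replaced partway through the file. A rough C++ sketch (error handling trimmed, local file path assumed):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <iostream>

arrow::Status ReadLastBatch(const std::string& path) {
  // Open the file for random access and read the footer.
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto reader,
                        arrow::ipc::RecordBatchFileReader::Open(file));
  int n = reader->num_record_batches();
  if (n == 0) return arrow::Status::Invalid("empty file");
  // Jump straight to the last batch; batches 0..n-2 are never touched,
  // so a mid-file dictionary replacement could not be observed here.
  ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(n - 1));
  std::cout << batch->num_rows() << " rows in the last batch\n";
  return arrow::Status::OK();
}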
Is anybody actively looking at PRs for Gandiva? There seems to be a queue
building, with 18 (or so) open. The committers that seemed to be active in the
past don't seem to be responding to pings through GitHub.
Thanks,
Micah
Micah,
My team is looking at the PRs. We are giving feedback.
We are also in touch with the committers (Ravindra and Praveen) and will
get them merged.
Thanks
Vivek
On Fri, Mar 19, 2021 at 9:29 AM Micah Kornfield wrote:
> Is anybody actively looking at PRs for Gandiva? There seems to be a queue
Hi all,
I would like to contribute to Arrow by working on the performance issues
with the newly introduced LZ4 compression in Java (JIRA:
https://issues.apache.org/jira/browse/ARROW-11901).
Can someone make me a "Contributor" in JIRA so I can assign myself?
Thank you!
Benjamin Wilhelm
>
> > 1) contribute the missing support ourselves
> I actually think we might need to proceed with this option.
I agree. I am willing to help with this and to explore different
approaches. I would start by looking into the JNI approach, either
contributing back to lz4-java or adding this to Arrow.
Benjamin
Hi Benjamin,
This should be done.
Regards
Antoine.
On 18/03/2021 at 10:58, Benjamin Wilhelm wrote:
Hi all,
I would like to contribute to Arrow by working on the performance issues
with the newly introduced LZ4 compression in Java (JIRA:
https://issues.apache.org/jira/browse/ARROW-11901)
The system you describe sounds quite cool. I don't know what is going on in
the Java world -- as you say, I think there is work afoot for technologies
similar in use case to DataFusion in C++ (though I suspect the
implementation will be fairly different)
On Wed, Mar 17, 2021 at 5:37 PM bobtins wrote:
Hi All,
I do not have a computer science background, so I may not be asking this in the
correct way or using the correct terminology, but I wonder if we can achieve
some level of standardization when describing computation over Arrow data.
At the moment, on the Rust side, DataFusion clearly has a w
Hi Paddy,
Thanks for raising this.
Ballista defines computations using protobuf [1] to describe logical and
physical query plans, which consist of operators and expressions. It is
actually based on the Gandiva protobuf [2] for describing expressions.
I see a lot of value in standardizing some of
I completely agree with developing a common “query protocol” or “physical
execution plan” IR + serialization scheme inside Apache Arrow. It may take
some time to stabilize, so we should try to avoid being hasty in closing it
to change until more time has elapsed to allow requirements to percolate.
I'm interested in providing some path to make this extensible. To pick an
example, suppose the user wants to compute the first k principal components.
We've talked [1] about the possibility of incorporating richer communication
semantics in Ballista (a la MPI sub-communicators) and numerical algorithms
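One common way to leave room for cases like this (purely a hypothetical sketch, not an existing Arrow or Ballista type) is an escape-hatch plan node that carries a namespaced name plus an opaque payload that only engines recognizing that name need to interpret:

#include <memory>
#include <string>
#include <vector>

// Hypothetical extension point for a shared plan IR.
struct PlanNode {
  virtual ~PlanNode() = default;
  std::vector<std::shared_ptr<PlanNode>> inputs;
};

struct ExtensionNode : PlanNode {
  std::string name;     // e.g. "example.org/pca" (made-up identifier)
  std::string payload;  // engine-specific options, e.g. a serialized "k"
};

An engine that does not understand the name can reject the plan cleanly instead of silently mis-executing it.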
I agree this would be a great development. It would also be useful for
leveraging compute engines from JS via wasm.
I've thought about something like this in the context of multi-language
relational workloads in Apache Beam, mostly just leading me to wonder if
something like it already exists. But
Somewhat related issue: https://issues.apache.org/jira/browse/ARROW-10406
On Wed, Mar 17, 2021 at 11:22 PM Micah Kornfield wrote:
> BTW, this nuance always felt a little strange to me, but would have
> required adding additional information to the file format, to disambiguate
> when exactly a di
Ah, interesting. So to make sure I understand correctly, the C++ write
implementation will scan all "batches" and unify all dictionary values
before writing out the schema + dictionary messages? But only when writing
the file format? In the streaming case, it would still write
replacement/delta dictionaries?
It's a bit more configurable, but basically yes. See the IPC write options:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/options.h#L73
Regards
Antoine.
On 18/03/2021 at 16:37, Jacob Quinn wrote:
Ah, interesting. So to make sure I understand correctly, the C++ write
implementation
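For readers following along, the relevant knobs are set on the IPC write options when constructing a file writer. A small C++ sketch (the field names below match recent Arrow C++ releases, but check options.h in your version, since the exact options and the line number in the link above may differ):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

arrow::Status WriteWithUnifiedDictionaries(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
    const std::string& path) {
  auto options = arrow::ipc::IpcWriteOptions::Defaults();
  // File format: gather each dictionary-encoded field's values across all
  // batches and write a single unified dictionary described by the footer.
  options.unify_dictionaries = true;
  // Stream format only: emit delta dictionaries instead of replacements
  // where possible.
  // options.emit_dictionary_deltas = true;

  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::MakeFileWriter(sink, schema, options));
  for (const auto& batch : batches) {
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  }
  return writer->Close();
}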
I think there might be discussion on two levels of computation: physical
query execution plans, and potentially something "lower level"? When this
has come up in the past, I was a little skeptical of constraining every SDK
to use the same description, so I agree with Wes's point about keeping any
For Java/JVM there is also a discussion on user@ about dataframe libraries.
On Thu, Mar 18, 2021 at 5:47 AM Andrew Lamb wrote:
> The system you describe sounds quite cool. I don't know what is going on in
> the Java world -- as you say, I think there is work afoot for technologies
> similar in use
>
> I would start looking into the JNI approach. Contributing back
> to lz4-java or adding this to Arrow.
A first step might be to compare the performance of the JNI approach vs
Airlift. The Airlift library only uses Java and claims to be potentially
faster. A JNI approach has the downside of r
There was a Google design doc from back in 2019 [1] where we discussed this
(or something similar at least).
I also remember reading about Weld's IR. It would be good to learn from
their work.
[1]
https://docs.google.com/document/d/1Uv1FmPs7uYMLoJUH1EF0oxm-ujtz1h1tJFl0zN60TIg/edit?usp=sharing
On