Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

Wes McKinney Wed, 08 Nov 2017 14:26:02 -0800

Per Jacques' comment in ARROW-1693
https://issues.apache.org/jira/browse/ARROW-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244812#comment-16244812,
I think we should remove the buffer layout from the metadata. It would
be a good idea to do this for 0.8.0 since we're breaking the metadata
anyway.


In addition to bloating the size of the schemas on the wire, the
buffer layout metadata duplicates information that should instead be
fixed by the Arrow specification. I agree with Jacques that it
would be better to write down exactly what buffers are supposed to go
on the wire for each logical type. In the case of the dictionary
vectors, it is the buffers for the indices, so the issue under
discussion resolves itself if we nix the metadata.
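
A minimal Python sketch of what "writing down exactly what buffers go
on the wire" could look like. The names EXPECTED_BUFFERS and
buffers_for_field are hypothetical and not part of any Arrow
implementation; the buffer lists assume the usual Arrow physical layout
(validity bitmap first, then offsets and/or data).

```python
# Hypothetical table: logical type name -> buffers expected on the wire, in order.
EXPECTED_BUFFERS = {
    "int32": ["validity", "data"],
    "utf8": ["validity", "offsets", "data"],
    "list": ["validity", "offsets"],
    "struct": ["validity"],
}


def buffers_for_field(logical_type, index_type=None):
    """Return the buffer names a record batch carries for one field.

    For a dictionary-encoded field, the record batch holds the indices,
    so the layout is that of the index type, not of the dictionary values.
    """
    if index_type is not None:
        return EXPECTED_BUFFERS[index_type]
    return EXPECTED_BUFFERS[logical_type]
```

Under this assumption, a utf8 field that is dictionary-encoded with
int32 indices would carry only a validity and a data buffer in the
record batch, so no per-schema layout metadata is needed to read it.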

If writers are emitting possibly different buffer layouts (like
omitting a null or zero-length buffer), it will introduce brittleness
and cause much special casing to trickle down into the reader
implementations. This seems like undue complexity.
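
To make the brittleness concrete, here is a small sketch of a reader
that splits a flat list of buffers using fixed per-type counts. The
function name, the BUFFER_COUNTS table, and the string placeholders for
buffers are illustrative assumptions, not an actual Arrow reader.

```python
# Hypothetical fixed buffer counts per type (validity bitmap always present).
BUFFER_COUNTS = {"int32": 2, "utf8": 3}


def assign_buffers(field_types, flat_buffers):
    """Split a flat buffer list into per-field groups using fixed counts."""
    expected = sum(BUFFER_COUNTS[t] for t in field_types)
    if len(flat_buffers) != expected:
        # A writer that omitted a buffer is rejected instead of being
        # silently misread with shifted buffer assignments.
        raise ValueError(f"expected {expected} buffers, got {len(flat_buffers)}")
    groups, pos = [], 0
    for t in field_types:
        n = BUFFER_COUNTS[t]
        groups.append(flat_buffers[pos:pos + n])
        pos += n
    return groups
```

With a fixed layout this single length check is all the reader needs.
If writers may omit buffers, every reader has to carry extra per-field
logic to work out which buffer is which.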

- Wes

On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <brian.hule...@ccri.com> wrote:
> We've been having some integration issues with reading Dictionary Vectors in
> the JS implementation - our current implementation can read arrow files and
> streams generated by Java, but not by C++. Most of this discussion is
> captured in ARROW-1693 [1].
>
> It looks like ultimately the issue is that there are inconsistencies in the
> way the various implementations handle buffer layouts for dictionary-encoded
> vectors in the Schema message. Some places write/read the buffer layout for
> the value vector (the vector found in the dictionary batch), and others
> expect the layout for the index vector (the int vector found in the record
> batch). Neither the Java nor the C++ IPC reader seems to care about this
> portion of the Schema, which explains why the integration tests are passing.
> Here's a fun ASCII table of how I think the Java/C++/JS IPC readers and
> writers handle those buffer layouts right now:
>
>      | Writer       | Reader
> -----+--------------+-------------
> Java | value vector | doesn't care
> C++  | index vector | doesn't care
> JS   | N/A          | value vector
>
> Note that I can only really speak with authority about the JS
> implementation. I'd appreciate it if people more familiar with the other two
> could validate my claims.
>
> As far as I can tell the expected behavior isn't stated anywhere in the
> documentation, which I suppose explains the inconsistency. Paul Taylor is
> currently working on resolving ARROW-1693 by making the JS reader indifferent
> to buffer layout, but I think ultimately the correct solution is to agree on
> a consistent standard, and make the reader implementations opinionated about
> the Schema buffer layouts (i.e. ARROW-1362 [2]).
>
> Personally, I don't really have an opinion either way about which vector's
> layout should be in the Schema. Either way we'll be missing some layout
> information though, so we should also consider where the information for the
> "other" vector might go.
>
> I know there's a release coming up, and now is probably not the time to
> tackle this problem, but I wanted to write it up while it's fresh in my mind.
> I'm fine shelving it until after 0.8.
>
> Brian
>
> [1] https://issues.apache.org/jira/browse/ARROW-1693
> [2] https://issues.apache.org/jira/browse/ARROW-1362
