Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

Li Jin Thu, 09 Nov 2017 09:35:31 -0800

However, this is currently broken in java refactor branch. I am fixing this
in https://issues.apache.org/jira/browse/ARROW-1779


On Thu, Nov 9, 2017 at 12:32 PM, Li Jin <ice.xell...@gmail.com> wrote:

> If null count is 0, the java library sets the validity vectors to all 1s.
>
> https://github.com/apache/arrow/blob/master/java/vector/
> src/main/java/org/apache/arrow/vector/BitVector.java#L61
>
>
> On Thu, Nov 9, 2017 at 12:23 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Yep, see https://github.com/apache/arrow/blob/master/format/Layout.
>> md#null-bitmaps
>>
>> "Arrays having a 0 null count may choose to not allocate the null bitmap."
>>
>> I do not know what the Java library will do in the event of 0 null
>> count and 0-length validity bitmap -- in theory this should be
>> accounted for already in the integration tests, but we might want to
>> double check for our own sanity. In C++ we do not even bother with the
>> validity bitmap when the null count is 0
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/
>> ipc/reader.cc#L142
>>
>> - Wes
>>
>> On Thu, Nov 9, 2017 at 12:09 PM, Brian Hulette <brian.hule...@ccri.com>
>> wrote:
>> > Ah! It didn't occur to me that a producer could just send a length-0
>> buffer
>> > since the reader implementations should ignore it anyway. I don't mind
>> the
>> > 16 byte cost of the metadata - I was referring to the bloat of a 100%
>> valid
>> > vector, which could be substantial.
>> >
>> > Part of me wants to argue that the the dictionary indices are already
>> > "special", since all other fields, including children, have their own
>> > nullable attribute in the schema, while an index is specified by a lone
>> > indexType. Treating the index more like a field makes it less special,
>> in my
>> > opinion. But that's just semantics, the ability to send length-0
>> buffers for
>> > a 100% valid index accomplishes what I'm after.
>> >
>> > Brian
>> >
>> >
>> >
>> > On 11/09/2017 10:00 AM, Wes McKinney wrote:
>> >>>
>> >>> So I'll go after the other validity vector - maybe producers should be
>> >>> allowed to omit the validity vector in the index? I just think if the
>> goal
>> >>> is to reduce bloat then redundant validity vectors seems like a
>> logical
>> >>> place to trim.
>> >>
>> >> Well, the cost of the additional buffer metadata is only 16 bytes --
>> >> on the wire I believe you are free to send a length-0 buffer if there
>> >> are no nulls. I am not sure this is worth making the dictionary
>> >> indices "special" during IPC reconstruction versus any other integer
>> >> vector.
>> >>
>> >> The metadata bloat that we're trimming by removing the buffer layouts
>> >> is more significant because the VectorLayout is a table, which has a
>> >> larger footprint in Flatbuffers
>> >>
>> >> On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <brian.hule...@ccri.com>
>> >> wrote:
>> >>>
>> >>> Good point. Its a nice feature of the format that a dictionary batch
>> and
>> >>> a
>> >>> record batch with a single column look exactly the same when they
>> >>> represent
>> >>> the same logical type.
>> >>>
>> >>>
>> >>> So I'll go after the other validity vector - maybe producers should be
>> >>> allowed to omit the validity vector in the index? I just think if the
>> >>> goal
>> >>> is to reduce bloat then redundant validity vectors seems like a
>> logical
>> >>> place to trim.
>> >>>
>> >>> Producers would need some way to communicate whether or not the index
>> is
>> >>> nullable. Right now there's only a single nullable flag in the Field
>> >>> metadata
>> >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280),
>> >>> which
>> >>> determines whether or not both the index and the value vectors have a
>> >>> validity vector. What if there were a second nullable flag in the
>> >>> DictionaryEncoding table
>> >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252)
>> that
>> >>> applies to the indexType?
>> >>>
>> >>> This idea does lead to one confusing edge case: a non-nullable
>> dictionary
>> >>> vector with a nullable index. Maybe that should be allowed though,
>> that
>> >>> would effectively represent the scheme that I was originally
>> advocating
>> >>> for
>> >>> (validity vector in the index and not in the values).
>> >>>
>> >>> Brian
>> >>>
>> >>>
>> >>>
>> >>> On 11/08/2017 06:27 PM, Wes McKinney wrote:
>> >>>>
>> >>>> The dictionary batches simply wrap a record batch with one “column”.
>> >>>> There
>> >>>> should be no code difference (e.g. buffer layouts are the same)
>> between
>> >>>> the
>> >>>> code handling the data in a dictionary and a normal record batches.
>> In
>> >>>> general, a dictionary may contain a null.
>> >>>>
>> >>>> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com
>> >
>> >>>> wrote:
>> >>>>
>> >>>>> Agreed, that sounds like a great solution to this problem - the
>> layout
>> >>>>> information is redundant and it doesn't make sense to include it in
>> >>>>> every schema.
>> >>>>>
>> >>>>> Although I would argue we should write down exactly what buffers are
>> >>>>> supposed to go on the wire in the dictionary batches (i.e. value
>> >>>>> vectors) as well. This should be largely the same as what goes on
>> the
>> >>>>> wire in a record batch for a non dictionary-encoded vector of the
>> same
>> >>>>> type, but there could be a difference. For example, if a dictionary
>> >>>>> vector is nullable, do we really need a validity buffer in both the
>> >>>>> index and in the values? I think that's the current behavior, but
>> maybe
>> >>>>> it would make sense to assert that a dictionary's value vector
>> should
>> >>>>> be
>> >>>>> non-nullable, and nulls should be handled in the index vector?
>> >>>>>
>> >>>>> Brian
>> >>>>>
>> >>>>>
>> >>>>> On 11/08/2017 05:24 PM, Wes McKinney wrote:
>> >>>>>>
>> >>>>>> Per Jacques' comment in ARROW-1693
>> >>>>>>
>> >>>>>
>> >>>>> https://issues.apache.org/jira/browse/ARROW-1693?page=com.
>> atlassian.jira.plugin.system.issuetabpanels:comment-tabpane
>> l&focusedCommentId=16244812#comment-16244812
>> >>>>> ,
>> >>>>>>
>> >>>>>> I think we should remove the buffer layout from the metadata. It
>> would
>> >>>>>> be a good idea to do this for 0.8.0 since we're breaking the
>> metadata
>> >>>>>> anyway.
>> >>>>>>
>> >>>>>> In addition to bloating the size of the schemas on the wire, the
>> >>>>>> buffer layout metadata provides redundant information which should
>> be
>> >>>>>> a strict part of the Arrow specification. I agree with Jacques
>> that it
>> >>>>>> would be better to write down exactly what buffers are supposed to
>> go
>> >>>>>> on the wire for each logical type. In the case of the dictionary
>> >>>>>> vectors, it is the buffers for the indices, so the issue under
>> >>>>>> discussion resolves itself if we nix the metadata.
>> >>>>>>
>> >>>>>> If writers are emitting possibly different buffer layouts (like
>> >>>>>> omitting a null or zero-length buffer), it will introduce
>> brittleness
>> >>>>>> and cause much special casing to trickle down into the reader
>> >>>>>> implementations. This seems like undue complexity.
>> >>>>>>
>> >>>>>> - Wes
>> >>>>>>
>> >>>>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette <
>> brian.hule...@ccri.com>
>> >>>>>
>> >>>>> wrote:
>> >>>>>>>
>> >>>>>>> We've been having some integration issues with reading Dictionary
>> >>>>>
>> >>>>> Vectors in
>> >>>>>>>
>> >>>>>>> the JS implementation - our current implementation can read arrow
>> >>>>>>> files
>> >>>>>
>> >>>>> and
>> >>>>>>>
>> >>>>>>> streams generated by Java, but not by C++. Most of this
>> discussion is
>> >>>>>>> captured in ARROW-1693 [1].
>> >>>>>>>
>> >>>>>>> It looks like ultimately the issue is that there are
>> inconsistencies
>> >>>>>>> in
>> >>>>>
>> >>>>> the
>> >>>>>>>
>> >>>>>>> way the various implementations handle buffer layouts for
>> >>>>>
>> >>>>> dictionary-encoded
>> >>>>>>>
>> >>>>>>> vectors in the Schema message. Some places write/read the buffer
>> >>>>>>> layout
>> >>>>>
>> >>>>> for
>> >>>>>>>
>> >>>>>>> the value vector (the vector found in the dictionary batch), and
>> >>>>>>> others
>> >>>>>>> expect the layout for the index vector (the int vector found in
>> the
>> >>>>>
>> >>>>> record
>> >>>>>>>
>> >>>>>>> batch). Both the Java and C++ IPC readers don't seem to care about
>> >>>>>>> this
>> >>>>>>> portion of the Schema, which explains why the integration tests
>> are
>> >>>>>
>> >>>>> passing.
>> >>>>>>>
>> >>>>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC
>> readers
>> >>>>>>> and
>> >>>>>>> writers handle those buffers layouts right now:
>> >>>>>>>
>> >>>>>>>         | Writer       | Reader
>> >>>>>>> -----+--------------+-------------
>> >>>>>>> Java | value vector | doesn't care
>> >>>>>>> C++  | index vector | doesn't care
>> >>>>>>> JS   | N/A          | value vector
>> >>>>>>>
>> >>>>>>> Note that I can only really speak with authority about the JS
>> >>>>>>> implementation. I'd appreciate it if people more familiar with the
>> >>>>>
>> >>>>> other two
>> >>>>>>>
>> >>>>>>> could validate my claims.
>> >>>>>>>
>> >>>>>>> As far as I can tell the expected behavior isn't stated anywhere
>> in
>> >>>>>>> the
>> >>>>>>> documentation, which I suppose explains the inconsistency. Paul
>> >>>>>>> Taylor
>> >>>>>
>> >>>>> is
>> >>>>>>>
>> >>>>>>> currently working on resolving ARROW-1693 by making the JS reader
>> >>>>>
>> >>>>> ambivalent
>> >>>>>>>
>> >>>>>>> to buffer layout, but I think ultimately the correct solution is
>> to
>> >>>>>
>> >>>>> agree on
>> >>>>>>>
>> >>>>>>> a consistent standard, and make the reader implementations
>> >>>>>>> opinionated
>> >>>>>
>> >>>>> about
>> >>>>>>>
>> >>>>>>> the Schema buffer layouts (i.e. ARROW-1362 [2]).
>> >>>>>>>
>> >>>>>>> Personally, I don't really have an opinion either way about which
>> >>>>>
>> >>>>> vector's
>> >>>>>>>
>> >>>>>>> layout should be in the Schema. Either way we'll be missing some
>> >>>>>>> layout
>> >>>>>>> information though, so we should also consider where the
>> information
>> >>>>>
>> >>>>> for the
>> >>>>>>>
>> >>>>>>> "other" vector might go.
>> >>>>>>>
>> >>>>>>> I know there's a release coming up, and now is probably not the
>> time
>> >>>>>>> to
>> >>>>>>> tackle this problem, but I wanted to write it up while its fresh
>> in
>> >>>>>>> my
>> >>>>>
>> >>>>> mind.
>> >>>>>>>
>> >>>>>>> I'm fine shelving it until after 0.8.
>> >>>>>>>
>> >>>>>>> Brian
>> >>>>>>>
>> >>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693
>> >>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362
>> >>>>>
>> >>>>>
>> >
>>
>
>

Re: [DISCUSS] Buffer Layouts and Dictionary Vectors

Reply via email to