However, this is currently broken in java refactor branch. I am fixing this in https://issues.apache.org/jira/browse/ARROW-1779
On Thu, Nov 9, 2017 at 12:32 PM, Li Jin <ice.xell...@gmail.com> wrote: > If null count is 0, the java library sets the validity vectors to all 1s. > > https://github.com/apache/arrow/blob/master/java/vector/ > src/main/java/org/apache/arrow/vector/BitVector.java#L61 > > > On Thu, Nov 9, 2017 at 12:23 PM, Wes McKinney <wesmck...@gmail.com> wrote: > >> Yep, see https://github.com/apache/arrow/blob/master/format/Layout. >> md#null-bitmaps >> >> "Arrays having a 0 null count may choose to not allocate the null bitmap." >> >> I do not know what the Java library will do in the event of 0 null >> count and 0-length validity bitmap -- in theory this should be >> accounted for already in the integration tests, but we might want to >> double check for our own sanity. In C++ we do not even bother with the >> validity bitmap when the null count is 0 >> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ >> ipc/reader.cc#L142 >> >> - Wes >> >> On Thu, Nov 9, 2017 at 12:09 PM, Brian Hulette <brian.hule...@ccri.com> >> wrote: >> > Ah! It didn't occur to me that a producer could just send a length-0 >> buffer >> > since the reader implementations should ignore it anyway. I don't mind >> the >> > 16 byte cost of the metadata - I was referring to the bloat of a 100% >> valid >> > vector, which could be substantial. >> > >> > Part of me wants to argue that the the dictionary indices are already >> > "special", since all other fields, including children, have their own >> > nullable attribute in the schema, while an index is specified by a lone >> > indexType. Treating the index more like a field makes it less special, >> in my >> > opinion. But that's just semantics, the ability to send length-0 >> buffers for >> > a 100% valid index accomplishes what I'm after. >> > >> > Brian >> > >> > >> > >> > On 11/09/2017 10:00 AM, Wes McKinney wrote: >> >>> >> >>> So I'll go after the other validity vector - maybe producers should be >> >>> allowed to omit the validity vector in the index? I just think if the >> goal >> >>> is to reduce bloat then redundant validity vectors seems like a >> logical >> >>> place to trim. >> >> >> >> Well, the cost of the additional buffer metadata is only 16 bytes -- >> >> on the wire I believe you are free to send a length-0 buffer if there >> >> are no nulls. I am not sure this is worth making the dictionary >> >> indices "special" during IPC reconstruction versus any other integer >> >> vector. >> >> >> >> The metadata bloat that we're trimming by removing the buffer layouts >> >> is more significant because the VectorLayout is a table, which has a >> >> larger footprint in Flatbuffers >> >> >> >> On Thu, Nov 9, 2017 at 9:20 AM, Brian Hulette <brian.hule...@ccri.com> >> >> wrote: >> >>> >> >>> Good point. Its a nice feature of the format that a dictionary batch >> and >> >>> a >> >>> record batch with a single column look exactly the same when they >> >>> represent >> >>> the same logical type. >> >>> >> >>> >> >>> So I'll go after the other validity vector - maybe producers should be >> >>> allowed to omit the validity vector in the index? I just think if the >> >>> goal >> >>> is to reduce bloat then redundant validity vectors seems like a >> logical >> >>> place to trim. >> >>> >> >>> Producers would need some way to communicate whether or not the index >> is >> >>> nullable. Right now there's only a single nullable flag in the Field >> >>> metadata >> >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L280), >> >>> which >> >>> determines whether or not both the index and the value vectors have a >> >>> validity vector. What if there were a second nullable flag in the >> >>> DictionaryEncoding table >> >>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L252) >> that >> >>> applies to the indexType? >> >>> >> >>> This idea does lead to one confusing edge case: a non-nullable >> dictionary >> >>> vector with a nullable index. Maybe that should be allowed though, >> that >> >>> would effectively represent the scheme that I was originally >> advocating >> >>> for >> >>> (validity vector in the index and not in the values). >> >>> >> >>> Brian >> >>> >> >>> >> >>> >> >>> On 11/08/2017 06:27 PM, Wes McKinney wrote: >> >>>> >> >>>> The dictionary batches simply wrap a record batch with one “column”. >> >>>> There >> >>>> should be no code difference (e.g. buffer layouts are the same) >> between >> >>>> the >> >>>> code handling the data in a dictionary and a normal record batches. >> In >> >>>> general, a dictionary may contain a null. >> >>>> >> >>>> On Wed, Nov 8, 2017 at 4:05 PM Brian Hulette <brian.hule...@ccri.com >> > >> >>>> wrote: >> >>>> >> >>>>> Agreed, that sounds like a great solution to this problem - the >> layout >> >>>>> information is redundant and it doesn't make sense to include it in >> >>>>> every schema. >> >>>>> >> >>>>> Although I would argue we should write down exactly what buffers are >> >>>>> supposed to go on the wire in the dictionary batches (i.e. value >> >>>>> vectors) as well. This should be largely the same as what goes on >> the >> >>>>> wire in a record batch for a non dictionary-encoded vector of the >> same >> >>>>> type, but there could be a difference. For example, if a dictionary >> >>>>> vector is nullable, do we really need a validity buffer in both the >> >>>>> index and in the values? I think that's the current behavior, but >> maybe >> >>>>> it would make sense to assert that a dictionary's value vector >> should >> >>>>> be >> >>>>> non-nullable, and nulls should be handled in the index vector? >> >>>>> >> >>>>> Brian >> >>>>> >> >>>>> >> >>>>> On 11/08/2017 05:24 PM, Wes McKinney wrote: >> >>>>>> >> >>>>>> Per Jacques' comment in ARROW-1693 >> >>>>>> >> >>>>> >> >>>>> https://issues.apache.org/jira/browse/ARROW-1693?page=com. >> atlassian.jira.plugin.system.issuetabpanels:comment-tabpane >> l&focusedCommentId=16244812#comment-16244812 >> >>>>> , >> >>>>>> >> >>>>>> I think we should remove the buffer layout from the metadata. It >> would >> >>>>>> be a good idea to do this for 0.8.0 since we're breaking the >> metadata >> >>>>>> anyway. >> >>>>>> >> >>>>>> In addition to bloating the size of the schemas on the wire, the >> >>>>>> buffer layout metadata provides redundant information which should >> be >> >>>>>> a strict part of the Arrow specification. I agree with Jacques >> that it >> >>>>>> would be better to write down exactly what buffers are supposed to >> go >> >>>>>> on the wire for each logical type. In the case of the dictionary >> >>>>>> vectors, it is the buffers for the indices, so the issue under >> >>>>>> discussion resolves itself if we nix the metadata. >> >>>>>> >> >>>>>> If writers are emitting possibly different buffer layouts (like >> >>>>>> omitting a null or zero-length buffer), it will introduce >> brittleness >> >>>>>> and cause much special casing to trickle down into the reader >> >>>>>> implementations. This seems like undue complexity. >> >>>>>> >> >>>>>> - Wes >> >>>>>> >> >>>>>> On Mon, Nov 6, 2017 at 9:33 AM, Brian Hulette < >> brian.hule...@ccri.com> >> >>>>> >> >>>>> wrote: >> >>>>>>> >> >>>>>>> We've been having some integration issues with reading Dictionary >> >>>>> >> >>>>> Vectors in >> >>>>>>> >> >>>>>>> the JS implementation - our current implementation can read arrow >> >>>>>>> files >> >>>>> >> >>>>> and >> >>>>>>> >> >>>>>>> streams generated by Java, but not by C++. Most of this >> discussion is >> >>>>>>> captured in ARROW-1693 [1]. >> >>>>>>> >> >>>>>>> It looks like ultimately the issue is that there are >> inconsistencies >> >>>>>>> in >> >>>>> >> >>>>> the >> >>>>>>> >> >>>>>>> way the various implementations handle buffer layouts for >> >>>>> >> >>>>> dictionary-encoded >> >>>>>>> >> >>>>>>> vectors in the Schema message. Some places write/read the buffer >> >>>>>>> layout >> >>>>> >> >>>>> for >> >>>>>>> >> >>>>>>> the value vector (the vector found in the dictionary batch), and >> >>>>>>> others >> >>>>>>> expect the layout for the index vector (the int vector found in >> the >> >>>>> >> >>>>> record >> >>>>>>> >> >>>>>>> batch). Both the Java and C++ IPC readers don't seem to care about >> >>>>>>> this >> >>>>>>> portion of the Schema, which explains why the integration tests >> are >> >>>>> >> >>>>> passing. >> >>>>>>> >> >>>>>>> Here's a fun ASCII table of how I think the Java/C++/JS IPC >> readers >> >>>>>>> and >> >>>>>>> writers handle those buffers layouts right now: >> >>>>>>> >> >>>>>>> | Writer | Reader >> >>>>>>> -----+--------------+------------- >> >>>>>>> Java | value vector | doesn't care >> >>>>>>> C++ | index vector | doesn't care >> >>>>>>> JS | N/A | value vector >> >>>>>>> >> >>>>>>> Note that I can only really speak with authority about the JS >> >>>>>>> implementation. I'd appreciate it if people more familiar with the >> >>>>> >> >>>>> other two >> >>>>>>> >> >>>>>>> could validate my claims. >> >>>>>>> >> >>>>>>> As far as I can tell the expected behavior isn't stated anywhere >> in >> >>>>>>> the >> >>>>>>> documentation, which I suppose explains the inconsistency. Paul >> >>>>>>> Taylor >> >>>>> >> >>>>> is >> >>>>>>> >> >>>>>>> currently working on resolving ARROW-1693 by making the JS reader >> >>>>> >> >>>>> ambivalent >> >>>>>>> >> >>>>>>> to buffer layout, but I think ultimately the correct solution is >> to >> >>>>> >> >>>>> agree on >> >>>>>>> >> >>>>>>> a consistent standard, and make the reader implementations >> >>>>>>> opinionated >> >>>>> >> >>>>> about >> >>>>>>> >> >>>>>>> the Schema buffer layouts (i.e. ARROW-1362 [2]). >> >>>>>>> >> >>>>>>> Personally, I don't really have an opinion either way about which >> >>>>> >> >>>>> vector's >> >>>>>>> >> >>>>>>> layout should be in the Schema. Either way we'll be missing some >> >>>>>>> layout >> >>>>>>> information though, so we should also consider where the >> information >> >>>>> >> >>>>> for the >> >>>>>>> >> >>>>>>> "other" vector might go. >> >>>>>>> >> >>>>>>> I know there's a release coming up, and now is probably not the >> time >> >>>>>>> to >> >>>>>>> tackle this problem, but I wanted to write it up while its fresh >> in >> >>>>>>> my >> >>>>> >> >>>>> mind. >> >>>>>>> >> >>>>>>> I'm fine shelving it until after 0.8. >> >>>>>>> >> >>>>>>> Brian >> >>>>>>> >> >>>>>>> [1] https://issues.apache.org/jira/browse/ARROW-1693 >> >>>>>>> [2] https://issues.apache.org/jira/browse/ARROW-1362 >> >>>>> >> >>>>> >> > >> > >