hi all, I agree with handling the interleaved dictionary case. While we are at it, I think we should formally allow dictionary replacements.
Note that I just opened https://github.com/apache/arrow/pull/5202 to revamp / consolidate the format documents. So any changes will need to be based on that. One problem with our integration testing strategy -- the JSON files don't exactly simulate a stream, they are more like the current File format. If we want to properly generate integration test cases, we should also slightly generalize the integration test JSON format to contain a collection of messages like [ {"message_type": "Schema", ... }, {"message_type": "DictionaryBatch", ... }, {"message_type": "RecordBatch", ... }, {"message_type": "RecordBatch", ... }, {"message_type": "DictionaryBatch", ... }, ... ] I don't know that it will be all that difficult to make this change but we would have 4 implementations that have to get fixed up. - Wes On Wed, Aug 21, 2019 at 10:58 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > > Hi Ji Liu, > Thanks for getting the conversation started. I think a few things need to > happen: > 1. We need to clarify in the specification that not all dictionaries need > to be present at the beginning. I plan on creating a PR for discussion > that clarifies this point, as well as handling of non-delta dictionary > batches discussed earlier [1]. > 2. Java needs to support both of these once we vote on the > clarification. One way of doing this, is what I think Jacques alluded to in > [1]. For the in memory representation of dictionary encoding we will need > to not only track a notion of "dictionary" id for the Vector, but also an > addition > 3. Lastly, we should have integration tests around these cases to make > sure they are handled consistently. > > Thanks, > Micah > > [1] > https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > > On Wed, Aug 21, 2019 at 7:48 AM Ji Liu <niki...@aliyun.com> wrote: > > > Hi all, > > > > Recently when we worked on fixing a IPC related bug in both Java/C++ > > sides[1][2], > > @emkornfieldfound that the stream reader assumes that all dictionaries > > are at the start of the stream which is inconsistent with spec[3] > > which says as long as a record batch doesn't reference a dictionary they > > can be interleaved. > > > > > > Cases below should be supported, however they will crash at current > > implementations. > > i. have a record batch of one dictionary encoded column S > > 1>Schema > > 2>RecordBatch: S=[null, null, null, null] > > 3>DictionaryBatch: ['abc', 'efg'] > > 4>Recordbatch: S=[0, 1, 0, 1] > > ii. have a record batch of two dictionary encoded column S1, S2 > > 1>Schema > > 2->DictionaryBatch S1: ['ab', 'cd'] > > 3->RecordBatch: S1 = [0,1,0,1] S2 =[null, null, null,] > > 4->DictionaryBatch S2: ['cc', 'dd'] > > 5->RecordBatch: S1 = [0,1,0,1] S2 =[0,1,0,1] > > > > We already did some work on Java side via[1] > > to make it possible to parse interleaved dictionaries and batches: > > i. In ArrowStreamReader, do not read all dictionaries at the start > > ii. > > When call loadNextBatch, we read message to decide read dictionaries first > > or directly read a batch, if former, read all dictionaries out before this > > batch. > > > > iii.When we read a batch, we check if the dictionaries it needed has > > already been read, if not, check if it's all null column and decide whether > > need throw exception. > > > > In this way, whatever they are interleaved or not, we can parse it properly. > > > > In the future, I think we should also support write interleaved > > dictionaries and batches in IPC stream(created an > > issue to track this[4]), but not quite clear how to implement this. > > Any opinions about this are appreciated, thanks! > > > > Thanks, > > Ji Liu > > > > [1] https://issues.apache.org/jira/browse/ARROW-6040, > > [2] https://issues.apache.org/jira/browse/ARROW-6126, > > [3] http://arrow.apache.org/docs/format/IPC.html#streaming-format, > > [4]https://issues.apache.org/jira/browse/ARROW-6308 > >