Hi Ji Liu, Thanks for getting the conversation started. I think a few things need to happen: 1. We need to clarify in the specification that not all dictionaries need to be present at the beginning. I plan on creating a PR for discussion that clarifies this point, as well as handling of non-delta dictionary batches discussed earlier [1]. 2. Java needs to support both of these once we vote on the clarification. One way of doing this, is what I think Jacques alluded to in [1]. For the in memory representation of dictionary encoding we will need to not only track a notion of "dictionary" id for the Vector, but also an addition 3. Lastly, we should have integration tests around these cases to make sure they are handled consistently.
Thanks, Micah [1] https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E On Wed, Aug 21, 2019 at 7:48 AM Ji Liu <niki...@aliyun.com> wrote: > Hi all, > > Recently when we worked on fixing a IPC related bug in both Java/C++ > sides[1][2], > @emkornfieldfound that the stream reader assumes that all dictionaries > are at the start of the stream which is inconsistent with spec[3] > which says as long as a record batch doesn't reference a dictionary they can > be interleaved. > > > Cases below should be supported, however they will crash at current > implementations. > i. have a record batch of one dictionary encoded column S > 1>Schema > 2>RecordBatch: S=[null, null, null, null] > 3>DictionaryBatch: ['abc', 'efg'] > 4>Recordbatch: S=[0, 1, 0, 1] > ii. have a record batch of two dictionary encoded column S1, S2 > 1>Schema > 2->DictionaryBatch S1: ['ab', 'cd'] > 3->RecordBatch: S1 = [0,1,0,1] S2 =[null, null, null,] > 4->DictionaryBatch S2: ['cc', 'dd'] > 5->RecordBatch: S1 = [0,1,0,1] S2 =[0,1,0,1] > > We already did some work on Java side via[1] > to make it possible to parse interleaved dictionaries and batches: > i. In ArrowStreamReader, do not read all dictionaries at the start > ii. > When call loadNextBatch, we read message to decide read dictionaries first or > directly read a batch, if former, read all dictionaries out before this batch. > > iii.When we read a batch, we check if the dictionaries it needed has already > been read, if not, check if it's all null column and decide whether need > throw exception. > > In this way, whatever they are interleaved or not, we can parse it properly. > > In the future, I think we should also support write interleaved dictionaries > and batches in IPC stream(created an > issue to track this[4]), but not quite clear how to implement this. > Any opinions about this are appreciated, thanks! > > Thanks, > Ji Liu > > [1] https://issues.apache.org/jira/browse/ARROW-6040, > [2] https://issues.apache.org/jira/browse/ARROW-6126, > [3] http://arrow.apache.org/docs/format/IPC.html#streaming-format, > [4]https://issues.apache.org/jira/browse/ARROW-6308 >