Hi all,

Recently, while fixing an IPC-related bug on both the Java and C++ sides [1][2], @emkornfield found that the stream reader assumes all dictionaries are at the start of the stream. This is inconsistent with the spec [3], which says dictionaries and record batches can be interleaved, as long as a record batch doesn't reference a dictionary that hasn't appeared yet.

The cases below should be supported, but they crash with the current 
implementations.
i. a stream with one dictionary-encoded column S
 1> Schema
 2> RecordBatch: S = [null, null, null, null]
 3> DictionaryBatch: ['abc', 'efg']
 4> RecordBatch: S = [0, 1, 0, 1]
ii. a stream with two dictionary-encoded columns S1, S2
 1> Schema
 2> DictionaryBatch S1: ['ab', 'cd']
 3> RecordBatch: S1 = [0, 1, 0, 1], S2 = [null, null, null, null]
 4> DictionaryBatch S2: ['cc', 'dd']
 5> RecordBatch: S1 = [0, 1, 0, 1], S2 = [0, 1, 0, 1]
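
The spec rule these cases rely on can be illustrated with a small sketch. The message model below is purely hypothetical, invented for illustration — none of these names come from the Arrow libraries:

```python
# Hypothetical message model for illustration only; not the Arrow API.
# A stream is a list of (kind, payload) tuples: a "dictionary" payload
# is the dictionary id it defines, and a "batch" payload is the set of
# dictionary ids the batch actually references (an all-null
# dictionary-encoded column references no dictionary yet).

def is_legal_stream(messages):
    """Spec rule: a record batch may only reference dictionaries
    that appeared earlier in the stream."""
    seen = set()
    for kind, payload in messages:
        if kind == "dictionary":
            seen.add(payload)
        elif kind == "batch" and not payload <= seen:
            return False
    return True

# Case i above: the first batch is all null and references no
# dictionary, so the dictionary may legally arrive afterwards.
case_i = [
    ("schema", None),
    ("batch", set()),    # S = [null, null, null, null]
    ("dictionary", 0),   # ['abc', 'efg']
    ("batch", {0}),      # S = [0, 1, 0, 1]
]
assert is_legal_stream(case_i)
```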

We have already done some work on the Java side via [1] to make it possible to 
parse interleaved dictionaries and batches:
 i. In ArrowStreamReader, do not read all dictionaries at the start of the 
stream.
 ii. When loadNextBatch is called, read the next message to decide whether to 
read dictionaries first or to read a batch directly; in the former case, read 
all pending dictionaries before loading the batch.
 iii. When reading a batch, check whether the dictionaries it needs have 
already been read; if not, check whether the column is all-null and decide 
whether to throw an exception.
This way, we can parse the stream properly whether or not dictionaries and 
batches are interleaved.
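
The steps above can be sketched roughly as follows, over a simplified message model (the names and structure here are mine, not the Arrow Java classes; a "dictionary" message carries the id it defines and a "batch" message carries the set of dictionary ids it references):

```python
def load_next_batch(messages, loaded):
    """Consume messages until one record batch is loaded. Dictionary
    batches encountered on the way are read first (step ii); if a
    batch references a dictionary that never arrived, raise (step iii
    -- a real reader would first check for the all-null case).
    Returns the batch's dictionary references, or None at end of stream."""
    while messages:
        kind, payload = messages.pop(0)
        if kind == "dictionary":
            loaded.add(payload)  # read dictionaries before the batch
        elif kind == "batch":
            missing = payload - loaded
            if missing:
                raise ValueError("batch references unread dictionaries: %s" % missing)
            return payload
    return None

# Interleaved stream: dictionaries arrive just before the batches
# that need them, and the reader still parses everything.
stream = [
    ("schema", None),
    ("dictionary", 1),
    ("batch", {1}),
    ("dictionary", 2),
    ("batch", {1, 2}),
]
loaded = set()
assert load_next_batch(stream, loaded) == {1}
assert load_next_batch(stream, loaded) == {1, 2}
assert load_next_batch(stream, loaded) is None
```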

In the future, I think we should also support writing interleaved dictionaries 
and batches in the IPC stream (I created an issue to track this [4]), but it is 
not yet clear how to implement this.
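
One possible writer-side strategy — purely a sketch of an assumption on my part, not a design taken from [4] — is to emit each dictionary batch lazily, just before the first record batch that references it (delta dictionaries aside):

```python
def write_stream(batches, dictionaries):
    """Emit each dictionary just before the first record batch that
    references it. `batches` is a list of sets of dictionary ids each
    record batch references; `dictionaries` maps id -> values.
    (Simplified hypothetical model; not the Arrow writer API.)"""
    out = [("schema", None)]
    written = set()
    for refs in batches:
        for dict_id in sorted(refs - written):  # emit only unseen dictionaries
            out.append(("dictionary", dictionaries[dict_id]))
            written.add(dict_id)
        out.append(("batch", refs))
    return out

# Reproduces case i: the all-null batch comes first, and the
# dictionary is only written when the second batch needs it.
msgs = write_stream([set(), {0}], {0: ['abc', 'efg']})
assert msgs == [
    ("schema", None),
    ("batch", set()),
    ("dictionary", ['abc', 'efg']),
    ("batch", {0}),
]
```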
Any opinions about this are appreciated, thanks!

Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-6040
[2] https://issues.apache.org/jira/browse/ARROW-6126
[3] http://arrow.apache.org/docs/format/IPC.html#streaming-format
[4] https://issues.apache.org/jira/browse/ARROW-6308
