Hi all,

Recently, while fixing an IPC-related bug on both the Java and C++ sides [1][2], @emkornfield found that the stream reader assumes all dictionaries appear at the start of the stream. This is inconsistent with the spec [3], which says dictionary batches and record batches may be interleaved, as long as a record batch does not reference a dictionary that has not yet appeared in the stream.
The cases below should be supported, but they crash with the current implementations.

i. A record batch with one dictionary-encoded column S:
  1. Schema
  2. RecordBatch: S = [null, null, null, null]
  3. DictionaryBatch: ['abc', 'efg']
  4. RecordBatch: S = [0, 1, 0, 1]

ii. A record batch with two dictionary-encoded columns S1, S2:
  1. Schema
  2. DictionaryBatch S1: ['ab', 'cd']
  3. RecordBatch: S1 = [0, 1, 0, 1], S2 = [null, null, null, null]
  4. DictionaryBatch S2: ['cc', 'dd']
  5. RecordBatch: S1 = [0, 1, 0, 1], S2 = [0, 1, 0, 1]

We already did some work on the Java side via [1] to make it possible to parse interleaved dictionaries and batches:

  i.   In ArrowStreamReader, do not read all dictionaries at the start.
  ii.  In loadNextBatch, read the next message to decide whether to read dictionaries first or read a batch directly; in the former case, read all pending dictionaries before that batch.
  iii. When reading a batch, check whether the dictionaries it references have already been read; if not, check whether the column is all-null and decide whether an exception needs to be thrown.

This way we can parse the stream correctly whether or not dictionaries and batches are interleaved.

In the future, I think we should also support writing interleaved dictionaries and batches in the IPC stream (I created an issue to track this [4]), but it is not yet clear how to implement that. Any opinions on this are appreciated, thanks!

Thanks,
Ji Liu

[1] https://issues.apache.org/jira/browse/ARROW-6040
[2] https://issues.apache.org/jira/browse/ARROW-6126
[3] http://arrow.apache.org/docs/format/IPC.html#streaming-format
[4] https://issues.apache.org/jira/browse/ARROW-6308
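For illustration, the reader-side logic described above (don't preload dictionaries; load them lazily while looking for the next record batch; verify references when a batch is read) can be sketched as below. This is a simplified, self-contained model, not the actual ArrowStreamReader code: the Message/DictionaryBatchMsg/RecordBatchMsg types are hypothetical stand-ins for Arrow's real IPC message framing, and an all-null encoded column is modeled as referencing no dictionary (a null id).

```java
import java.util.*;

// Minimal sketch of reading a stream with interleaved dictionaries.
// All types here are hypothetical stand-ins, not Arrow library classes.
public class InterleavedStreamSketch {

    interface Message {}

    static final class DictionaryBatchMsg implements Message {
        final long dictionaryId;
        DictionaryBatchMsg(long id) { this.dictionaryId = id; }
    }

    static final class RecordBatchMsg implements Message {
        // For each column: the dictionary id it references, or null if the
        // column is not encoded / is all-null and needs no dictionary yet.
        final List<Long> referencedDictionaryIds;
        RecordBatchMsg(List<Long> ids) { this.referencedDictionaryIds = ids; }
    }

    private final Iterator<Message> stream;
    private final Set<Long> loadedDictionaries = new HashSet<>();

    InterleavedStreamSketch(List<Message> messages) {
        this.stream = messages.iterator();
    }

    /**
     * Read messages until the next record batch, loading any dictionary
     * batches encountered on the way (instead of assuming all dictionaries
     * sit at the start of the stream). Returns null at end of stream.
     */
    RecordBatchMsg loadNextBatch() {
        while (stream.hasNext()) {
            Message msg = stream.next();
            if (msg instanceof DictionaryBatchMsg) {
                loadedDictionaries.add(((DictionaryBatchMsg) msg).dictionaryId);
            } else {
                RecordBatchMsg batch = (RecordBatchMsg) msg;
                // Every dictionary the batch references must already have
                // appeared earlier in the stream; otherwise fail.
                for (Long id : batch.referencedDictionaryIds) {
                    if (id != null && !loadedDictionaries.contains(id)) {
                        throw new IllegalStateException(
                            "Record batch references dictionary " + id
                            + " that has not appeared in the stream yet");
                    }
                }
                return batch;
            }
        }
        return null;
    }
}
```

Under this model, case i above is the message list {RecordBatch(all-null S), DictionaryBatch 0, RecordBatch(S referencing 0)}, and both loadNextBatch calls succeed; swapping the first two messages of case ii would make the first call throw.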