Hi Ji Liu,
Thanks for getting the conversation started.  I think a few things need to
happen:
1.  We need to clarify in the specification that not all dictionaries need
to be present at the beginning of the stream.  I plan on creating a PR for
discussion that clarifies this point, as well as the handling of non-delta
dictionary batches discussed earlier [1].
2.  Java needs to support both of these once we vote on the
clarification.  One way of doing this is what I think Jacques alluded to in
[1]: for the in-memory representation of dictionary encoding we will need
to track not only a notion of "dictionary" id for the Vector, but also an
additional piece of state identifying which dictionary batch the encoded
values refer to (see the rough sketch after this list).
3.  Lastly, we should have integration tests around these cases to make
sure they are handled consistently.
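To make the second point concrete, here is a minimal, purely hypothetical
sketch (the class and field names are invented for illustration and are not
part of the existing Java API) of carrying an extra piece of dictionary
state alongside the dictionary id for each encoded vector:

import org.apache.arrow.vector.FieldVector;

// Hypothetical holder (illustration only): pairs a dictionary-encoded index
// vector with its dictionary id and a marker for which dictionary batch
// (initial or replacement) the indices were encoded against.
final class EncodedColumn {
  final FieldVector indices;       // the dictionary-encoded index vector
  final long dictionaryId;         // id from the DictionaryEncoding in the schema
  final long dictionaryBatchSeq;   // invented field: which dictionary batch these indices refer to

  EncodedColumn(FieldVector indices, long dictionaryId, long dictionaryBatchSeq) {
    this.indices = indices;
    this.dictionaryId = dictionaryId;
    this.dictionaryBatchSeq = dictionaryBatchSeq;
  }
}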

Thanks,
Micah

[1]
https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E

On Wed, Aug 21, 2019 at 7:48 AM Ji Liu <niki...@aliyun.com> wrote:

> Hi all,
>
> Recently, when we worked on fixing an IPC-related bug on both the Java and
> C++ sides [1][2], @emkornfield found that the stream reader assumes that
> all dictionaries are at the start of the stream, which is inconsistent
> with the spec [3]: dictionary batches and record batches can be
> interleaved, as long as a record batch doesn't reference a dictionary that
> has not yet been seen.
>
>
> The cases below should be supported; however, they crash with the current
> implementations (a consumer-side reading sketch follows the two cases).
> i. A stream with one dictionary-encoded column S:
>  1> Schema
>  2> RecordBatch: S = [null, null, null, null]
>  3> DictionaryBatch: ['abc', 'efg']
>  4> RecordBatch: S = [0, 1, 0, 1]
> ii. A stream with two dictionary-encoded columns S1, S2:
>  1> Schema
>  2> DictionaryBatch S1: ['ab', 'cd']
>  3> RecordBatch: S1 = [0, 1, 0, 1], S2 = [null, null, null, null]
>  4> DictionaryBatch S2: ['cc', 'dd']
>  5> RecordBatch: S1 = [0, 1, 0, 1], S2 = [0, 1, 0, 1]
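>
> To make the consumer side concrete, a minimal reading sketch could look
> like the following (assuming the reader tolerates dictionaries arriving
> mid-stream; skipping the decode when a dictionary has not been seen yet,
> as in case i, is just one possible choice):
>
> import java.io.InputStream;
> import org.apache.arrow.memory.BufferAllocator;
> import org.apache.arrow.memory.RootAllocator;
> import org.apache.arrow.vector.FieldVector;
> import org.apache.arrow.vector.ValueVector;
> import org.apache.arrow.vector.VectorSchemaRoot;
> import org.apache.arrow.vector.dictionary.Dictionary;
> import org.apache.arrow.vector.dictionary.DictionaryEncoder;
> import org.apache.arrow.vector.ipc.ArrowStreamReader;
>
> public class ReadInterleaved {
>   public static void main(String[] args) throws Exception {
>     try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>          InputStream in = System.in;  // stand-in for the actual stream source
>          ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
>       VectorSchemaRoot root = reader.getVectorSchemaRoot();
>       while (reader.loadNextBatch()) {
>         for (FieldVector vector : root.getFieldVectors()) {
>           if (vector.getField().getDictionary() == null) {
>             continue;  // not a dictionary-encoded column
>           }
>           long dictId = vector.getField().getDictionary().getId();
>           Dictionary dictionary = reader.getDictionaryVectors().get(dictId);
>           if (dictionary == null) {
>             continue;  // e.g. case i, first batch: all-null indices, no dictionary yet
>           }
>           try (ValueVector decoded = DictionaryEncoder.decode(vector, dictionary)) {
>             System.out.println(decoded);
>           }
>         }
>       }
>     }
>   }
> }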
>
> We already did some work on the Java side via [1] to make it possible to
> parse interleaved dictionaries and batches, roughly as in the skeleton
> after this list:
>  i. In ArrowStreamReader, do not read all dictionaries at the start.
>  ii. When loadNextBatch is called, read the next message to decide whether
> to read dictionaries first or to read a batch directly; if the former,
> read all the dictionaries that appear before this batch.
>  iii. When we read a batch, check whether the dictionaries it needs have
> already been read; if not, check whether the column is entirely null and
> decide whether an exception needs to be thrown.
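>
> In code, the flow is roughly the following simplified skeleton (the method
> names here are invented and are not the actual ArrowStreamReader
> internals):
>
> import java.io.IOException;
> import java.util.Set;
>
> // Hypothetical skeleton of the loadNextBatch flow described in steps i-iii.
> abstract class InterleavedReaderSketch {
>   // Minimal stand-ins for the IPC message kinds that matter here.
>   enum MessageKind { DICTIONARY_BATCH, RECORD_BATCH, END_OF_STREAM }
>
>   protected abstract MessageKind peekNextMessage() throws IOException;
>   protected abstract void loadDictionaryBatch() throws IOException;  // register/replace dictionary by id
>   protected abstract void loadRecordBatch() throws IOException;
>   protected abstract Set<Long> referencedDictionaryIds();            // ids used by the batch just loaded
>   protected abstract Set<Long> loadedDictionaryIds();                // ids of dictionaries seen so far
>   protected abstract boolean columnIsAllNull(long dictionaryId);     // indices for that id are all null
>
>   public boolean loadNextBatch() throws IOException {
>     // Steps i/ii: do not expect all dictionaries up front; consume any that
>     // appear before the next record batch.
>     MessageKind next = peekNextMessage();
>     while (next == MessageKind.DICTIONARY_BATCH) {
>       loadDictionaryBatch();
>       next = peekNextMessage();
>     }
>     if (next == MessageKind.END_OF_STREAM) {
>       return false;
>     }
>     loadRecordBatch();
>     // Step iii: a referenced dictionary may be missing only if the column is all null.
>     for (long id : referencedDictionaryIds()) {
>       if (!loadedDictionaryIds().contains(id) && !columnIsAllNull(id)) {
>         throw new IOException("Record batch references dictionary " + id
>             + " before any dictionary batch with that id was seen");
>       }
>     }
>     return true;
>   }
> }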
>
> In this way, whether or not dictionaries and batches are interleaved, we
> can parse the stream properly.
>
> In the future, I think we should also support writing interleaved
> dictionaries and batches in the IPC stream (I created an issue to track
> this [4]), but it's not quite clear how to implement this; one rough idea
> is sketched below.
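>
> The following is only a hypothetical skeleton of one such idea (the method
> names are invented, not an existing Arrow API): emit a dictionary just
> before the first record batch that references it, and emit it again
> whenever it has been replaced since it was last written.
>
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical writer-side skeleton: track which dictionary version has
> // been written for each id, and write new/changed dictionaries lazily,
> // right before the record batch that needs them.
> abstract class InterleavedWriterSketch {
>   private final Map<Long, Long> writtenVersions = new HashMap<>();
>
>   // Invented helpers: id -> version of every dictionary the next batch references.
>   protected abstract Map<Long, Long> referencedDictionaryVersions();
>   protected abstract void writeDictionaryBatch(long dictionaryId) throws IOException;
>   protected abstract void writeRecordBatchMessage() throws IOException;
>
>   public void writeBatch() throws IOException {
>     for (Map.Entry<Long, Long> e : referencedDictionaryVersions().entrySet()) {
>       Long written = writtenVersions.get(e.getKey());
>       if (written == null || !written.equals(e.getValue())) {
>         writeDictionaryBatch(e.getKey());       // new or replaced dictionary
>         writtenVersions.put(e.getKey(), e.getValue());
>       }
>     }
>     writeRecordBatchMessage();
>   }
> }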
> Any opinions about this are appreciated, thanks!
>
> Thanks,
> Ji Liu
>
> [1] https://issues.apache.org/jira/browse/ARROW-6040
> [2] https://issues.apache.org/jira/browse/ARROW-6126
> [3] http://arrow.apache.org/docs/format/IPC.html#streaming-format
> [4] https://issues.apache.org/jira/browse/ARROW-6308
>
