hi all,

I agree with handling the interleaved dictionary case. While we are at
it, I think we should formally allow dictionary replacements.

Note that I just opened

https://github.com/apache/arrow/pull/5202

to revamp / consolidate the format documents. So any changes will need
to be based on that.

One problem with our integration testing strategy -- the JSON files
don't exactly simulate a stream, they are more like the current File
format. If we want to properly generate integration test cases, we
should also slightly generalize the integration test JSON format to
contain a collection of messages like

[
  {"message_type": "Schema", ... },
  {"message_type": "DictionaryBatch", ... },
  {"message_type": "RecordBatch", ... },
  {"message_type": "RecordBatch", ... },
  {"message_type": "DictionaryBatch", ... },
  ...
]

I don't know that it will be all that difficult to make this change
but we would have 4 implementations that have to get fixed up.

- Wes

On Wed, Aug 21, 2019 at 10:58 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Hi Ji Liu,
> Thanks for getting the conversation started.  I think a few things need to
> happen:
> 1.  We need to clarify in the specification that not all dictionaries need
> to be present at the beginning.  I plan on creating a PR for discussion
> that clarifies this point, as well as handling of non-delta dictionary
> batches discussed earlier [1].
> 2.  Java needs to support both of these once we vote on the
> clarification. One way of doing this, is what I think Jacques alluded to in
> [1].  For the in memory representation of dictionary encoding we will need
> to not only track a notion of "dictionary" id for the Vector, but also an
> addition
> 3.  Lastly, we should have integration tests around these cases to make
> sure they are handled consistently.
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
>
> On Wed, Aug 21, 2019 at 7:48 AM Ji Liu <niki...@aliyun.com> wrote:
>
> > Hi all,
> >
> > Recently when we worked on fixing a IPC related bug in both Java/C++ 
> > sides[1][2],
> > @emkornfieldfound that the stream reader assumes that all dictionaries
> > are at the start of the stream which is inconsistent with  spec[3]
> > which says as long as a record batch doesn't reference a dictionary they 
> > can be interleaved.
> >
> >
> > Cases below should be supported, however they will crash at current 
> > implementations.
> > i. have a record batch of one dictionary encoded column S
> >  1>Schema
> >  2>RecordBatch: S=[null, null, null, null]
> >  3>DictionaryBatch: ['abc', 'efg']
> >  4>Recordbatch: S=[0, 1, 0, 1]
> > ii. have a record batch of two dictionary encoded column S1, S2
> >  1>Schema
> >  2->DictionaryBatch S1: ['ab', 'cd']
> >  3->RecordBatch: S1 = [0,1,0,1] S2 =[null, null, null,]
> >  4->DictionaryBatch S2: ['cc', 'dd']
> >  5->RecordBatch: S1 = [0,1,0,1] S2 =[0,1,0,1]
> >
> > We already did some work on Java side via[1]
> > to make it possible to parse interleaved dictionaries and batches:
> >  i. In ArrowStreamReader, do not read all dictionaries at the start
> >  ii.
> > When call loadNextBatch, we read message to decide read dictionaries first 
> > or directly read a batch, if former, read all dictionaries out before this 
> > batch.
> >
> >  iii.When we read a batch, we check if the dictionaries it needed has 
> > already been read, if not, check if it's all null column and decide whether 
> > need throw exception.
> >
> > In this way, whatever they are interleaved or not, we can parse it properly.
> >
> > In the future, I think we should also support write interleaved 
> > dictionaries and batches in IPC stream(created an
> > issue to track this[4]), but not quite clear how to implement this.
> > Any opinions about this are appreciated, thanks!
> >
> > Thanks,
> > Ji Liu
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-6040,
> > [2] https://issues.apache.org/jira/browse/ARROW-6126,
> > [3] http://arrow.apache.org/docs/format/IPC.html#streaming-format,
> > [4]https://issues.apache.org/jira/browse/ARROW-6308
> >

Reply via email to