Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Wes McKinney Wed, 06 Nov 2019 09:25:32 -0800

Just bumping this thread for more comments


On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney <[email protected]> wrote:
>
> Returning to this discussion as there seems to lack consensus in the vote 
> thread
>
> Copying Micah's proposals in the VOTE thread here, I wanted to state
> my opinions so we can discuss further and see where there is potential
> disagreement
>
> 1.  It is not required that all dictionary batches occur at the beginning
> of the IPC stream format (if a the first record batch has an all null
> dictionary encoded column, the null column's dictionary might not be sent
> until later in the stream).
>
> This seems preferable to requiring a placeholder empty dictionary
> batch. This does mean more to test but the integration tests will
> force the issue
>
> 2.  A second dictionary batch for the same ID that is not a "delta batch"
> in an IPC stream indicates the dictionary should be replaced.
>
> Agree.
>
> 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> dictionary batch and multiple "delta" dictionary batches.
>
> Agree -- it is also worth stating explicitly that dictionary
> replacements are not allowed in the file format.
>
> In the file format, all the dictionaries must be "loaded" up front.
> The code path for loading the dictionaries ideally should use nearly
> the same code as the stream-reader code that sees follow-up dictionary
> batches interspersed in the stream. The only downside is that it will
> not be possible to exactly preserve the dictionary "state" as of each
> record batch being written.
>
> So if we had a file containing
>
> DICTIONARY ID=0
> RECORD BATCH
> RECORD BATCH
> DICTIONARY DELTA ID=0
> RECORD BATCH
> RECORD BATCH
>
> Then after processing/loading the dictionaries, the first two record
> batches will have a dictionary that is "larger" (on account of the
> delta) than when they were written. Since dictionaries are
> fundamentally about data representation, they still represent the same
> data so I think this is acceptable.
>
> 4.  Add an enum to dictionary metadata for possible future changes in what
> format dictionary batches can be sent. (the most likely would be an array
> Map<Int, Value>).  An enum is needed as a place holder to allow for forward
> compatibility past the release 1.0.0.
>
> I'm least sure about this but I do not think it is harmful to have a
> forward-compatible "escape hatch" for future evolutions in dictionary
> encoding.
>
> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <[email protected]> wrote:
> >
> > I'll plan on starting a vote in the next day or two if there are no further
> > objections/comments.
> >
> > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > I think the only point asked on the PR that I think is worth discussing is
> > > assumptions about dictionaries at the beginning of streams.
> > >
> > > There are two options:
> > > 1.  Based on the current wording, it does not seem that all dictionaries
> > > need to be at the beginning of the stream if they aren't made use of in 
> > > the
> > > first record batch (i.e. a dictionary encoded column is all null in the
> > > first record batch).
> > > 2.  We require a dictionary batch for each dictionary at the beginning of
> > > the stream (and require implementations to send an empty batch if they
> > > don't have the dictionary available).
> > >
> > > The current proposal in the PR is option #1.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > >> I've opened a pull request [1] to clarify some recent conversations about
> > >> semantics/edge cases for dictionary encoding [2][3] around interleaved
> > >> batches and when isDelta=False.
> > >>
> > >> Specifically, it proposes isDelta=False indicates dictionary
> > >> replacement.  For the file format, only one isDelta=False batch is 
> > >> allowed
> > >> per file and isDelta=true batches are applied in the order supplied file
> > >> footer.
> > >>
> > >> In addition, I've added a new enum to DictionaryEncoding to preserve
> > >> future compatibility in case we want to expand dictionary encoding to be 
> > >> an
> > >> explicit mapping from "ID" to "VALUE" as discussed in [4].
> > >>
> > >> Once people have had a change to review and come to a consensus. I will
> > >> call a formal vote to approve the change commit the change.
> > >>
> > >> Thanks,
> > >> Micah
> > >>
> > >> [1] https://github.com/apache/arrow/pull/5585
> > >> [2]
> > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> > >> [3]
> > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> > >> [4]
> > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> > >>
> > >>

Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Reply via email to