Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Micah Kornfield Thu, 14 Nov 2019 08:55:07 -0800

Ok, anything else do discuss?  Otherwise I'll plan on a new vote with the
original language + an explicit call-out that dictionary replacement isn't
supported for the file format in the PR


On Thursday, November 14, 2019, Antoine Pitrou <anto...@python.org> wrote:

>
> Right.  The dictionaries can be found from the file footer, so it seems ok.
>
> Thank you
>
> Regards
>
> Antoine.
>
>
> Le 14/11/2019 à 07:11, Micah Kornfield a écrit :
> > I'll add for:
> >
> > If so, how does this play with the fact that there potentially are delta
> >> dictionaries in the "stream"?
> >
> > That in this case the important feature is the dictionary batches have an
> > explicit ordering in the file format based on metadata.  So their
> ordering
> > in the "stream" is largely irrelevant.  As Wes pointed out the most
> > convenient implementation for this would have to load all dictionary
> > batches before doing random access (and would be very similar to the
> stream
> > code).
> >
> > Does this make sense?
> >
> >
> > On Tue, Nov 12, 2019 at 2:01 PM Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> >> Hi Antoine,
> >>
> >> Each *record batch* is intended to be readable in random order. To read
> any
> >> record batch requires loading the dictionaries indicated in the schema,
> so
> >> appending the deltas as part of this process does not seem like it would
> >> introduce hardship given that such logic is needed to properly handle
> the
> >> stream format. Dictionary replacements in the file format (at least as
> >> currently conceived) does not seem possible.
> >>
> >>
> >> On Tue, Nov 12, 2019, 10:13 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >>
> >>>
> >>> Hi,
> >>>
> >>> Sorry for the delay.
> >>>
> >>> My high-level question is the following:  is the file format intended
> to
> >>> be readable in random order (rather than having to read through it in
> >>> sequence as with the stream format)?  If so, how does this play with
> the
> >>> fact that there potentially are delta dictionaries in the "stream"?
> >>>
> >>> Regards
> >>>
> >>> Antoine.
> >>>
> >>>
> >>> Le 30/10/2019 à 21:11, Wes McKinney a écrit :
> >>>> Returning to this discussion as there seems to lack consensus in the
> >>> vote thread
> >>>>
> >>>> Copying Micah's proposals in the VOTE thread here, I wanted to state
> >>>> my opinions so we can discuss further and see where there is potential
> >>>> disagreement
> >>>>
> >>>> 1.  It is not required that all dictionary batches occur at the
> >> beginning
> >>>> of the IPC stream format (if a the first record batch has an all null
> >>>> dictionary encoded column, the null column's dictionary might not be
> >> sent
> >>>> until later in the stream).
> >>>>
> >>>> This seems preferable to requiring a placeholder empty dictionary
> >>>> batch. This does mean more to test but the integration tests will
> >>>> force the issue
> >>>>
> >>>> 2.  A second dictionary batch for the same ID that is not a "delta
> >> batch"
> >>>> in an IPC stream indicates the dictionary should be replaced.
> >>>>
> >>>> Agree.
> >>>>
> >>>> 3.  Clarifies that the file format, can only contain 1 "NON-delta"
> >>>> dictionary batch and multiple "delta" dictionary batches.
> >>>>
> >>>> Agree -- it is also worth stating explicitly that dictionary
> >>>> replacements are not allowed in the file format.
> >>>>
> >>>> In the file format, all the dictionaries must be "loaded" up front.
> >>>> The code path for loading the dictionaries ideally should use nearly
> >>>> the same code as the stream-reader code that sees follow-up dictionary
> >>>> batches interspersed in the stream. The only downside is that it will
> >>>> not be possible to exactly preserve the dictionary "state" as of each
> >>>> record batch being written.
> >>>>
> >>>> So if we had a file containing
> >>>>
> >>>> DICTIONARY ID=0
> >>>> RECORD BATCH
> >>>> RECORD BATCH
> >>>> DICTIONARY DELTA ID=0
> >>>> RECORD BATCH
> >>>> RECORD BATCH
> >>>>
> >>>> Then after processing/loading the dictionaries, the first two record
> >>>> batches will have a dictionary that is "larger" (on account of the
> >>>> delta) than when they were written. Since dictionaries are
> >>>> fundamentally about data representation, they still represent the same
> >>>> data so I think this is acceptable.
> >>>>
> >>>> 4.  Add an enum to dictionary metadata for possible future changes in
> >>> what
> >>>> format dictionary batches can be sent. (the most likely would be an
> >> array
> >>>> Map<Int, Value>).  An enum is needed as a place holder to allow for
> >>> forward
> >>>> compatibility past the release 1.0.0.
> >>>>
> >>>> I'm least sure about this but I do not think it is harmful to have a
> >>>> forward-compatible "escape hatch" for future evolutions in dictionary
> >>>> encoding.
> >>>>
> >>>> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <
> emkornfi...@gmail.com
> >>>
> >>> wrote:
> >>>>>
> >>>>> I'll plan on starting a vote in the next day or two if there are no
> >>> further
> >>>>> objections/comments.
> >>>>>
> >>>>> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <
> >> emkornfi...@gmail.com
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> I think the only point asked on the PR that I think is worth
> >>> discussing is
> >>>>>> assumptions about dictionaries at the beginning of streams.
> >>>>>>
> >>>>>> There are two options:
> >>>>>> 1.  Based on the current wording, it does not seem that all
> >>> dictionaries
> >>>>>> need to be at the beginning of the stream if they aren't made use of
> >>> in the
> >>>>>> first record batch (i.e. a dictionary encoded column is all null in
> >> the
> >>>>>> first record batch).
> >>>>>> 2.  We require a dictionary batch for each dictionary at the
> >> beginning
> >>> of
> >>>>>> the stream (and require implementations to send an empty batch if
> >> they
> >>>>>> don't have the dictionary available).
> >>>>>>
> >>>>>> The current proposal in the PR is option #1.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Micah
> >>>>>>
> >>>>>> On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <
> >> emkornfi...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I've opened a pull request [1] to clarify some recent conversations
> >>> about
> >>>>>>> semantics/edge cases for dictionary encoding [2][3] around
> >> interleaved
> >>>>>>> batches and when isDelta=False.
> >>>>>>>
> >>>>>>> Specifically, it proposes isDelta=False indicates dictionary
> >>>>>>> replacement.  For the file format, only one isDelta=False batch is
> >>> allowed
> >>>>>>> per file and isDelta=true batches are applied in the order supplied
> >>> file
> >>>>>>> footer.
> >>>>>>>
> >>>>>>> In addition, I've added a new enum to DictionaryEncoding to
> preserve
> >>>>>>> future compatibility in case we want to expand dictionary encoding
> >> to
> >>> be an
> >>>>>>> explicit mapping from "ID" to "VALUE" as discussed in [4].
> >>>>>>>
> >>>>>>> Once people have had a change to review and come to a consensus. I
> >>> will
> >>>>>>> call a formal vote to approve the change commit the change.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Micah
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/arrow/pull/5585
> >>>>>>> [2]
> >>>>>>>
> >>>
> >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105
> bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E
> >>>>>>> [3]
> >>>>>>>
> >>>
> >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8a
> da0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E
> >>>>>>> [4]
> >>>>>>>
> >>>
> >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d
> 9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E
> >>>>>>>
> >>>>>>>
> >>>
> >>
> >
>

Re: [DISCUSS] Dictionary Encoding Clarifications/Future Proofing

Reply via email to