Ok, anything else do discuss? Otherwise I'll plan on a new vote with the original language + an explicit call-out that dictionary replacement isn't supported for the file format in the PR
On Thursday, November 14, 2019, Antoine Pitrou <anto...@python.org> wrote: > > Right. The dictionaries can be found from the file footer, so it seems ok. > > Thank you > > Regards > > Antoine. > > > Le 14/11/2019 à 07:11, Micah Kornfield a écrit : > > I'll add for: > > > > If so, how does this play with the fact that there potentially are delta > >> dictionaries in the "stream"? > > > > That in this case the important feature is the dictionary batches have an > > explicit ordering in the file format based on metadata. So their > ordering > > in the "stream" is largely irrelevant. As Wes pointed out the most > > convenient implementation for this would have to load all dictionary > > batches before doing random access (and would be very similar to the > stream > > code). > > > > Does this make sense? > > > > > > On Tue, Nov 12, 2019 at 2:01 PM Wes McKinney <wesmck...@gmail.com> > wrote: > > > >> Hi Antoine, > >> > >> Each *record batch* is intended to be readable in random order. To read > any > >> record batch requires loading the dictionaries indicated in the schema, > so > >> appending the deltas as part of this process does not seem like it would > >> introduce hardship given that such logic is needed to properly handle > the > >> stream format. Dictionary replacements in the file format (at least as > >> currently conceived) does not seem possible. > >> > >> > >> On Tue, Nov 12, 2019, 10:13 AM Antoine Pitrou <anto...@python.org> > wrote: > >> > >>> > >>> Hi, > >>> > >>> Sorry for the delay. > >>> > >>> My high-level question is the following: is the file format intended > to > >>> be readable in random order (rather than having to read through it in > >>> sequence as with the stream format)? If so, how does this play with > the > >>> fact that there potentially are delta dictionaries in the "stream"? > >>> > >>> Regards > >>> > >>> Antoine. > >>> > >>> > >>> Le 30/10/2019 à 21:11, Wes McKinney a écrit : > >>>> Returning to this discussion as there seems to lack consensus in the > >>> vote thread > >>>> > >>>> Copying Micah's proposals in the VOTE thread here, I wanted to state > >>>> my opinions so we can discuss further and see where there is potential > >>>> disagreement > >>>> > >>>> 1. It is not required that all dictionary batches occur at the > >> beginning > >>>> of the IPC stream format (if a the first record batch has an all null > >>>> dictionary encoded column, the null column's dictionary might not be > >> sent > >>>> until later in the stream). > >>>> > >>>> This seems preferable to requiring a placeholder empty dictionary > >>>> batch. This does mean more to test but the integration tests will > >>>> force the issue > >>>> > >>>> 2. A second dictionary batch for the same ID that is not a "delta > >> batch" > >>>> in an IPC stream indicates the dictionary should be replaced. > >>>> > >>>> Agree. > >>>> > >>>> 3. Clarifies that the file format, can only contain 1 "NON-delta" > >>>> dictionary batch and multiple "delta" dictionary batches. > >>>> > >>>> Agree -- it is also worth stating explicitly that dictionary > >>>> replacements are not allowed in the file format. > >>>> > >>>> In the file format, all the dictionaries must be "loaded" up front. > >>>> The code path for loading the dictionaries ideally should use nearly > >>>> the same code as the stream-reader code that sees follow-up dictionary > >>>> batches interspersed in the stream. The only downside is that it will > >>>> not be possible to exactly preserve the dictionary "state" as of each > >>>> record batch being written. > >>>> > >>>> So if we had a file containing > >>>> > >>>> DICTIONARY ID=0 > >>>> RECORD BATCH > >>>> RECORD BATCH > >>>> DICTIONARY DELTA ID=0 > >>>> RECORD BATCH > >>>> RECORD BATCH > >>>> > >>>> Then after processing/loading the dictionaries, the first two record > >>>> batches will have a dictionary that is "larger" (on account of the > >>>> delta) than when they were written. Since dictionaries are > >>>> fundamentally about data representation, they still represent the same > >>>> data so I think this is acceptable. > >>>> > >>>> 4. Add an enum to dictionary metadata for possible future changes in > >>> what > >>>> format dictionary batches can be sent. (the most likely would be an > >> array > >>>> Map<Int, Value>). An enum is needed as a place holder to allow for > >>> forward > >>>> compatibility past the release 1.0.0. > >>>> > >>>> I'm least sure about this but I do not think it is harmful to have a > >>>> forward-compatible "escape hatch" for future evolutions in dictionary > >>>> encoding. > >>>> > >>>> On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield < > emkornfi...@gmail.com > >>> > >>> wrote: > >>>>> > >>>>> I'll plan on starting a vote in the next day or two if there are no > >>> further > >>>>> objections/comments. > >>>>> > >>>>> On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield < > >> emkornfi...@gmail.com > >>>> > >>>>> wrote: > >>>>> > >>>>>> I think the only point asked on the PR that I think is worth > >>> discussing is > >>>>>> assumptions about dictionaries at the beginning of streams. > >>>>>> > >>>>>> There are two options: > >>>>>> 1. Based on the current wording, it does not seem that all > >>> dictionaries > >>>>>> need to be at the beginning of the stream if they aren't made use of > >>> in the > >>>>>> first record batch (i.e. a dictionary encoded column is all null in > >> the > >>>>>> first record batch). > >>>>>> 2. We require a dictionary batch for each dictionary at the > >> beginning > >>> of > >>>>>> the stream (and require implementations to send an empty batch if > >> they > >>>>>> don't have the dictionary available). > >>>>>> > >>>>>> The current proposal in the PR is option #1. > >>>>>> > >>>>>> Thanks, > >>>>>> Micah > >>>>>> > >>>>>> On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield < > >> emkornfi...@gmail.com> > >>>>>> wrote: > >>>>>> > >>>>>>> I've opened a pull request [1] to clarify some recent conversations > >>> about > >>>>>>> semantics/edge cases for dictionary encoding [2][3] around > >> interleaved > >>>>>>> batches and when isDelta=False. > >>>>>>> > >>>>>>> Specifically, it proposes isDelta=False indicates dictionary > >>>>>>> replacement. For the file format, only one isDelta=False batch is > >>> allowed > >>>>>>> per file and isDelta=true batches are applied in the order supplied > >>> file > >>>>>>> footer. > >>>>>>> > >>>>>>> In addition, I've added a new enum to DictionaryEncoding to > preserve > >>>>>>> future compatibility in case we want to expand dictionary encoding > >> to > >>> be an > >>>>>>> explicit mapping from "ID" to "VALUE" as discussed in [4]. > >>>>>>> > >>>>>>> Once people have had a change to review and come to a consensus. I > >>> will > >>>>>>> call a formal vote to approve the change commit the change. > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Micah > >>>>>>> > >>>>>>> [1] https://github.com/apache/arrow/pull/5585 > >>>>>>> [2] > >>>>>>> > >>> > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105 > bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > >>>>>>> [3] > >>>>>>> > >>> > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8a > da0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E > >>>>>>> [4] > >>>>>>> > >>> > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d > 9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E > >>>>>>> > >>>>>>> > >>> > >> > > >