Just bumping this thread for more comments
On Wed, Oct 30, 2019 at 3:11 PM Wes McKinney <wesmck...@gmail.com> wrote: > > Returning to this discussion as there seems to lack consensus in the vote > thread > > Copying Micah's proposals in the VOTE thread here, I wanted to state > my opinions so we can discuss further and see where there is potential > disagreement > > 1. It is not required that all dictionary batches occur at the beginning > of the IPC stream format (if a the first record batch has an all null > dictionary encoded column, the null column's dictionary might not be sent > until later in the stream). > > This seems preferable to requiring a placeholder empty dictionary > batch. This does mean more to test but the integration tests will > force the issue > > 2. A second dictionary batch for the same ID that is not a "delta batch" > in an IPC stream indicates the dictionary should be replaced. > > Agree. > > 3. Clarifies that the file format, can only contain 1 "NON-delta" > dictionary batch and multiple "delta" dictionary batches. > > Agree -- it is also worth stating explicitly that dictionary > replacements are not allowed in the file format. > > In the file format, all the dictionaries must be "loaded" up front. > The code path for loading the dictionaries ideally should use nearly > the same code as the stream-reader code that sees follow-up dictionary > batches interspersed in the stream. The only downside is that it will > not be possible to exactly preserve the dictionary "state" as of each > record batch being written. > > So if we had a file containing > > DICTIONARY ID=0 > RECORD BATCH > RECORD BATCH > DICTIONARY DELTA ID=0 > RECORD BATCH > RECORD BATCH > > Then after processing/loading the dictionaries, the first two record > batches will have a dictionary that is "larger" (on account of the > delta) than when they were written. Since dictionaries are > fundamentally about data representation, they still represent the same > data so I think this is acceptable. > > 4. Add an enum to dictionary metadata for possible future changes in what > format dictionary batches can be sent. (the most likely would be an array > Map<Int, Value>). An enum is needed as a place holder to allow for forward > compatibility past the release 1.0.0. > > I'm least sure about this but I do not think it is harmful to have a > forward-compatible "escape hatch" for future evolutions in dictionary > encoding. > > On Wed, Oct 16, 2019 at 2:57 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > > > > I'll plan on starting a vote in the next day or two if there are no further > > objections/comments. > > > > On Sun, Oct 13, 2019 at 11:06 AM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > > > I think the only point asked on the PR that I think is worth discussing is > > > assumptions about dictionaries at the beginning of streams. > > > > > > There are two options: > > > 1. Based on the current wording, it does not seem that all dictionaries > > > need to be at the beginning of the stream if they aren't made use of in > > > the > > > first record batch (i.e. a dictionary encoded column is all null in the > > > first record batch). > > > 2. We require a dictionary batch for each dictionary at the beginning of > > > the stream (and require implementations to send an empty batch if they > > > don't have the dictionary available). > > > > > > The current proposal in the PR is option #1. > > > > > > Thanks, > > > Micah > > > > > > On Sat, Oct 5, 2019 at 4:01 PM Micah Kornfield <emkornfi...@gmail.com> > > > wrote: > > > > > >> I've opened a pull request [1] to clarify some recent conversations about > > >> semantics/edge cases for dictionary encoding [2][3] around interleaved > > >> batches and when isDelta=False. > > >> > > >> Specifically, it proposes isDelta=False indicates dictionary > > >> replacement. For the file format, only one isDelta=False batch is > > >> allowed > > >> per file and isDelta=true batches are applied in the order supplied file > > >> footer. > > >> > > >> In addition, I've added a new enum to DictionaryEncoding to preserve > > >> future compatibility in case we want to expand dictionary encoding to be > > >> an > > >> explicit mapping from "ID" to "VALUE" as discussed in [4]. > > >> > > >> Once people have had a change to review and come to a consensus. I will > > >> call a formal vote to approve the change commit the change. > > >> > > >> Thanks, > > >> Micah > > >> > > >> [1] https://github.com/apache/arrow/pull/5585 > > >> [2] > > >> https://lists.apache.org/thread.html/9734b71bc12aca16eb997388e95105bff412fdaefa4e19422f477389@%3Cdev.arrow.apache.org%3E > > >> [3] > > >> https://lists.apache.org/thread.html/5c3c9346101df8d758e24664638e8ada0211d310ab756a89cde3786a@%3Cdev.arrow.apache.org%3E > > >> [4] > > >> https://lists.apache.org/thread.html/15a4810589b2eb772bce5b2372970d9d93badbd28999a1bbe2af418a@%3Cdev.arrow.apache.org%3E > > >> > > >>