On Tue, Aug 27, 2019 at 6:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I was thinking the file format must satisfy one of two conditions:
> 1. Exactly one dictionarybatch per encoded column
> 2. DictionaryBatches are interleaved correctly.
Could you clarify? In the first case, there is no issue with dictionary
replacements. I'm not sure about the second case -- if a dictionary id
appears twice, then you'll see it twice in the file footer. I suppose
you could look at the file offsets to determine whether a dictionary
batch precedes a particular record batch block (to know which
dictionary you should be using), but that's rather complicated to
implement. It might be better to disallow replacements in the file
format (which does introduce semantic slippage between the file and
stream formats as Antoine was saying).

>
> On Tuesday, August 27, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > On Tue, Aug 27, 2019 at 3:55 PM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > On 27/08/2019 at 22:31, Wes McKinney wrote:
> > > > So the current situation we have right now in C++ is that if we tried
> > > > to create an IPC stream from a sequence of record batches that don't
> > > > all have the same dictionary, we'd run into two scenarios:
> > > >
> > > > * Batches that either have a prefix of a prior-observed dictionary, or
> > > > the prior dictionary is a prefix of their dictionary. For example,
> > > > suppose that the dictionary sent for an id was ['A', 'B', 'C'] and
> > > > then there's a subsequent batch with ['A', 'B', 'C', 'D', 'E']. In
> > > > such case we could compute and send a delta batch
> > > >
> > > > * Batches with a dictionary that is a permutation of values, and
> > > > possibly new unique values.
> > > >
> > > > In this latter case, without the option of replacing an existing ID in
> > > > the stream, we would have to do a unification / permutation of indices
> > > > and then also possibly send a delta batch. We should probably have
> > > > code at some point that deals with both cases, but in the meantime I
> > > > would like to allow dictionaries to be redefined in this case. Seems
> > > > like we might need a vote to formalize this?
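
To make the two scenarios quoted above concrete, here is a minimal
Python sketch of the decision a stream writer would face. It is only an
illustration, not the C++ implementation: `classify_dictionary_update`
is a hypothetical helper, and dictionaries are modeled as plain Python
lists rather than Arrow arrays.

```python
def classify_dictionary_update(prior, new):
    """Compare the dictionary already sent for an id with a new one.

    Returns a (kind, values) pair:
      ("no_update", [])     -- new equals prior or is a prefix of it,
                               so existing indices stay valid
      ("delta", tail)       -- prior is a prefix of new; only the
                               appended tail needs to be sent
      ("replacement", new)  -- anything else (e.g. a permutation), which
                               needs a redefinition or index remapping
    """
    prior, new = list(prior), list(new)
    if len(new) <= len(prior) and prior[:len(new)] == new:
        return "no_update", []
    if new[:len(prior)] == prior:
        return "delta", new[len(prior):]
    return "replacement", new


# The example from the message: ['A', 'B', 'C'] followed by
# ['A', 'B', 'C', 'D', 'E'] yields a delta containing ['D', 'E'].
kind, values = classify_dictionary_update(
    ["A", "B", "C"], ["A", "B", "C", "D", "E"]
)
```

A permuted dictionary such as ['C', 'A', 'B', 'D'] falls into the
"replacement" branch, which is the case where redefining the id avoids
having to unify and remap indices.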
> > >
> > > Isn't the stream format deviating from the file format then? In the
> > > file format, IIUC, dictionaries can appear after the respective record
> > > batches, so there's no way to tell whether the original or redefined
> > > version of a dictionary is being referred to.
> >
> > You make a good point -- we can consider changes to the file format to
> > allow for record batches to have different dictionaries. Even handling
> > delta dictionaries with the current file format would be a bit tedious
> > (though not indeterminate)
> >
> > > Regards
> > >
> > > Antoine.
> >
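
For reference, the file-offset lookup mentioned at the top of this
message could look roughly like the sketch below. It assumes the
"interleaved correctly" condition, i.e. that each dictionary batch is
written before the record batches that use it, and it models the footer
as a sorted list of byte offsets per dictionary id. The function name
and that representation are hypothetical, not an existing Arrow API.

```python
from bisect import bisect_right


def resolve_dictionary_block(dictionary_offsets, record_batch_offset):
    """Pick the dictionary block that applies to a record batch.

    dictionary_offsets: sorted file offsets of every dictionary block
        recorded in the footer for a single dictionary id.
    record_batch_offset: file offset of the record batch block.

    Returns the index of the last dictionary block that starts before
    the record batch, or None if no dictionary precedes it.
    """
    pos = bisect_right(dictionary_offsets, record_batch_offset)
    return pos - 1 if pos else None


# Two definitions of the same id, at offsets 8 and 512 (made-up values).
offsets = [8, 512]
first = resolve_dictionary_block(offsets, 200)   # uses block 0
second = resolve_dictionary_block(offsets, 900)  # uses block 1
orphan = resolve_dictionary_block(offsets, 4)    # nothing precedes it
```

The lookup itself is short, but every reader would need it, and it
breaks down as soon as a dictionary is allowed to appear after the
batches that reference it.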