Re: [Format] Semantics for dictionary batches in streams

Micah Kornfield Sat, 10 Aug 2019 14:20:45 -0700

Reading data sequentially from two different Parquet files that have
different dictionaries for the same column.  This could be handled by
re-encoding the data, but that seems potentially sub-optimal.
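
To make the two readings of a repeated non-delta dictionary concrete, here
is a minimal sketch of reader-side bookkeeping.  It is plain Python and does
not use the Arrow API; DictionaryTracker, on_dictionary_batch and decode are
hypothetical names, and treating a delta for an unknown id as an append to
an empty dictionary is an assumption of the sketch, not something taken from
the specification.

```python
from typing import Dict, List


class DictionaryTracker:
    """Hypothetical reader-side state: dictionary id -> dictionary values."""

    def __init__(self, allow_replacement: bool) -> None:
        # True models option 1 (replace), False models option 2 (error).
        self.allow_replacement = allow_replacement
        self.dictionaries: Dict[int, List[str]] = {}

    def on_dictionary_batch(
        self, dict_id: int, values: List[str], is_delta: bool
    ) -> None:
        if is_delta:
            # Delta batch: append the new values to the existing dictionary.
            # Assumption: an unknown id starts from an empty dictionary.
            self.dictionaries.setdefault(dict_id, []).extend(values)
        elif dict_id in self.dictionaries and not self.allow_replacement:
            # Option 2: a second non-delta batch for the same id is an error.
            raise ValueError(f"dictionary id {dict_id} was already sent")
        else:
            # First batch for this id, or option 1: replace the old values.
            self.dictionaries[dict_id] = list(values)

    def decode(self, dict_id: int, indices: List[int]) -> List[str]:
        # Resolve indices against whichever dictionary is current for the id.
        current = self.dictionaries[dict_id]
        return [current[i] for i in indices]


# Two files with different dictionaries for the same column (id 0).
reader = DictionaryTracker(allow_replacement=True)
reader.on_dictionary_batch(0, ["a", "b"], is_delta=False)
first = reader.decode(0, [0, 1, 1])
reader.on_dictionary_batch(0, ["x", "y", "z"], is_delta=False)
second = reader.decode(0, [2, 0])
```

With replacement allowed, the indices of each record batch resolve against
the dictionary most recently sent for that id, so the second file's data
does not need to be re-encoded.  With allow_replacement=False the second
call raises instead.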


On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <[email protected]> wrote:

> What situation are you anticipating where you're going to be restating ids
> mid-stream?
>
> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <[email protected]>
> wrote:
>
>> The IPC specification [1] defines behavior when isDelta on a
>> DictionaryBatch [2] is "true".  I might have missed it in the
>> specification, but I couldn't find the interpretation for what the
>> expected behavior is when isDelta=false and two dictionary batches with
>> the same ID are sent.
>>
>> It seems like there are two options:
>> 1.  Interpret the new dictionary batch as replacing the old one.
>> 2.  Regard this as an error condition.
>>
>> Based on the fact that in the "file format" dictionaries are allowed to be
>> placed in any order relative to the record batches, I assume it is the
>> second, but just wanted to make sure.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72
>>
>
