Reading data from two different Parquet files sequentially, where the same column has a different dictionary in each file. This could be handled by re-encoding the data, but that seems potentially sub-optimal.
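
To make the re-encoding cost concrete, here is a minimal sketch in plain Python. It is not Arrow's implementation: the function name `unify_dictionaries` and the list-based representation of dictionaries and indices are assumptions made for illustration only. It keeps the first file's dictionary, appends any new values from the second, and rewrites the second file's indices.

```python
def unify_dictionaries(dict_a, indices_a, dict_b, indices_b):
    """Re-encode two dictionary-encoded chunks against one shared dictionary.

    Hypothetical helper for illustration; dictionaries are lists of values
    and indices are lists of ints (None marks a null slot).
    """
    unified = list(dict_a)
    position = {value: i for i, value in enumerate(unified)}

    # Map each entry of the second dictionary to its slot in the unified
    # dictionary, appending values the first dictionary did not contain.
    remap = []
    for value in dict_b:
        if value not in position:
            position[value] = len(unified)
            unified.append(value)
        remap.append(position[value])

    # The first chunk's indices remain valid as-is; every index in the
    # second chunk has to be rewritten through the remap table.
    new_indices_b = [None if i is None else remap[i] for i in indices_b]
    return unified, list(indices_a), new_indices_b


unified, idx_a, idx_b = unify_dictionaries(
    ["red", "green"], [0, 1, 0],
    ["green", "blue"], [1, 0, None],
)
print(unified, idx_a, idx_b)
```

The pass over every index of the second chunk is the overhead I would like to avoid. If a non-delta batch could replace the dictionary, the second file's indices could be sent unchanged.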

On Sat, Aug 10, 2019 at 12:38 PM Jacques Nadeau <jacq...@apache.org> wrote:

> What situation are you anticipating where you're going to be restating ids
> mid stream?
>
> On Sat, Aug 10, 2019 at 12:13 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> The IPC specification [1] defines behavior when isDelta on a
>> DictionaryBatch [2] is "true". I might have missed it in the
>> specification, but I couldn't find the interpretation for what the
>> expected behavior is when isDelta=false and two dictionary batches with
>> the same ID are sent.
>>
>> It seems like there are two options:
>> 1. Interpret the new dictionary batch as replacing the old one.
>> 2. Regard this as an error condition.
>>
>> Based on the fact that in the "file format" dictionaries are allowed to be
>> placed in any order relative to the record batches, I assume it is the
>> second, but just wanted to make sure.
>>
>> Thanks,
>> Micah
>>
>> [1] https://arrow.apache.org/docs/ipc.html
>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L72