Hi Chris,

My interpretations:

1) I'm not sure it is clearly defined, but my impression is that the first dictionary is never a delta dictionary (your option 1).
2) I don't think dictionaries are prevented from switching state (which I suppose is more complicated, but hopefully not by much).
3) Dictionaries are reused across batches unless replaced.
4) I'm not sure I understand this question. Dictionaries should be passed independently of indexes?
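To make those interpretations concrete, here is a rough sketch of how I would expect a stream reader to track dictionary state. It is a toy model in plain Python, not the Arrow API: `DictionaryTracker`, `on_dictionary`, and `decode` are hypothetical names, and rejecting a delta for an unknown dictionary reflects my impression from 1) rather than anything the spec clearly states.

```python
from typing import Dict, List, Optional


class DictionaryTracker:
    """Toy reader-side model of dictionary state in an IPC stream (hypothetical, not Arrow's API)."""

    def __init__(self) -> None:
        # Current contents of each dictionary, keyed by dictionary ID.
        self._dicts: Dict[int, List[str]] = {}

    def on_dictionary(self, dict_id: int, values: List[str], is_delta: bool) -> None:
        """Apply one dictionary message to the tracked state."""
        if is_delta:
            if dict_id not in self._dicts:
                # Assumption: the first dictionary for an ID is never a delta.
                raise ValueError(f"delta received for unknown dictionary {dict_id}")
            # A delta appends to the existing dictionary.
            self._dicts[dict_id].extend(values)
        else:
            # A non-delta message replaces the dictionary wholesale.
            self._dicts[dict_id] = list(values)

    def decode(self, dict_id: int, indices: List[Optional[int]]) -> List[Optional[str]]:
        """Resolve a batch's indices against whatever dictionary is current."""
        current = self._dicts[dict_id]
        return [None if i is None else current[i] for i in indices]


tracker = DictionaryTracker()

# Initial dictionary, then a batch that uses it.
tracker.on_dictionary(1, ["a", "b", "c"], is_delta=False)
batch1 = tracker.decode(1, [0, 1, 1, 2])

# Delta appends "d"; the batch can reference old and new entries.
tracker.on_dictionary(1, ["d"], is_delta=True)
batch2 = tracker.decode(1, [2, 3, 0, 1])

# No dictionary message before this batch: the current dictionary is reused.
batch3 = tracker.decode(1, [0, None, 3])

# Switching from delta to replacement for the same ID.
tracker.on_dictionary(1, ["c", "a", "d"], is_delta=False)
batch4 = tracker.decode(1, [0, 1, 2])
```

In this model a dictionary message only changes state and a batch only reads it, which is what I mean in 4) by dictionaries being passed independently of indexes: a delta whose new entries are not referenced by the following batch is handled the same way as any other.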
Thanks,
Micah

On Fri, Jan 19, 2024 at 1:55 PM Chris Larsen <clar...@netflix.com.invalid> wrote:

> Hi folks,
>
> I'm working on multi-batch dictionary with delta support in Java [1] and
> would like some clarifications. Given the "isDelta" flag in the dictionary
> message [2], when should this be set to "true"?
>
> 1) If we have dictionary with an ID of 1 that we want to delta encode and
> it is used across multiple batches, should the initial batch have
> `isDelta=false` then subsequent batches have `isDelta=true`? E.g.
>
> batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
> batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
> batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]
>
> Or should the flag be true for the entire IPC flow? E.g.
>
> batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
> batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
> batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]
>
> Either works for me.
>
> 2) Could (in stream, not file IPCs) a single dictionary ever switch state
> across batches from delta to replacement mode or vice-versa? E.g.
>
> batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
> batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
> batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2]
>
> I'd like to keep the protocol and API simple and assume switching is not
> allowed. This would mean the 2nd example above would be canonical.
>
> 3) Are replacement dictionaries required to be serialized for every batch
> or is a dictionary re-used across batches until a replacement is received?
> The CPP IPC API has 'unify_dictionaries' [3] that mentions "a column with a
> dictionary type must have the same dictionary in each record batch". I
> assume (and prefer) the latter, that replacements are serialized once and
> re-used. E.g.
>
> batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
> batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1] // use previous dictionary
> batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2] // replacement
>
> And I assume that 'unify_dictionaries' simply concatenates all dictionaries
> into a single vector serialized in the first batch (haven't looked at the
> code yet).
>
> 4) Is it valid for a delta dictionary to have an update in a subsequent
> batch even though the update is not used in that batch? A silly example
> would be:
>
> batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
> batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null, null, null]
> batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]
>
> Thanks for your help!
>
> [1] https://github.com/apache/arrow/pull/38423
> [2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
> [3] https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE
>
> --
>
> Chris Larsen
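On the writing side, one possible policy consistent with questions 1) to 3) is sketched below: send the first dictionary in full, send nothing when it is unchanged, send only the new tail as a delta when the old dictionary is a prefix of the new one, and otherwise send a replacement. This is a plain-Python illustration under those assumptions, not what the Java or C++ writers actually do, and `plan_dictionary_message` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def plan_dictionary_message(
    previous: Optional[List[str]], current: List[str]
) -> Optional[Tuple[bool, List[str]]]:
    """Decide what to emit before a batch (hypothetical helper).

    Returns (is_delta, values) for the dictionary message to write,
    or None when the previously sent dictionary can be reused as-is.
    """
    if previous is None:
        # First time this dictionary ID is seen: full, non-delta dictionary.
        return (False, list(current))
    if current == previous:
        # Unchanged: no message needed, the reader keeps the old dictionary.
        return None
    if len(current) > len(previous) and current[: len(previous)] == previous:
        # Old dictionary is a strict prefix: send only the appended entries.
        return (True, current[len(previous):])
    # Entries were reordered or removed: fall back to a full replacement.
    return (False, list(current))


first = plan_dictionary_message(None, ["a", "b", "c"])
grown = plan_dictionary_message(["a", "b", "c"], ["a", "b", "c", "d"])
same = plan_dictionary_message(["a", "b", "c", "d"], ["a", "b", "c", "d"])
reordered = plan_dictionary_message(["a", "b", "c", "d"], ["c", "a", "d"])
```

Under this policy the flag is false for the first message and true only for appended tails, which matches the first layout in question 1. The replacement fallback assumes a stream rather than a file, following the parenthetical in question 2.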