Hi folks, I'm working on multi-batch dictionaries with delta support in Java [1] and would like some clarifications. Given the "isDelta" flag in the dictionary message [2], when should this be set to "true"?
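
To make the questions below concrete, here is a minimal sketch of the reader-side behavior I have in mind: a dictionary batch with isDelta=true appends its values to the current dictionary for that ID, and one with isDelta=false replaces it. The names (DeltaDictionarySketch, applyDictionaryBatch, decode) are hypothetical and not the Arrow Java API, and treating a delta for a not-yet-seen ID as an append to an empty dictionary is an assumption on my part, not confirmed behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader-side model of dictionary batches; not the Arrow Java API. */
class DeltaDictionarySketch {
    // Current dictionary values, keyed by dictionary ID.
    private final Map<Long, List<String>> dictionaries = new HashMap<>();

    /** Applies one dictionary batch: append when isDelta is true, otherwise replace. */
    public void applyDictionaryBatch(long id, boolean isDelta, List<String> values) {
        if (isDelta) {
            // Assumption: a delta for an unseen ID starts from an empty dictionary.
            dictionaries.computeIfAbsent(id, k -> new ArrayList<>()).addAll(values);
        } else {
            dictionaries.put(id, new ArrayList<>(values));
        }
    }

    /** Resolves an index vector against the current dictionary; null indices stay null. */
    public List<String> decode(long id, List<Integer> indices) {
        List<String> dict = dictionaries.get(id);
        if (dict == null) {
            throw new IllegalStateException("No dictionary received for id " + id);
        }
        List<String> out = new ArrayList<>();
        for (Integer index : indices) {
            out.add(index == null ? null : dict.get(index));
        }
        return out;
    }

    public static void main(String[] args) {
        DeltaDictionarySketch reader = new DeltaDictionarySketch();
        reader.applyDictionaryBatch(1L, false, Arrays.asList("a", "b", "c"));
        System.out.println(reader.decode(1L, Arrays.asList(0, 1, 1, 2))); // [a, b, b, c]
        reader.applyDictionaryBatch(1L, true, Arrays.asList("d"));
        System.out.println(reader.decode(1L, Arrays.asList(2, 3, 0, 1))); // [c, d, a, b]
        reader.applyDictionaryBatch(1L, true, Arrays.asList("e"));
        System.out.println(reader.decode(1L, Arrays.asList(0, 4)));       // [a, e]
    }
}
```

Under this model, switching between delta and replacement mid-stream and sending a delta whose values are not yet referenced both work mechanically; whether the format intends to allow them is what I'm asking.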

1) If we have a dictionary with an ID of 1 that we want to delta encode and it is used across multiple batches, should the initial batch have `isDelta=false` and subsequent batches have `isDelta=true`? E.g.

    batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
    batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
    batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]

Or should the flag be true for the entire IPC flow? E.g.

    batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
    batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
    batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]

Either works for me.

2) Could (in stream, not file IPCs) a single dictionary ever switch state across batches from delta to replacement mode or vice-versa? E.g.

    batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
    batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
    batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2]

I'd like to keep the protocol and API simple and assume switching is not allowed. This would mean the 2nd example in question 1 (isDelta=true for the entire flow) would be canonical.

3) Are replacement dictionaries required to be serialized for every batch, or is a dictionary re-used across batches until a replacement is received? The C++ IPC API has 'unify_dictionaries' [3] that mentions "a column with a dictionary type must have the same dictionary in each record batch". I assume (and prefer) the latter, that replacements are serialized once and re-used. E.g.

    batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
    batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1] // use previous dictionary
    batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2] // replacement

And I assume that 'unify_dictionaries' simply concatenates all dictionaries into a single vector serialized in the first batch (haven't looked at the code yet).

4) Is it valid for a delta dictionary to have an update in a subsequent batch even though the update is not used in that batch? A silly example would be:

    batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
    batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null, null, null]
    batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]

Thanks for your help!

[1] https://github.com/apache/arrow/pull/38423
[2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
[3] https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE

-- Chris Larsen