Hi Chris,
My interpretations:
1) I'm not sure this is clearly defined, but my impression is that the first
dictionary is never a delta dictionary (your first example).
2) I don't think dictionaries are prevented from switching state. I suppose
that makes things more complicated, but hopefully not by much?
3) Dictionaries are reused across batches unless replaced.
4) I'm not sure I understand this question. Shouldn't the dictionary be passed
independently of the indexes?
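
To make points 1-3 concrete, here is a minimal Python sketch of how I picture
the reader-side state. It is only an illustration: the `DictionaryTracker`
class and its `apply`/`decode` methods are hypothetical names, not part of any
Arrow API, and rejecting a delta for an unknown dictionary id is an assumption
based on my impression in point 1.

```python
class DictionaryTracker:
    """Toy reader-side state: dictionary id -> current list of values."""

    def __init__(self):
        self._dicts = {}

    def apply(self, dict_id, values, is_delta):
        """Apply one dictionary message to the tracked state."""
        if is_delta:
            if dict_id not in self._dicts:
                # Assumption (point 1): a delta needs an earlier dictionary to extend.
                raise ValueError(f"delta for unknown dictionary {dict_id}")
            # A delta appends to the existing values; old indexes stay valid.
            self._dicts[dict_id] = self._dicts[dict_id] + list(values)
        else:
            # A non-delta message replaces the dictionary wholesale.
            self._dicts[dict_id] = list(values)

    def decode(self, dict_id, indices):
        """Resolve indexes against whatever dictionary is current (point 3)."""
        values = self._dicts[dict_id]
        return [None if i is None else values[i] for i in indices]


tracker = DictionaryTracker()
tracker.apply(1, ["a", "b", "c"], is_delta=False)
batch1 = tracker.decode(1, [0, 1, 1, 2])

tracker.apply(1, ["d"], is_delta=True)             # delta: dictionary is now [a, b, c, d]
batch2 = tracker.decode(1, [2, 3, 0, 1])

batch3 = tracker.decode(1, [0, 3])                 # no dictionary message: reuse (point 3)

tracker.apply(1, ["c", "a", "d"], is_delta=False)  # switch to replacement (point 2)
batch4 = tracker.decode(1, [0, 1, 2])
```

With this model, switching between delta and replacement is just a different
branch in `apply`, which is why I'd hope it doesn't add much complexity.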

Thanks,
Micah

On Fri, Jan 19, 2024 at 1:55 PM Chris Larsen <clar...@netflix.com.invalid>
wrote:

> Hi folks,
>
> I'm working on multi-batch dictionaries with delta support in Java [1] and
> would like some clarifications. Given the "isDelta" flag in the dictionary
> message [2], when should this be set to "true"?
>
> 1) If we have a dictionary with an ID of 1 that we want to delta encode and
> it is used across multiple batches, should the initial batch have
> `isDelta=false` then subsequent batches have `isDelta=true`? E.g.
>
> batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1,
> 2]
> batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
> batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]
>
> Or should the flag be true for the entire IPC flow? E.g.
>
> batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1,
> 2]
> batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
> batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]
>
> Either works for me.
>
> 2) Could (in stream, not file IPCs) a single dictionary ever switch state
> across batches from delta to replacement mode or vice-versa? E.g.
>
> batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1,
> 1, 2]
> batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
> batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1,
> 2]
>
> I'd like to keep the protocol and API simple and assume switching is not
> allowed. This would mean the 2nd example above would be canonical.
>
> 3) Are replacement dictionaries required to be serialized for every batch
> or is a dictionary re-used across batches until a replacement is received?
> The CPP IPC API has 'unify_dictionaries' [3] that mentions "a column with a
> dictionary type must have the same dictionary in each record batch". I
> assume (and prefer) the latter, that replacements are serialized once and
> re-used. E.g.
>
> batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1,
> 1, 2]
> batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1]
> // use previous dictionary
> batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1,
> 2] // replacement
>
> And I assume that 'unify_dictionaries' simply concatenates all dictionaries
> into a single vector serialized in the first batch (haven't looked at the
> code yet).
>
> 4) Is it valid for a delta dictionary to have an update in a subsequent
> batch even though the update is not used in that batch? A silly example
> would be:
>
> batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1,
> 1, 2]
> batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null,
> null, null]
> batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]
>
> Thanks for your help!
>
> [1] https://github.com/apache/arrow/pull/38423
> [2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
> [3]
> https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE
>
> --
>
>
> Chris Larsen
>
