Hi folks,

I'm working on multi-batch dictionary support with deltas in Java [1] and
would like some clarifications. Given the "isDelta" flag in the dictionary
message [2], when should it be set to "true"?

1) If we have a dictionary with an ID of 1 that we want to delta encode and
that is used across multiple batches, should the initial batch have
`isDelta=false` and subsequent batches have `isDelta=true`? E.g.

batch 1, dict 1, isDelta=false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4]

Or should the flag be true for the entire IPC flow? E.g.

batch 1, dict 1, isDelta=true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta=true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta=true, dictVector=[e], indexVector=[0, 4, 3]

Either works for me.
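
For reference, a minimal sketch of how a reader might treat the two
encodings above, assuming it appends on isDelta=true and replaces on
isDelta=false. `apply_dictionary`, `decode`, and `run` are hypothetical
helpers in plain Python, not Arrow APIs, and treating a delta for an unseen
ID as an append to an empty dictionary is an assumption, not something the
format is known to guarantee.

```python
def apply_dictionary(state, dict_id, values, is_delta):
    """Update the reader's dictionary state for one dictionary message."""
    if is_delta:
        # Assumption: a delta for an unseen id appends to an empty dictionary.
        state[dict_id] = state.get(dict_id, []) + list(values)
    else:
        # A non-delta message replaces whatever was there before.
        state[dict_id] = list(values)


def decode(state, dict_id, indices):
    """Look up each index in the current dictionary."""
    return [state[dict_id][i] for i in indices]


def run(batches, dict_id=1):
    """Apply each (is_delta, dict_values, indices) batch and decode it."""
    state, decoded = {}, []
    for is_delta, values, indices in batches:
        apply_dictionary(state, dict_id, values, is_delta)
        decoded.append(decode(state, dict_id, indices))
    return decoded


# First encoding: isDelta=false for batch 1, true afterwards.
first = run([
    (False, ["a", "b", "c"], [0, 1, 1, 2]),
    (True, ["d"], [2, 3, 0, 1]),
    (True, ["e"], [0, 4]),
])

# Second encoding: isDelta=true for the entire flow.
second = run([
    (True, ["a", "b", "c"], [0, 1, 1, 2]),
    (True, ["d"], [2, 3, 0, 1]),
    (True, ["e"], [0, 4, 3]),
])
```

Under that assumption the first two batches decode identically in both
encodings, so the only difference is whether a reader accepts a delta
before it has seen any dictionary for that ID.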

2) In stream IPC (not file IPC), could a single dictionary ever switch
across batches from delta to replacement mode or vice versa? E.g.

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[2, 3, 0, 1]
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2]

I'd like to keep the protocol and API simple and assume switching is not
allowed. This would mean the 2nd example under question 1 (isDelta=true for
the entire flow) would be canonical.
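
If switching were disallowed, a writer could enforce that with a small
guard. The sketch below is illustrative only: `DictionaryModeGuard` is a
hypothetical name, not part of any Arrow API, and the rule it enforces is
the assumption proposed above rather than confirmed format behaviour.

```python
class DictionaryModeGuard:
    """Remembers the mode of each dictionary id and rejects later switches."""

    def __init__(self):
        self._mode = {}  # dict_id -> is_delta flag of the first message seen

    def check(self, dict_id, is_delta):
        # setdefault records the mode on first use and returns it afterwards.
        first_mode = self._mode.setdefault(dict_id, is_delta)
        if first_mode != is_delta:
            raise ValueError(
                f"dictionary {dict_id} switched isDelta from "
                f"{first_mode} to {is_delta}"
            )


guard = DictionaryModeGuard()
guard.check(1, True)   # batch 1
guard.check(1, True)   # batch 2
try:
    guard.check(1, False)  # batch 3 from the example: would be rejected
    rejected = False
except ValueError:
    rejected = True
```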

3) Are replacement dictionaries required to be serialized for every batch,
or is a dictionary re-used across batches until a replacement is received?
The C++ IPC API has a 'unify_dictionaries' option [3] whose documentation
mentions "a column with a dictionary type must have the same dictionary in
each record batch". I assume (and prefer) the latter: replacements are
serialized once and re-used. E.g.

batch 1, dict 1, isDelta = false, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = false, dictVector=[], indexVector=[2, 1, 0, 1] // use previous dictionary
batch 3, dict 1, isDelta = false, dictVector=[c, a, d], indexVector=[0, 1, 2] // replacement

And I assume that 'unify_dictionaries' simply concatenates all dictionaries
into a single vector serialized in the first batch (haven't looked at the
code yet).
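
The "serialized once and re-used" reading from question 3 can be sketched
as a reader that keeps the last dictionary it received. This is a plain
Python model, not Arrow code; `read_stream` is a hypothetical helper, and
the empty dictVector in batch 2 is modelled as "no dictionary message"
(`None`), which is an assumption about what that line is meant to express.

```python
def read_stream(batches):
    """Decode (dictionary_or_None, indices) batches, re-using the last
    dictionary until a replacement arrives."""
    current = None
    decoded = []
    for dictionary, indices in batches:
        if dictionary is not None:
            current = list(dictionary)  # replacement received
        if current is None:
            raise ValueError("indices received before any dictionary")
        decoded.append([current[i] for i in indices])
    return decoded


rows = read_stream([
    (["a", "b", "c"], [0, 1, 1, 2]),  # batch 1: initial dictionary
    (None, [2, 1, 0, 1]),             # batch 2: use previous dictionary
    (["c", "a", "d"], [0, 1, 2]),     # batch 3: replacement
])
```

Note that if batch 2 instead carried an actual empty non-delta dictionary,
a reader applying replacement literally would end up with an empty
dictionary and indices 2, 1, 0, 1 would be out of range.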

4) Is it valid for a delta dictionary to have an update in a subsequent
batch even though the update is not used in that batch? A silly example
would be:

batch 1, dict 1, isDelta = true, dictVector=[a, b, c], indexVector=[0, 1, 1, 2]
batch 2, dict 1, isDelta = true, dictVector=[d], indexVector=[null, null, null, null]
batch 3, dict 1, isDelta = true, dictVector=[], indexVector=[0, 3, 2]
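
A small sketch of what a reader would produce for this example if such
early deltas are accepted. It is a plain Python model under that
assumption; `decode_with_deltas` is a hypothetical helper and an empty
delta is treated as a no-op.

```python
def decode_with_deltas(batches):
    """Append each batch's delta to one dictionary, then decode its indices.
    Null indices decode to None without touching the dictionary."""
    dictionary = []
    decoded = []
    for delta, indices in batches:
        dictionary.extend(delta)  # an empty delta leaves the dictionary as-is
        decoded.append(
            [None if i is None else dictionary[i] for i in indices]
        )
    return decoded


rows = decode_with_deltas([
    (["a", "b", "c"], [0, 1, 1, 2]),
    (["d"], [None, None, None, None]),  # "d" arrives but is not referenced
    ([], [0, 3, 2]),                    # "d" is first used here, via index 3
])
```

In this model the entry added in batch 2 is only needed in batch 3, so the
reader has to keep deltas even when the batch that carried them never
references them.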

Thanks for your help!

[1] https://github.com/apache/arrow/pull/38423
[2] https://github.com/apache/arrow/blob/main/format/Message.fbs#L134
[3] https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc15IpcWriteOptions18unify_dictionariesE

-- 


Chris Larsen
