Hello,

For my use case I'm sending an Arrow IPC stream from a server to a client,
with some columns being dictionary-encoded. The dictionary encoding happens
on the fly, though, so the full dictionary isn't known at the beginning of
the stream. Instead, it is computed for every batch, and DictionaryBatches
need to be emitted before every RecordBatch.

However, unless I am mistaken, this is not currently supported in the
ArrowStreamWriter. The dictionary provider is passed in at construction
time, the dictionaries are emitted once, and there is no hook for
re-emitting them.

I've hacked around this locally by basically copy-pasting ArrowStreamWriter
and extending it with a
`public void writeBatch(DictionaryProvider provider)` method that re-emits
the dictionaries before emitting each record batch.
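
To make the shape of that concrete, here is a toy model in plain Java. It
does not use Arrow's classes: `ToyStreamWriter`, the `Map`-based provider,
and the string "messages" are all made-up stand-ins. It only illustrates
the message ordering I'm after (schema once, then dictionaries before every
record batch), not how the real writer would implement it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy stand-in for a stream writer. Nothing here is an Arrow class:
// the provider is modelled as a map from dictionary id to its values,
// and each "message" is just a descriptive string.
final class ToyStreamWriter {
    private final List<String> messages = new ArrayList<>();
    private boolean started = false;
    private int batchCount = 0;

    // Hypothetical method mirroring the writeBatch(provider) idea:
    // emit the schema once, then the dictionaries, then the batch.
    void writeBatch(Map<Long, List<String>> provider) {
        if (!started) {
            messages.add("SCHEMA");
            started = true;
        }
        // Re-emit every dictionary the provider currently holds.
        for (Map.Entry<Long, List<String>> entry : provider.entrySet()) {
            messages.add("DICTIONARY id=" + entry.getKey()
                    + " values=" + entry.getValue());
        }
        batchCount++;
        messages.add("RECORD_BATCH #" + batchCount);
    }

    // Messages in the order they would go out on the stream.
    List<String> messages() {
        return List.copyOf(messages);
    }
}
```

Calling `writeBatch` twice with two different providers gives schema,
dictionary, batch, dictionary, batch, in that order.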

However, I'd of course much prefer it if the provided ArrowStreamWriter
supported this. If people agree that it's missing (I may be overlooking
something obvious) and that it would be useful to have, then I'm happy to
contribute it myself (not necessarily using the aforementioned
`writeBatch(provider)` approach, though it seems reasonable).

Cheers,
-J
