Hi Joris,
I believe this is indeed missing. As far as I recall, we worked around it in
tests by writing dictionary batches directly to the stream [1].
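
To make the intended message ordering concrete, here is a minimal toy sketch.
It uses plain Java collections rather than Arrow classes, so every name in it
(`PerBatchDictionaryStream`, `Message`, `writeBatch(List<String>)`) is
hypothetical and not part of the Arrow Java API. It only illustrates the idea
discussed in this thread: the dictionary is computed per batch, and a
dictionary message is emitted ahead of every record batch, replacing the
previous dictionary for that column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical and NOT part of Arrow Java.
final class PerBatchDictionaryStream {

    /** One message in the modelled stream: a kind tag plus a printable payload. */
    static final class Message {
        final String kind;     // "SCHEMA", "DICTIONARY" or "RECORD"
        final String payload;

        Message(String kind, String payload) {
            this.kind = kind;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return kind + " " + payload;
        }
    }

    private final List<Message> out = new ArrayList<>();

    PerBatchDictionaryStream() {
        // The schema is emitted once, at the start of the stream.
        out.add(new Message("SCHEMA", "[col: dictionary-encoded string]"));
    }

    /** Dictionary-encodes one batch on the fly, then emits DICTIONARY followed by RECORD. */
    void writeBatch(List<String> values) {
        // Insertion-ordered map: value -> index within this batch's dictionary.
        Map<String, Integer> dict = new LinkedHashMap<>();
        List<Integer> indices = new ArrayList<>();
        for (String v : values) {
            Integer idx = dict.get(v);
            if (idx == null) {
                idx = dict.size();
                dict.put(v, idx);
            }
            indices.add(idx);
        }
        // The dictionary goes out first so a reader can decode the indices that follow.
        out.add(new Message("DICTIONARY", new ArrayList<>(dict.keySet()).toString()));
        out.add(new Message("RECORD", indices.toString()));
    }

    List<Message> messages() {
        return out;
    }

    public static void main(String[] args) {
        PerBatchDictionaryStream stream = new PerBatchDictionaryStream();
        stream.writeBatch(Arrays.asList("a", "b", "a"));
        stream.writeBatch(Arrays.asList("c", "c"));
        for (Message m : stream.messages()) {
            System.out.println(m);
        }
    }
}
```

Two `writeBatch` calls yield one schema message followed by two
dictionary/record pairs. A real implementation would write Arrow dictionary
batches and record batches instead of these placeholder messages.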

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java#L614

On Thu, Mar 4, 2021 at 4:06 AM Joris Peeters <joris.mg.peet...@gmail.com>
wrote:

> Hello,
>
> For my use case I'm sending an Arrow IPC-stream from a server to a client,
> with some columns being dictionary-encoded. Dictionary-encoding happens on
> the fly, though, so the full dictionary isn't known yet at the beginning of
> the stream, but rather is computed for every batch, and DictionaryBatches
> are to be emitted prior to every RecordBatch.
>
> However, unless I am mistaken, this is not currently supported in the
> ArrowStreamWriter. The dictionary provider is passed in at construction
> time, the dicts are emitted once, and there is no hook for re-emitting
> these.
>
> I've locally hacked around this by basically copy-pasting ArrowStreamWriter
> and extending it with a `public void writeBatch(DictionaryProvider provider)`
> method that re-emits the dictionaries prior to emitting the record batches.
>
> However, I'd of course much prefer it if the provided ArrowStreamWriter
> supported this. If people agree that it's missing (i.e. that I'm not
> overlooking something obvious) and that it would be useful to have, then
> I'm happy to contribute it myself (not necessarily using the
> aforementioned `writeBatch(provider)` approach, though it seems reasonable).
>
> Cheers,
> -J
>
