Cheers. Made and self-assigned https://issues.apache.org/jira/browse/ARROW-11869.
On Fri, Mar 5, 2021 at 1:44 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Joris,
> I do believe this is missing. I believe we worked around this for testing
> by directly writing dictionary batches to the stream [1].
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/ipc/TestArrowReaderWriter.java#L614
>
> On Thu, Mar 4, 2021 at 4:06 AM Joris Peeters <joris.mg.peet...@gmail.com>
> wrote:
>
> > Hello,
> >
> > For my use case I'm sending an Arrow IPC-stream from a server to a client,
> > with some columns being dictionary-encoded. Dictionary-encoding happens on
> > the fly, though, so the full dictionary isn't known yet at the beginning of
> > the stream, but rather is computed for every batch, and DictionaryBatches
> > are to be emitted prior to every RecordBatch.
> >
> > However, unless I am mistaken, this is not currently supported in the
> > ArrowStreamWriter. The dictionary provider is passed in at construction
> > time, the dicts are emitted once, and there is no hook for re-emitting
> > these.
> >
> > I've locally hacked around this by basically copy-pasting ArrowStreamWriter
> > and extending it with a `public void writeBatch(DictionaryProvider
> > provider)` method, that re-emits the dictionaries prior to emitting the
> > record batches.
> >
> > However, I'd of course much prefer if the provided ArrowStreamWriter
> > supported this. If people agree that it's missing (i.e. maybe I'm
> > overlooking something obvious) and that it would be useful to have, then
> > I'm happy to contribute it myself (not necessarily by using the
> > aforementioned `writeBatch(provider)` approach, but seems reasonable).
> >
> > Cheers,
> > -J
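
For illustration only, the message ordering described in the quoted message (dictionaries re-emitted before every record batch) might be sketched as below. This is a toy model, not the Arrow Java API: `ToyStreamWriter`, `encode`, `demo`, and the string "messages" are all hypothetical stand-ins, and the only thing the sketch tries to show is the order schema, then dictionary and record batch per call.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DictionaryPerBatchSketch {

    /** Hypothetical stand-in for an IPC stream: it only records message order. */
    static final class ToyStreamWriter {
        private final List<String> messages = new ArrayList<>();
        private boolean started = false;

        /** Emits the schema once, then the given dictionaries, then the record batch. */
        void writeBatch(Map<Long, List<String>> dictionaries, List<Integer> indices) {
            if (!started) {
                messages.add("SCHEMA");
                started = true;
            }
            // Re-emit every dictionary ahead of the record batch that refers to it.
            for (Map.Entry<Long, List<String>> e : dictionaries.entrySet()) {
                messages.add("DICTIONARY id=" + e.getKey() + " values=" + e.getValue());
            }
            messages.add("RECORD indices=" + indices);
        }

        List<String> messages() {
            return messages;
        }
    }

    /** Dictionary-encodes one batch on the fly; distinct values are appended to dictionaryOut. */
    static List<Integer> encode(List<String> values, List<String> dictionaryOut) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        List<Integer> indices = new ArrayList<>();
        for (String value : values) {
            Integer position = positions.get(value);
            if (position == null) {
                position = positions.size();
                positions.put(value, position);
                dictionaryOut.add(value);
            }
            indices.add(position);
        }
        return indices;
    }

    /** Writes two batches, each with a freshly computed dictionary under id 1. */
    static List<String> demo() {
        ToyStreamWriter writer = new ToyStreamWriter();
        List<List<String>> batches = List.of(List.of("a", "b", "a"), List.of("c", "c", "d"));
        for (List<String> batch : batches) {
            List<String> dictionary = new ArrayList<>();
            List<Integer> indices = encode(batch, dictionary);
            Map<Long, List<String>> provider = new LinkedHashMap<>();
            provider.put(1L, dictionary);
            writer.writeBatch(provider, indices);
        }
        return writer.messages();
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this toy model the second batch's dictionary simply takes the place of the first under the same id; how a real reader treats a re-sent dictionary id is governed by the Arrow IPC format and is not modelled here.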