It is quite possible that the dictionary-related code in Java could use some rethinking. I recall that working with dictionaries has been a little awkward, and I think we had some open JIRAs related to this.
On Thu, Aug 26, 2021 at 12:52 AM roee shlomo <roe...@gmail.com> wrote:
>
> It seems that we have both raw value and encoded value types in the Java
> implementation, so there is no information loss?
>
> I think that in the Java memory format they are both the index type, see
>
> https://github.com/apache/arrow/blob/5003278ded77f1ab385425143aafd085fda1f701/java/vector/src/main/java/org/apache/arrow/vector/util/DictionaryUtility.java#L44-L45
>
> Users would expect the Java memory format (e.g., to create Vector or
> VectorSchemaRoot from it directly). I don't think moving to the ipc format
> would be a good idea either, the C data interface is quite different, e.g.,
> should support import/export of individual vectors. However, the IPC code
> is a good reference for learning how to handle dictionaries so I'll go over
> it more carefully.
>
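To make the index-type point in the quoted message concrete, here is a toy sketch in plain Java. It is only an illustration under stated assumptions, not the Arrow API: the `Field` class, the `resolveValueType` helper, the type names as strings, and the dictionary id `1` are all hypothetical. The assumption it models is that a dictionary-encoded field in the memory format carries only the index type plus a dictionary id, so the value type has to be looked up from the dictionary itself.

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryFieldSketch {

    // Hypothetical stand-in for a memory-format field (NOT org.apache.arrow Field).
    static final class Field {
        final String name;
        final String type;        // for an encoded field this is the index type
        final Long dictionaryId;  // null when the field is not dictionary-encoded

        Field(String name, String type, Long dictionaryId) {
            this.name = name;
            this.type = type;
            this.dictionaryId = dictionaryId;
        }
    }

    // Hypothetical lookup: the value type is recovered from a map of
    // dictionary id -> value type, since the field itself only knows the index type.
    static String resolveValueType(Field field, Map<Long, String> dictionaryValueTypes) {
        if (field.dictionaryId == null) {
            return field.type; // plain field: its own type is the value type
        }
        String valueType = dictionaryValueTypes.get(field.dictionaryId);
        if (valueType == null) {
            throw new IllegalStateException(
                "No dictionary registered for id " + field.dictionaryId);
        }
        return valueType;
    }

    public static void main(String[] args) {
        Map<Long, String> dictionaryValueTypes = new HashMap<>();
        dictionaryValueTypes.put(1L, "Utf8");

        Field city = new Field("city", "Int32", 1L);
        System.out.println(city.type);                                    // prints "Int32"
        System.out.println(resolveValueType(city, dictionaryValueTypes)); // prints "Utf8"
    }
}
```

In this model, exporting a single encoded field without its dictionary would lose the value type, which is why the dictionary has to travel with the field in any import/export path.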