On Wed, 2021-08-25 at 21:02 +0300, roee shlomo wrote:

> This means that an API to import an ArrowSchema (in C) into a
> Field/Schema (in Java) is not suitable for dictionary encoded arrays
> because there is an information loss. Specifically, there is nothing
> in Field/Schema to indicate the value type as far as we can tell.

I think maybe the IPC code can be referenced here:

1. (C++) Serialization of field with dictionary:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L696-L735

2. (Java) Deserialization of field with dictionary:
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L133-L177

And this piece of code shows how the Java Arrow schema organizes the
dictionary index type and value type:
https://github.com/apache/arrow/blob/5003278ded77f1ab385425143aafd085fda1f701/java/vector/src/test/java/org/apache/arrow/vector/ipc/MessageSerializerTest.java#L143-L155
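
To make that concrete, here is a rough, untested sketch of how I read
those classes (the field name, dictionary id and types are made up for
illustration): in the message format the field keeps the value type
plus a DictionaryEncoding, while the in-memory Field of the encoded
vector only carries the index type, so the value type is only reachable
through the Dictionary/DictionaryProvider.

    import java.util.Collections;
    import org.apache.arrow.vector.types.pojo.ArrowType;
    import org.apache.arrow.vector.types.pojo.DictionaryEncoding;
    import org.apache.arrow.vector.types.pojo.Field;
    import org.apache.arrow.vector.types.pojo.FieldType;

    public class DictFieldSketch {
      public static void main(String[] args) {
        // Dictionary encoding carries the id and the index type (int32 here).
        DictionaryEncoding encoding =
            new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));

        // Message-format field (what IPC serializes): the field's type is the
        // VALUE type (Utf8) and the encoding is attached to it.
        Field messageField = new Field("str_dict",
            new FieldType(true, new ArrowType.Utf8(), encoding),
            Collections.emptyList());

        // In-memory field of the encoded vector: the type is the INDEX type;
        // the value type lives only in the Dictionary registered with a
        // DictionaryProvider, which is the information-loss point above.
        Field memoryField = new Field("str_dict",
            new FieldType(true, new ArrowType.Int(32, true), encoding),
            Collections.emptyList());

        System.out.println(messageField);
        System.out.println(memoryField);
      }
    }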


> Even if that were solved, importing dictionary encoded arrays is too
> complex from a user point of view. We would need to import both the
> vector and a dictionary provider (i.e. multiple return values in some
> cases) and the user would be responsible for taking ownership of
> every vector in the dictionary provider and eventually closing it.
> This adds a lot of complexity for cases like importing ArrowArray (C)
> into an existing VectorSchemaRoot (when importing in batches).

If VectorSchemaRoot doesn't cooperate here, would it be an option to
have another API that exports/imports via Java
ArrowRecordBatch/ArrowDictionaryBatch, or some other composite
buffer-based structure, and doesn't use the Java Vector facilities at
all? Users could then load these buffers themselves via
VectorLoader/VectorSchemaRoot/ArrowReader, using an already imported
schema object.
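
Roughly, for the record batch part, something like this (untested
sketch; the method and variable names are made up):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.VectorLoader;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.ipc.message.ArrowRecordBatch;
    import org.apache.arrow.vector.types.pojo.Schema;

    public class LoadSketch {
      // `schema` would come from the schema import and `batch` from the
      // proposed buffer-based record batch import; both are hypothetical.
      static VectorSchemaRoot load(Schema schema, ArrowRecordBatch batch,
          BufferAllocator allocator) {
        VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
        new VectorLoader(root).load(batch);  // fill root's vectors from the batch buffers
        return root;  // the caller owns the root and closes it when done
      }
    }

Dictionary batches would still need their own loading path (something
similar to what ArrowReader does internally), but the ownership story
would at least stay entirely on the user side.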

Best,
Hongze
