Hi roee,

It seems that the Java implementation keeps both the raw value type and
the encoded (index) type, so is there really any information loss?

In particular, org.apache.arrow.vector.types.pojo.FieldType#type holds
the raw type, and
org.apache.arrow.vector.types.pojo.FieldType#dictionary#indexType holds
the encoded type.
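
To make the layout concrete, here is a minimal sketch in plain Java. The
records below are hypothetical stand-ins that only mirror the shape of
FieldType and its dictionary encoding (they are not the Arrow classes, and
the type names are plain strings chosen for illustration). The point is just
that one field description can carry the value type and the index type side
by side:

```java
public class DictionaryFieldSketch {
    // Hypothetical stand-in for a dictionary encoding: the dictionary id,
    // an "ordered" flag, and the type of the indices stored in the vector.
    record Encoding(long id, boolean ordered, String indexType) {}

    // Hypothetical stand-in for a field type: "type" is the raw value type,
    // "dictionary" is null when the field is not dictionary-encoded.
    record FieldType(boolean nullable, String type, Encoding dictionary) {}

    // Report both types when an encoding is present, otherwise only the raw type.
    static String describe(FieldType ft) {
        if (ft.dictionary() == null) {
            return ft.type();
        }
        return "dictionary<values=" + ft.type()
                + ", indices=" + ft.dictionary().indexType() + ">";
    }

    public static void main(String[] args) {
        FieldType plain = new FieldType(true, "Utf8", null);
        FieldType encoded =
                new FieldType(true, "Utf8", new Encoding(1L, false, "Int(32, true)"));
        System.out.println(describe(plain));
        System.out.println(describe(encoded));
    }
}
```

If the importer fills in both members like this, the value type and the
index type are both recoverable from the field alone.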

Best,
Liya Fan


On Thu, Aug 26, 2021 at 10:09 AM Hongze Zhang <notify...@126.com> wrote:

> On Wed, 2021-08-25 at 21:02 +0300, roee shlomo wrote:
>
> > This means that an API to import an ArrowSchema (in C) into a
> > Field/Schema
> > (in Java) is not suitable for dictionary encoded arrays because there
> > is an
> > information loss. Specifically, there is nothing in Field/Schema to
> > indicate the value type as far as we can tell.
>
> I think the IPC code could serve as a reference here:
>
> 1. (C++) Serialization of field with dictionary:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata_internal.cc#L696-L735
>
> 2. (Java) Deserialization of field with dictionary:
>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java#L133-L177
>
> And this piece of code shows how a Java Arrow schema organizes the
> dictionary index type and the value type:
>
> https://github.com/apache/arrow/blob/5003278ded77f1ab385425143aafd085fda1f701/java/vector/src/test/java/org/apache/arrow/vector/ipc/MessageSerializerTest.java#L143-L155
>
>
> > Even if that were solved, importing dictionary encoded arrays is too
> > complex from a user point of view. We would need to import both the
> > vector
> > and a dictionary provider (i.e. multiple return values in some cases)
> > and
> > the user would be responsible for taking ownership of every vector in
> > the
> > dictionary provider and eventually closing it. This adds a lot of
> > complexity for cases like importing ArrowArray (C) into an existing
> > VectorSchemaRoot (when importing in batches).
>
> If VectorSchemaRoot doesn't cooperate here, would it be an option to
> have another API to export/import via Java
> ArrowRecordBatch/ArrowDictionaryBatch or some sort of composite buffer-
> based structure, which doesn't utilize Java Vector facilities at all?
> Users would always be able to have these buffers loaded via
> VectorLoader/VectorSchemaRoot/ArrowReader by themselves with an already
> imported schema object.
>
> Best,
> Hongze
>
>
