modeling column group

Yue Ni Sun, 01 Jan 2023 06:02:27 -0800

Hi there,

Happy new year.


I store some data in Arrow IPC files. Two of the fields are always accessed
together and in a row-oriented manner: whenever I read one of them for a
given row, I also read the other. All other fields are accessed in a
columnar manner. One of the two fields is a string field and the other is
an int32 field. I would like to know if there is a canonical approach for
modeling this kind of usage in Arrow.

The IPC files are memory mapped and randomly accessed. Because of the
columnar storage, reading the two fields of the same row requires 2 random
accesses. Since I know these two fields are always read together,
theoretically this could be reduced to 1 random access. Initially I read
the doc about the struct layout
(https://arrow.apache.org/docs/format/Columnar.html#struct-layout), but it
seems to still store and access the data in a columnar manner, so it
doesn't help. I could probably use some proprietary encoding to encode
these two fields into a single field, but that is not elegant and is
somewhat less portable. Is there any canonical approach in Arrow for
modeling such usage? Thanks.
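
To make the "single field" idea concrete, here is a rough sketch of the kind
of encoding I mean, in plain Python. The function names encode_pair and
decode_pair and the byte layout (4-byte little-endian int32 followed by the
UTF-8 bytes of the string) are only assumptions for illustration, not
anything defined by Arrow. The packed bytes would be stored as one value of
a binary column, so both values sit next to each other in the same bytes.

```python
import struct
from typing import Tuple


def encode_pair(number: int, text: str) -> bytes:
    """Pack an int32 and a string into one binary value (hypothetical layout)."""
    # 4-byte little-endian signed integer, followed by the raw UTF-8 bytes.
    # No length prefix for the string: it simply runs to the end of the value.
    return struct.pack("<i", number) + text.encode("utf-8")


def decode_pair(blob: bytes) -> Tuple[int, str]:
    """Split a packed value back into its int32 and string parts."""
    (number,) = struct.unpack_from("<i", blob, 0)
    return number, blob[4:].decode("utf-8")
```

A reader would then call decode_pair on the bytes of a single row to get
both values at once. The downside is the one mentioned above: any consumer
has to know this layout, since the schema only says "binary".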

Regards,
Yue
