Re: modeling column group

Yue Ni Sun, 01 Jan 2023 23:21:06 -0800

Thanks so much, Weston. Both [1] and [2] are informative, and I will
check them out.


On Mon, Jan 2, 2023 at 5:05 AM Weston Pace <weston.p...@gmail.com> wrote:

> There was a discussion a while back about representing complex numbers
> that seems similar[1].  If both fields were the same type you could
> use a fixed size list array.  However, since you want two different
> types you'd want some kind of "packed struct" which does not exist (to
> my knowledge) today.  Also, given that one of the fields is a string
> it would be a bit of a challenge.
>
> There is a layout kind of like this in the hash-table/group-by
> implementation.  We use a row-encoding scheme[2] in the hash-table.  All
> fixed size types are encoded first and then the variable types come at
> the end.  I can't remember off the top of my head if the lengths of
> the variable sized fields are encoded as fixed size types or in a
> separate array.  However, this is internal, not thoroughly documented,
> and probably just useful for inspiration at the moment.
>
> [1] https://lists.apache.org/thread/m8jnrfzozq1dx66twzc80vbyr6r365yf
> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/row/row_internal.h
>
> On Sun, Jan 1, 2023 at 6:02 AM Yue Ni <niyue....@gmail.com> wrote:
> >
> > Hi there,
> >
> > Happy new year.
> >
> > I store some data in Arrow IPC files. Two of the fields are always
> > accessed together, in a row-oriented manner, while the other fields
> > are accessed in a columnar manner. One of the two fields is a string
> > field and the other is an int32 field. I would like to know if there
> > is any canonical approach for modeling this kind of usage in Arrow.
> >
> > The IPC files are memory mapped and randomly accessed. Because of the
> > columnar storage, reading the two fields of the same row requires 2
> > random accesses. Since I know these two fields are always read
> > together, in theory this could be reduced to 1 random access.
> > Initially I read the doc about the struct layout
> > (https://arrow.apache.org/docs/format/Columnar.html#struct-layout),
> > but it still seems to store and access the data in a columnar manner,
> > so it doesn't help. I could probably use some proprietary encoding to
> > pack these two fields into a single field, but that is not elegant
> > and is somewhat less portable. Is there any canonical approach in
> > Arrow for modeling such usage? Thanks.
> >
> > Regards,
> > Yue
>
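
For illustration only, here is a minimal sketch of the "encode the two
fields into a single field" idea from the thread. The fixed-size int32
comes first and the variable-length string comes last, loosely following
the ordering described for the hash-table row encoding. This is not
Arrow's internal row format and not an Arrow API: the byte layout and the
names encode_row, decode_row, build_column and read_row are all
hypothetical. It uses only the Python standard library.

```python
import struct

# Hypothetical per-row layout (an assumption, NOT Arrow's internal format):
#   [int32 value][uint32 byte length of the string][UTF-8 string bytes]
_HEADER = struct.Struct("<iI")


def encode_row(number: int, text: str) -> bytes:
    """Pack one (int32, string) pair into a single contiguous byte string."""
    payload = text.encode("utf-8")
    return _HEADER.pack(number, len(payload)) + payload


def decode_row(buf, offset: int = 0):
    """Read one packed row starting at `offset` in a bytes-like buffer."""
    number, length = _HEADER.unpack_from(buf, offset)
    start = offset + _HEADER.size
    return number, bytes(buf[start:start + length]).decode("utf-8")


def build_column(rows):
    """Concatenate encoded rows into one buffer.

    offsets[i] is the byte position where row i starts, similar in spirit
    to the offsets buffer of a variable-size binary array.
    """
    offsets, chunks, pos = [0], [], 0
    for number, text in rows:
        chunk = encode_row(number, text)
        chunks.append(chunk)
        pos += len(chunk)
        offsets.append(pos)
    return offsets, b"".join(chunks)


def read_row(offsets, data, i):
    """Fetch both fields of row i with a single seek into `data`."""
    return decode_row(memoryview(data), offsets[i])


offsets, data = build_column([(7, "alpha"), (-1, "héllo"), (42, "")])
number, text = read_row(offsets, data, 1)
```

The packed rows could be stored as values of a single binary column, so
that both fields of a row sit next to each other in the data buffer. The
trade-off is the one raised in the thread: the encoding is private to the
application, so generic Arrow readers would see only opaque bytes.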
