Thanks so much, Weston. Both [1] and [2] are informative, and I will check them out.

On Mon, Jan 2, 2023 at 5:05 AM Weston Pace <weston.p...@gmail.com> wrote:

> There was a discussion a while back about representing complex numbers
> that seems similar[1]. If both fields were the same type you could
> use a fixed size list array. However, since you want two different
> types you'd want some kind of "packed struct" which does not exist (to
> my knowledge) today. Also, given that one of the fields is a string
> it would be a bit of a challenge.
>
> There is a layout kind of like this in the hash-table/group-by
> implementation. We use a row-encoding scheme in the hash-table. All
> fixed size types are encoded first and then the variable types come at
> the end. I can't remember off the top of my head if the lengths of
> the variable sized fields are encoded as fixed size types or in a
> separate array. However, this is internal, not thoroughly documented,
> and probably just useful for inspiration at the moment.
>
> [1] https://lists.apache.org/thread/m8jnrfzozq1dx66twzc80vbyr6r365yf
> [2] https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/row/row_internal.h
>
> On Sun, Jan 1, 2023 at 6:02 AM Yue Ni <niyue....@gmail.com> wrote:
> >
> > Hi there,
> >
> > Happy new year.
> >
> > I store some data in arrow IPC files. And I have two fields that are
> > always accessed at the same time, namely, when accessing these two
> > fields, they are accessed in a row oriented manner and are always
> > fetched together, but other fields are accessed in columnar manner.
> > One of the fields is a string field, and the other is an int32 field.
> > I would like to know if there is any canonical approach for modeling
> > this kind of usage in arrow.
> >
> > The IPC files are memory mapped, and are randomly accessed. Because of
> > the columnar storage, when accessing the two fields of the same row, it
> > requires 2 random accesses to do it. Since I know the access pattern for
> > these two fields is always reading together, theoretically it can be
> > reduced to 1 random access when fetching them. Initially I read doc about
> > struct layout
> > (https://arrow.apache.org/docs/format/Columnar.html#struct-layout), but
> > it seems still storing and accessing the data in a columnar manner so it
> > doesn't help. I could probably use some proprietary encoding to encode
> > these two fields into a single field, but it is not elegant and somewhat
> > less portable. Is there any canonical approach in arrow for modeling such
> > usage? Thanks.
> >
> > Regards,
> > Yue