Hi Jorge,

> Wrt to the extension type: I am not sure we can make it fast, though: the
> interpretation of the bytes would need to be done dynamically (instead of
> statically) because we can't compile the struct prior to receiving it (via
> IPC or FFI). This interpretation would be part of hot loops (as we would
> need to interpret the bytes on every element).

I think we might be thinking about this differently. In my mind an
extension type (at least in C++) is really a dynamic lookup to statically
compiled code. So operations that we care about could be compiled ahead of
time (they would require downcasting an Array to its specific extension
type). For MonthDayNanos the struct would be struct<int32, int32, int64>,
but there would still be a class MonthDayNanos which extends ExtensionArray
and could have "hot path" operations precompiled.

For truly dynamic structs, I agree they would introduce some amount of
overhead, but I'm not sure how bad it would actually be. This is an
important question if we ever get to a point where someone wants to propose
a row-oriented analogue to the existing column-oriented specification.

> For this to work efficiently, IMO we would need some kind of "c extension"
> whereby people could declare a c struct as part of the extension, which
> consumers would compile to their own language for consumption.

This is my understanding of the JIRA linked earlier for packed struct. It
is a language-independent way to define a fixed memory layout. Fancier
implementations could potentially do JIT based on the definition.

> Micah, I was thinking about the page with the memory layout [1],
> specifically the primitive section, where some mental effort is required to
> interpret the interval types as primitives (but not the FixedSizeBinary);
> my understanding is that the former has a known packed struct while the
> latter does not.

Unfortunately primitives is an overloaded term. I've come to generally
understand it as everything that isn't a Union, Struct (interpreted as a
bag of columns) or List.

Cheers,
Micah

On Thu, Sep 2, 2021 at 7:40 AM Jorge Cardoso Leitão
<jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Thanks a lot for the feedback. Atm I was really just trying to get whether
> others also saw these types as these packed structs.
>
> Wrt to the extension type: I am not sure we can make it fast, though: the
> interpretation of the bytes would need to be done dynamically (instead of
> statically) because we can't compile the struct prior to receiving it (via
> IPC or FFI). This interpretation would be part of hot loops (as we would
> need to interpret the bytes on every element).
>
> For this to work efficiently, IMO we would need some kind of "c extension"
> whereby people could declare a c struct as part of the extension, which
> consumers would compile to their own language for consumption. My
> understanding is that in essence this is what we have been doing for the
> interval types when we write things like
>
> "A triple of the number of elapsed months, days, and nanoseconds.
> // The values are stored contiguously in 16 byte blocks. Months and
> // days are encoded as 32 bit integers and nanoseconds is encoded as a
> // 64 bit integer. All integers are signed."
>
> to declare the struct, which implementations hard-code in their source code.
>
> It is interesting that these resemble the idea of protobuf and thrift but
> at the intra-process level (FFI).
>
> Micah, I was thinking about the page with the memory layout [1],
> specifically the primitive section, where some mental effort is required to
> interpret the interval types as primitives (but not the FixedSizeBinary);
> my understanding is that the former has a known packed struct while the
> latter does not.
>
> Best,
> Jorge
>
> [1]
> https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout
>
> On Thu, Sep 2, 2021 at 4:45 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> I agree, it is what I would have proposed for the interval type if there
>> wasn't an interval type in Arrow already. I think FixedSizeList has for
>> better or worse solved a lot of the problems that a struct type would be
>> used for (e.g. coordinates)
>>
>> Cheers,
>> Micah
>>
>> On Tue, Aug 31, 2021 at 8:27 AM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > I do still think that having a "packed C struct" type would be a
>> > useful thing, but thus far no one has needed it enough to develop
>> > something in the columnar format specification.
>> >
>> > On Tue, Aug 31, 2021 at 1:33 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> > >
>> > > Hi Jorge,
>> > > Are there places in the docs that you think this would simplify?
>> > > There is an old JIRA [1] about introducing a c-struct type that I
>> > > think aligns with this observation [1]
>> > >
>> > > -Micah
>> > >
>> > > [1] https://issues.apache.org/jira/browse/ARROW-1790
>> > >
>> > > On Mon, Aug 30, 2021 at 2:57 PM Jorge Cardoso Leitão
>> > > <jorgecarlei...@gmail.com> wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > Just came across this curiosity that IMO may help us to design physical
>> > > > types in the future.
>> > > >
>> > > > Not sure if this was mentioned before, but it seems to me that
>> > > > `DaysMilliseconds` and `MonthDayNano` belong to a broader class of physical
>> > > > types "typed tuples" in that they are constructed by defining the tuple
>> > > > `(t_1,t_2,...,t_N)` where t_i (e.g. int32) is representable in memory for a
>> > > > given endianness, and each element of the array is written to the buffer
>> > > > back to back as `<t1 in endianness><t2 in endianness>...<tN in endianness>`.
>> > > >
>> > > > Primitive arrays such as e.g. `Int32Array` are the extreme case where the
>> > > > tuple has a single entry (t1,), which leads to `<int32 in endianness>`. The
>> > > > others are:
>> > > > * DaysMilliseconds = (int32, int32)
>> > > > * MonthDayNano = (int32, int32, int64)
>> > > >
>> > > > In principle, we could re-write the in-memory layout page in these terms
>> > > > that places all the types above in the same "bucket".
>> > > >
>> > > > Best,
>> > > > Jorge
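
For readers following the thread, the "typed tuple" layout discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any Arrow implementation does it: the names FIELD_CODES, compile_layout, pack_values and unpack_values are hypothetical, little-endian is assumed as the default byte order, and only int32 and int64 fields are handled. It uses the standard-library struct module to show that a layout declared at runtime as a list of field types can be turned into a fixed-size decoder once, outside the per-element loop.

```python
import struct

# Hypothetical mapping from field type names to struct format codes.
FIELD_CODES = {"int32": "i", "int64": "q"}


def compile_layout(field_types, little_endian=True):
    """Build a fixed-size, unpadded layout from a runtime list of field types."""
    prefix = "<" if little_endian else ">"  # explicit byte order disables padding
    return struct.Struct(prefix + "".join(FIELD_CODES[t] for t in field_types))


def pack_values(layout, rows):
    """Write each tuple into one buffer, back to back."""
    buf = bytearray(layout.size * len(rows))
    for i, row in enumerate(rows):
        layout.pack_into(buf, i * layout.size, *row)
    return bytes(buf)


def unpack_values(layout, buf):
    """Decode every fixed-size element of the buffer into a tuple."""
    return list(layout.iter_unpack(buf))


# The two tuples named in the thread.
days_milliseconds = compile_layout(["int32", "int32"])        # 8 bytes per element
month_day_nano = compile_layout(["int32", "int32", "int64"])  # 16 bytes per element

rows = [(1, 15, 1_000_000_000), (-2, 0, 500)]
buf = pack_values(month_day_nano, rows)
decoded = unpack_values(month_day_nano, buf)
```

Here the format string is parsed once in compile_layout, so the element loop only applies an already-built decoder. Whether that is fast enough for the "truly dynamic struct" case raised in the thread is left open; the sketch shows the shape of the idea, not its performance.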