hi folks,

Some time ago I opened ARROW-1790 based on discussions I'd had with users on the mailing list and in person about how to deal with data laid out like a C array of structs. While we do have Structs in the Arrow columnar format, they are "fully shredded" columnar structs: each field of the struct is stored as its own contiguous child array rather than interleaved record by record.
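To make the distinction concrete, here is a rough C++ sketch (Record, RecordColumns, and Shred are purely illustrative names, not a proposed API) contrasting the row layout with the fully-shredded one, plus the sort of "shredding" kernel mentioned below:

    // Minimal illustration only; these are hypothetical names,
    // not Arrow APIs.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Row-oriented: a C array of structs. All field values for one
    // record sit together in memory, as in Impala's TupleRow or
    // Spark's Tungsten rows.
    struct Record {
      int64_t id;
      double score;
    };

    // "Fully shredded" column-oriented layout, as in the Arrow
    // columnar format: each field becomes its own contiguous array.
    struct RecordColumns {
      std::vector<int64_t> id;
      std::vector<double> score;
    };

    // "Shredding"/"pivoting" records into columns -- the kind of
    // tight loop one could generate efficient LLVM IR for.
    RecordColumns Shred(const std::vector<Record>& rows) {
      RecordColumns cols;
      cols.id.reserve(rows.size());
      cols.score.reserve(rows.size());
      for (const Record& r : rows) {
        cols.id.push_back(r.id);
        cols.score.push_back(r.score);
      }
      return cols;
    }

    int main() {
      std::vector<Record> rows = {{1, 0.5}, {2, 1.5}, {3, 2.5}};
      RecordColumns cols = Shred(rows);
      std::printf("%zu ids, %zu scores\n",
                  cols.id.size(), cols.score.size());
      return 0;
    }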
Many systems such as Apache Impala (TupleRow, used in row batches), Apache Kudu (used in client RPCs), Apache Spark (off-heap "unsafe row" aka Tungsten), NumPy (structured dtypes), and others have in-memory data structures supporting record-oriented data. As far as I know, there is not an open standard for this type of data. Developing this within Apache Arrow would serve a couple of purposes:

* To have an open standard for in-memory records under ASF community governance. Achieving consensus in this setting would have a lot of long-term value and accelerate adoption.

* To provide a means to embed sequences of records in the Arrow columnar format.

In light of efforts to create LLVM codegen infrastructure for Arrow (Gandiva), it stands to reason that we could develop LLVM IR for manipulating columns of records in a coherent algebraic expression framework. For example: efficient LLVM code generation for "shredding" or "pivoting" records into fully-shredded columnar format.

If this sounds interesting to the community, I could help kickstart a design process, which would likely take a significant amount of time. The requirements could be complex (e.g. we might want to support variable-size record fields while also providing random access guarantees). We could use the ASF's Confluence wiki to house the documents and facilitate discussion.

Thanks,
Wes