I added a line item for this on the project wiki. When I or someone has some time, we can spend some time writing a requirements / design sketch document to help with further discussions.
On Wed, Jun 27, 2018 at 7:09 AM, Wes McKinney <wesmck...@gmail.com> wrote: >> I'm not sure this makes sense as an external stable api. I definitely think >> it is useful as an internal representation for use within a particular >> algorithm. I also think that can be informed by the particular algorithm >> that you're working on. > > I agree that this is definitely needed in certain algorithms, e.g. > certain types of hashing. And the memory layout that's best for a > given algorithm might change. Since we have a number of support > structures already in place for columnar data, like dictionary > encoding, it would be easier to implement these things for > row-oriented data. > > I think the question is really about open standards. Our original > focus when we started the project was to develop an open standard for > columnar data. It seems valuable to have one for row-oriented data. > Given how many systems have developed their own internal formats, it > seems like an inevitability. This begs the questions: if it does not > happen here, then where? and if not now, then when? > > That being said, it's hard to say how feasible the project would be > until we gather more requirements and non-requirements. > > On Wed, Jun 27, 2018 at 3:20 AM, Siddharth Teotia <siddha...@dremio.com> > wrote: >> I am wondering if this can be considered as an opportunity to implement >> support in Arrow for building high performance in-memory row stores for low >> latency and high throughput key based queries. In other words, we can >> design the in-memory record format keeping efficient RDMA reads as one of >> the goals too. Consider two data structures in memory -- a hash table and >> a row-store comprising of records in Arrow row format. Hashtable points to >> row store and information can be read from both data structures without >> interrupting the CPU on server. This client-server code-path support can >> also be incorporated into Arrow Flight >> >> On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau <jacq...@apache.org> wrote: >> >>> I'm not sure this makes sense as an external stable api. I definitely think >>> it is useful as an internal representation for use within a particular >>> algorithm. I also think that can be informed by the particular algorithm >>> that you're working on. >>> >>> We definitely had this requirement in Dremio and came up with an internal >>> representation that we are happy with for the use in hash tables. I'll try >>> to dig up the design docs we had around this but the actual >>> pivoting/unpivoting code that we developed can be seen here: [1], [2]. >>> >>> Our main model is two blocks: a fixed width block and a variable width >>> block (with the fixed width block also carrying address & length of the >>> variable data). Fixed width is randomly accessible and variable width is >>> randomly accessible through fixed width. >>> >>> [1] >>> https://github.com/dremio/dremio-oss/blob/master/sabot/ >>> kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java >>> [2] >>> https://github.com/dremio/dremio-oss/blob/master/sabot/ >>> kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java >>> >>> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney <wesmck...@gmail.com> >>> wrote: >>> >>> > hi Antoine, >>> > >>> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <anto...@python.org> >>> > wrote: >>> > > >>> > > Hi Wes, >>> > > >>> > > Le 24/06/2018 à 08:24, Wes McKinney a écrit : >>> > >> >>> > >> If this sounds interesting to the community, I could help to kickstart >>> > >> a design process which would likely take a significant amount of time. >>> > >> The requirements could be complex (i.e. we might want to support >>> > >> variable-size record fields while also providing random access >>> > >> guarantees). >>> > > >>> > > What do you call "variable-sized" here? A scheme where the length of a >>> > > record's field is determined by the value of another field in the same >>> > > record? >>> > >>> > As an example, here is a fixed size record >>> > >>> > record foo { >>> > a: int32; >>> > b: float64; >>> > c: uint8; >>> > } >>> > >>> > With padding suppose this is 16 bytes per record; so if we have a >>> > column of these, then random accessing any value in any record is >>> > simple. >>> > >>> > Here's a variable-length record: >>> > >>> > record bar { >>> > a: string; >>> > b: list<int32>; >>> > } >>> > >>> > What I've seen done to represent this in memory is to have a fixed >>> > size record followed by a sidecar containing the variable-length data, >>> > so the fixed size portion might look something like >>> > >>> > a_offset: int32; >>> > a_length: int32; >>> > b_offset: int32; >>> > b_length: int32; >>> > >>> > So from this, you can do random access into the record. If you wanted >>> > to do random access on a _column_ of such records, it is similar to >>> > our current variable-length Binary type. So it might be that the >>> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size >>> > records and variable Binary for variable-size records. >>> > >>> > - Wes >>> > >>> > > >>> > > >>> > > >>> > > Regards >>> > > >>> > > Antoine. >>> > >>>