I am wondering if this could be an opportunity to add support in Arrow for
building high-performance in-memory row stores for low-latency,
high-throughput key-based queries. In other words, we could design the
in-memory record format with efficient RDMA reads as one of the goals.
Consider two data structures in memory -- a hash table and a row store
comprising records in an Arrow row format. The hash table points into the
row store, and information can be read from both data structures without
interrupting the CPU on the server. This client-server code path could also
be incorporated into Arrow Flight.
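
As a rough sketch of what that could look like (not an existing Arrow or
Arrow Flight API; the bucket layout and names below are invented for
illustration), the hash table could be a flat bucket array whose entries
carry the offset and length of a record in the row-store region, so a client
could resolve a key with two remote reads, one for the bucket and one for
the row:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Rough sketch only -- not an existing Arrow or Arrow Flight API. It models a
// flat, offset-addressable hash table whose buckets point into a row-format
// region, so a remote client could in principle resolve a key with two RDMA
// reads (bucket, then row) without involving the server CPU.
public class RdmaRowStoreSketch {
    static final int BUCKET_BYTES = 16;  // [hash:8][rowOffset:4][rowLength:4]
    final ByteBuffer buckets;            // hash table region
    final ByteBuffer rows;               // row-store region holding the records
    final int numBuckets;

    RdmaRowStoreSketch(int numBuckets, int rowRegionBytes) {
        this.numBuckets = numBuckets;
        this.buckets = ByteBuffer.allocateDirect(numBuckets * BUCKET_BYTES)
                                 .order(ByteOrder.LITTLE_ENDIAN);
        this.rows = ByteBuffer.allocateDirect(rowRegionBytes)
                              .order(ByteOrder.LITTLE_ENDIAN);
    }

    // Server side: register a row that already lives at rowOffset in the row region.
    void put(long hash, int rowOffset, int rowLength) {
        int b = (int) Long.remainderUnsigned(hash, numBuckets) * BUCKET_BYTES;
        buckets.putLong(b, hash);
        buckets.putInt(b + 8, rowOffset);
        buckets.putInt(b + 12, rowLength);
    }

    // Client side: with RDMA these two reads would target remote memory; here
    // they are plain buffer reads against the same layout.
    ByteBuffer get(long hash) {
        int b = (int) Long.remainderUnsigned(hash, numBuckets) * BUCKET_BYTES;
        if (buckets.getLong(b) != hash) {
            return null;                 // collision handling omitted in this sketch
        }
        ByteBuffer row = rows.duplicate();
        row.position(buckets.getInt(b + 8));
        row.limit(row.position() + buckets.getInt(b + 12));
        return row.slice();
    }
}

Collision handling, memory registration, and the actual RDMA verbs are left
out; the point is only that both regions are flat and addressable by offset.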

On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau <jacq...@apache.org> wrote:

> I'm not sure this makes sense as an external stable api. I definitely think
> it is useful as an internal representation for use within a particular
> algorithm. I also think that can be informed by the particular algorithm
> that you're working on.
>
> We definitely had this requirement in Dremio and came up with an internal
> representation that we are happy with for the use in hash tables. I'll try
> to dig up the design docs we had around this but the actual
> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>
> Our main model is two blocks: a fixed width block and a variable width
> block (with the fixed width block also carrying address & length of the
> variable data). Fixed width is randomly accessible and variable width is
> randomly accessible through fixed width.
>
> [1] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
> [2] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
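
As a toy illustration of that two-block idea (this is not the Dremio code
linked at [1] and [2]; the slot layout and names are invented here), a pivot
from two columns into a fixed-width block plus a variable-width block might
look like:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Toy illustration of the two-block model described above, not the Dremio
// Pivots/Unpivots code. Each record gets a fixed-width slot
// [id:4][varOffset:4][varLength:4]; the variable-width bytes are appended to
// a separate block and addressed through the slot.
class TwoBlockPivot {
    static final int SLOT_BYTES = 12;

    // "Pivot" two parallel columns into a fixed-width block and a variable-width block.
    static ByteBuffer[] pivot(int[] ids, String[] names) {
        ByteBuffer fixed = ByteBuffer.allocate(ids.length * SLOT_BYTES)
                                     .order(ByteOrder.LITTLE_ENDIAN);
        int varSize = 0;
        for (String name : names) {
            varSize += name.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer variable = ByteBuffer.allocate(varSize);
        for (int i = 0; i < ids.length; i++) {
            byte[] bytes = names[i].getBytes(StandardCharsets.UTF_8);
            fixed.putInt(ids[i]);
            fixed.putInt(variable.position());  // offset of the variable-width value
            fixed.putInt(bytes.length);         // length of the variable-width value
            variable.put(bytes);
        }
        return new ByteBuffer[] { fixed, variable };
    }
}

Random access then goes through the fixed-width block: record i's slot starts
at i * SLOT_BYTES, and its variable-width value is reachable via the stored
offset and length, matching the "randomly accessible through fixed width"
property described above.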
>
> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney <wesmck...@gmail.com>
> wrote:
>
> > hi Antoine,
> >
> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou <anto...@python.org>
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > On 24/06/2018 at 08:24, Wes McKinney wrote:
> > >>
> > >> If this sounds interesting to the community, I could help to kickstart
> > >> a design process which would likely take a significant amount of time.
> > >> The requirements could be complex (i.e. we might want to support
> > >> variable-size record fields while also providing random access
> > >> guarantees).
> > >
> > > What do you call "variable-sized" here? A scheme where the length of a
> > > record's field is determined by the value of another field in the same
> > > record?
> >
> > As an example, here is a fixed size record
> >
> > record foo {
> >   a: int32;
> >   b: float64;
> >   c: uint8;
> > }
> >
> > With padding, suppose this is 16 bytes per record; so if we have a
> > column of these, then randomly accessing any value in any record is
> > simple.
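
For instance, assuming a packed 16-byte layout with a at offset 0, c at
offset 4 (followed by padding), and b at offset 8 (the exact field order is
only illustrative), field b of record i in a flat buffer is a single load:

import java.nio.ByteBuffer;

class FixedRecordAccess {
    // Hypothetical 16-byte layout per record: a (int32) at offset 0, c (uint8)
    // at offset 4, three bytes of padding, b (float64) at offset 8. Assumes the
    // buffer's byte order matches the data.
    static double readB(ByteBuffer column, int i) {
        return column.getDouble(i * 16 + 8);
    }
}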
> >
> > Here's a variable-length record:
> >
> > record bar {
> >   a: string;
> >   b: list<int32>;
> > }
> >
> > What I've seen done to represent this in memory is to have a fixed
> > size record followed by a sidecar containing the variable-length data,
> > so the fixed size portion might look something like
> >
> > a_offset: int32;
> > a_length: int32;
> > b_offset: int32;
> > b_length: int32;
> >
> > So from this, you can do random access into the record. If you wanted
> > to do random access on a _column_ of such records, it is similar to
> > our current variable-length Binary type. So it might be that the
> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
> > records and variable Binary for variable-size records.
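
A sketch of that read path, assuming the fixed-size portion is exactly the
four int32 fields above (16 bytes per record), the offsets are byte offsets
into the sidecar, and both portions live in flat little-endian buffers (the
class and method names are made up):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch only: random access into a column of variable-size "bar" records.
// `fixed` holds 16-byte entries [a_offset:4][a_length:4][b_offset:4][b_length:4];
// `sidecar` holds the variable-length bytes those offsets point into.
class BarReader {
    final ByteBuffer fixed;
    final ByteBuffer sidecar;

    BarReader(ByteBuffer fixed, ByteBuffer sidecar) {
        this.fixed = fixed.order(ByteOrder.LITTLE_ENDIAN);
        this.sidecar = sidecar.order(ByteOrder.LITTLE_ENDIAN);
    }

    // Field a of record i, interpreted as a UTF-8 string.
    String a(int i) {
        int offset = fixed.getInt(i * 16);
        int length = fixed.getInt(i * 16 + 4);
        byte[] out = new byte[length];
        for (int j = 0; j < length; j++) {
            out[j] = sidecar.get(offset + j);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    // Element j of field b (a list<int32>) of record i.
    int bElement(int i, int j) {
        int offset = fixed.getInt(i * 16 + 8);
        return sidecar.getInt(offset + j * 4);
    }
}

Random access to record i then costs one read of its 16-byte fixed portion
plus reads of the referenced sidecar slices, analogous to how the
variable-length Binary layout is addressed today.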
> >
> > - Wes
> >
> > >
> > >
> > >
> > > Regards
> > >
> > > Antoine.
> >
>
