Hi Wes,
Yes, sorry for the mess. Here is the message in plain text:
The libgdf project defines a column structure that, in simplified form,
could be represented as

typedef struct {
    void *data;                     // column data
    unsigned char *valid;           // validity mask, one bit per column item
    size_t size;                    // number of items
    enum {INT8, INT16, ...} dtype;  // type of column item
    size_t null_count;              // number of non-valid items
} my_column_t;
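
To make the role of the validity mask concrete, here is a minimal sketch of how the mask and null_count relate. It rests on assumptions not stated above: bit i of the mask is taken to live in byte i/8 at bit position i%8 (least-significant bit first, the convention Arrow uses for its validity bitmaps), a NULL mask is taken to mean "all items valid", and the dtype enum is cut down to two members. The helper names my_column_is_valid and my_column_count_nulls are hypothetical and are not part of libgdf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified column; the enum is truncated to two members for illustration. */
typedef enum { INT8, INT16 } my_dtype_t;

typedef struct {
    void *data;            /* column data */
    unsigned char *valid;  /* validity mask, one bit per column item */
    size_t size;           /* number of items */
    my_dtype_t dtype;      /* type of column item */
    size_t null_count;     /* number of non-valid items */
} my_column_t;

/* Hypothetical helper: returns 1 if item i is valid, 0 if it is null.
   Assumes LSB-first bit order within each mask byte. */
static int my_column_is_valid(const my_column_t *col, size_t i) {
    assert(i < col->size);
    if (col->valid == NULL) {
        return 1;  /* assumption: a missing mask means every item is valid */
    }
    return (col->valid[i / 8] >> (i % 8)) & 1;
}

/* Hypothetical helper: recomputes the number of non-valid items by
   scanning the mask, e.g. to check a stored null_count. */
static size_t my_column_count_nulls(const my_column_t *col) {
    size_t nulls = 0;
    for (size_t i = 0; i < col->size; ++i) {
        if (!my_column_is_valid(col, i)) {
            ++nulls;
        }
    }
    return nulls;
}
```

Under these assumptions, a five-item column whose single mask byte is 0x1B (bits 0, 1, 3 and 4 set) has exactly one null, at index 2. If the real mask uses the same bit order as Arrow, the mask bytes could be shared with an Arrow validity buffer without conversion.
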
The aim is to implement an IPC protocol for sharing my_column_t data between
host and GPU devices.
What would be the most sensible way to do that using the tools available in
the Arrow library?
We are currently considering the following approaches:
1. Re-using Arrow Array: my_column_t and Arrow Array have a one-to-one
correspondence in data content.
2. Defining a new Arrow format, MyColumn (using Arrow Tensor as an example):
table MyColumn {
  /// The type of data contained in a value cell.
  type: Type;
  /// The number of non-valid items
  null_count: long;
  /// The location and size of the column's data
  data: Buffer;
  /// The location and size of the column's mask
  valid: Buffer;
}
We are uncertain which approach would be easiest to implement and maintain,
which would be efficient (zero-copy), or whether either makes sense at all.
Defining Arrow MyColumn seems appealing because Arrow Tensor has about 7 times
less code than Arrow Array. However, Arrow Array already includes a validity
mask.
What do you think?
Best regards,
Pearu
On Wed, Aug 22, 2018 at 11:53 PM, Wes McKinney <[email protected]> wrote:
> Hi Pearu,
>
> Seems the formatting of your email got messed up a little bit. Can you
> resend with some more line breaks?
>
> Thanks
>
>
> On Wed, Aug 22, 2018, 4:46 PM Pearu Peterson <[email protected]> wrote:
>
> > *Hi,The libgdf project defines a column structure that in a simplified
> form
> > could be represented astypedef struct { void *data;
> //
> > column data unsigned char *valid; // validity mask // one bit per
> column
> > item size_t size; // nof items enum {INT8, INT16,
> > ...} dtype; // type of column item size_t null_count; // nof
> > non-valid items} my_column_t;The aim is to implement IPC protocol for
> > sharing my_column_t data between host and GPU devices. What would be the
> > most sensible way to do that using tools available in Arrow library?We
> are
> > currently considering the following approaches:1. Re-using Arrow Array
> > (C++): my_column_t and Arrow Array have one-to-one correspondence
> regarding
> > data content.2. Defining new Arrow format MyColumn (using Arrow Tensor as
> > an example):table MyColumn { /// The type of data contained in a value
> > cell. type: Type; /// The number of non-valid items null_count: long;
> > /// The location and size of the column's data data: Buffer; /// The
> > location and size of the column's mask valid: Buffer;}We are uncertain
> > which approach would be easiest to implement and maintain, be efficient
> > (0-copy), or would make sense at all.Defining Arrow MyColumn seems
> > appealing because of about 7 times less code in Arrow Tensor than in
> Arrow
> > Array. However, Arrow Array includes validity mask already.What do you
> > think?Best regards,Pearu*
> >
>