Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-12 Thread Antoine Pitrou
I think it would be worthwhile to split the discussion into two separate threads. One thread for compression & encodings (which are related or even the same topic), one thread for data integrity. Regards Antoine. On 08/07/2019 at 07:22, Micah Kornfield wrote: > > - Compression: >* Us

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-09 Thread Micah Kornfield
Hi Jacques, > That's quite interesting. Can you share more about the use case. Sorry I realized I missed answering this. We are still investigating, so the initial diagnosis might be off. The use-case is a data transfer application, reading data at rest, translating it to arrow and sending it o

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-08 Thread Ji Liu
Hi Micah, Thanks for opening this discussion. Similar to Liya Fan, I generally agree with you on most features. As you mentioned above, we have made some attempts in our application to reduce data size, for example, data encoding and RecordBatch compact[1], and it has significant performance be

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-08 Thread Fan Liya
Hi Micah, Thanks for opening this discussion. For me, most of the features are super useful, especially RLE and integer encoding. IMO, to support these new features, we need some basic algorithms first (e.g. sort and search). For example, RLE and sort are often used in combination. These new fea
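
The remark above that RLE and sort are often used in combination can be illustrated with a minimal sketch. It is plain Python, not Arrow's API or any proposed format; `rle_encode` and the sample column are hypothetical names and data chosen only for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Return (value, run_length) pairs for consecutive equal values."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

# Hypothetical column with few distinct values in no particular order.
column = [3, 1, 3, 1, 1, 3]

unsorted_runs = rle_encode(column)        # one run per change of value
sorted_runs = rle_encode(sorted(column))  # sorting groups equal values together

print(len(unsorted_runs), len(sorted_runs))
```

Sorting first collapses the column into one run per distinct value, which is why run-length encoding tends to pay off most on sorted or clustered data.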

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-07 Thread Micah Kornfield
Hi Paul, Jacques and Antoine, Thank you for the valuable feedback. I'm going to try to address it all in this e-mail to help consolidate the conversation. I've grouped my responses by topic and included snippets from other e-mails where relevant. *Timeline of any features: * - So far the sentim

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-06 Thread Jacques Nadeau
> > What is the driving force for transport compression? Are you seeing that >> as a major bottleneck in particular circumstances? (I'm not disagreeing, >> just want to clearly define the particular problem you're worried about.) > > > I've been working on a 20% project where we appear to be IO bou

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-06 Thread Antoine Pitrou
Hi Micah, On 05/07/2019 at 20:53, Micah Kornfield wrote: > > Going into more details on the specific features in the PR: > >1. > >Sparse encodings for arrays and buffers. The guiding principles behind >the suggested encodings are to support encodings that can be exploited by >

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-06 Thread Paul Taylor
Hi Micah, Similar to Jacques I'm not disagreeing, but wondering if they belong in Arrow vs. can be done externally. I'm mostly interested in changes that might impact SIMD processing, considering Arrow's already made conscious design decisions to trade memory for speed. Apologies in advance if

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Hi Jacques, I think our e-mails might have crossed, so I'm consolidating my responses from the previous e-mail as well. > I don't think most of this should be targeted for 1.0. It is a lot of > change/enhancement and seems like it would likely substantially delay 1.0. I agree it shouldn't block 1.

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Jacques Nadeau
One question and a random thought: What is the driving force for transport compression? Are you seeing that as a major bottleneck in particular circumstances? (I'm not disagreeing, just want to clearly define the particular problem you're worried about.) Random thought: what do you think of defin

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Hi Jacques, Thanks for the quick response. > I don't think most of this should be targeted for 1.0. It is a lot of > change/enhancement and seems like it would likely substantially delay 1.0. I agree it shouldn't block 1.0. I think time-based releases are working well for the community. But if

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Micah Kornfield
Strange, I've pasted the contents into a Google document at [1] [1] https://docs.google.com/document/d/1uJzWh63Iqk7FRbElHPhHrsmlfe0NIJ6M8-0kejPmwIw/edit On Fri, Jul 5, 2019 at 12:32 PM Jacques Nadeau wrote: > Hey Micah, your formatting seems to be messed up on this mail. Some kind > of copy/

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Jacques Nadeau
Initial thought: I don't think most of this should be targeted for 1.0. It is a lot of change/enhancement and seems like it would likely substantially delay 1.0. The one piece that seems least disruptive would be basic on-the-wire compression. You suggested that this be done on the buffer level but

Re: [Discuss] Format additions to Arrow for sparse data and data integrity

2019-07-05 Thread Jacques Nadeau
Hey Micah, your formatting seems to be messed up on this mail. Some kind of copy/paste error? On Fri, Jul 5, 2019 at 11:54 AM Micah Kornfield wrote: > Hi Arrow-dev, > > I’d like to make a straw-man proposal to cover some features that I think > would be useful to Arrow, and that I would like t