Hi Jacques,

> That's quite interesting. Can you share more about the use case?
Sorry, I realized I missed answering this. We are still investigating, so the
initial diagnosis might be off. The use case is a data transfer application:
reading data at rest, translating it to Arrow, and sending it out to clients.

I look forward to hearing your thoughts on the rest of the proposal.

Thanks,
Micah

On Sat, Jul 6, 2019 at 2:53 PM Jacques Nadeau <jacq...@apache.org> wrote:

>>> What is the driving force for transport compression? Are you seeing that
>>> as a major bottleneck in particular circumstances? (I'm not disagreeing,
>>> just want to clearly define the particular problem you're worried about.)
>>
>> I've been working on a 20% project where we appear to be IO bound for
>> transporting record batches. Also, I believe Ji Liu (tianchen92) has been
>> seeing some of the same bottlenecks with the query engine they are
>> working on. Trading off some CPU here would allow us to lower the overall
>> latency in the system.
>
> That's quite interesting. Can you share more about the use case? With the
> exception of broadcast and round-robin type distribution patterns, we find
> that there are typically more cycles spent on partitioning the data being
> sent, so that being IO bound is less of a problem. In most of our
> operations, almost all of the largest workloads are done via partitioning,
> thus it isn't typically a problem. (We also have clients with 10gbps and
> 100gbps network interconnects...) Are you partitioning the data pre-send?
>
>>> Random thought: what do you think of defining this at the transport level
>>> rather than the record batch level? (e.g. in Arrow Flight). This is one
>>> way to avoid extending the core record batch concept with something that
>>> isn't related to processing (at least in your initial proposal).
>>
>> Per above, this seems like a reasonable approach to me if we want to hold
>> off on buffer-level compression. Another use case for buffer/record-batch
>> level compression would be the Feather file format, for decompressing only
>> a subset of columns/rows. If this use case isn't compelling, I'd be happy
>> to hold off adding compression to sparse batches until we have benchmarks
>> showing the trade-off between channel-level and buffer-level compression.
>
> I was proposing that type-specific buffer encodings be done at the Flight
> level, not message-level encodings. Just want to make sure the formats
> don't leak into the core spec until we're ready.
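
[Editor's sketch] To make the buffer-level vs. channel-level distinction
discussed above concrete, here is a minimal Python sketch using pyarrow's
generic compress/decompress helpers. This is purely illustrative and assumed
for this note; it is not part of the proposal, the IPC format, or the Flight
API. Buffer-level compression compresses each column's buffers independently,
so a reader can decompress only the columns it needs (the Feather-style use
case); channel-level compression compresses the whole serialized stream once,
as a transport might.

    import pyarrow as pa

    # A small record batch with two columns.
    batch = pa.RecordBatch.from_arrays(
        [pa.array(range(1000)), pa.array(["x"] * 1000)],
        names=["id", "tag"],
    )

    # Buffer-level: compress each buffer of each column independently,
    # recording the original size so it can be decompressed later.
    compressed_columns = {}
    for name, column in zip(batch.schema.names, batch.columns):
        compressed_columns[name] = [
            (buf.size, pa.compress(buf, codec="lz4")) if buf is not None else None
            for buf in column.buffers()
        ]

    # A reader interested only in "id" decompresses just that column's buffers.
    id_buffers = [
        None if entry is None
        else pa.decompress(entry[1], decompressed_size=entry[0], codec="lz4")
        for entry in compressed_columns["id"]
    ]

    # Channel-level: serialize the whole batch with the IPC stream writer
    # and compress the resulting payload once.
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    compressed_stream = pa.compress(sink.getvalue(), codec="lz4")

The only point of the sketch is granularity: buffer-level compression keeps
per-column selectivity at the cost of per-buffer framing metadata, while
channel-level compression is simpler and can live entirely in the transport
without touching the record batch format.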