Hi Jacques,

> That's quite interesting. Can you share more about the use case?
Sorry, I realized I missed answering this. We are still investigating, so the
initial diagnosis might be off. The use case is a data transfer application:
reading data at rest, translating it to Arrow, and sending it out to clients.

I look forward to hearing your thoughts on the rest of the proposal.

Thanks,
Micah

On Sat, Jul 6, 2019 at 2:53 PM Jacques Nadeau <jacq...@apache.org> wrote:

>>> What is the driving force for transport compression? Are you seeing that
>>> as a major bottleneck in particular circumstances? (I'm not disagreeing,
>>> just want to clearly define the particular problem you're worried about.)
>>
>> I've been working on a 20% project where we appear to be IO bound for
>> transporting record batches. Also, I believe Ji Liu (tianchen92) has been
>> seeing some of the same bottlenecks with the query engine they are
>> working on. Trading off some CPU here would allow us to lower the overall
>> latency in the system.
>
> That's quite interesting. Can you share more about the use case? With the
> exception of broadcast and round-robin type distribution patterns, we find
> that there are typically more cycles spent on partitioning the data being
> sent, so that being IO bound is less of a problem. In most of our
> operations, almost all of the largest workloads are done via partitioning,
> thus it isn't typically a problem. (We also have clients with 10gbps and
> 100gbps network interconnects...) Are you partitioning the data pre-send?
>
>>> Random thought: what do you think of defining this at the transport level
>>> rather than the record batch level? (e.g. in Arrow Flight). This is one
>>> way to avoid extending the core record batch concept with something that
>>> isn't related to processing (at least in your initial proposal).
>>
>> Per above, this seems like a reasonable approach to me if we want to hold
>> off on buffer-level compression. Another use case for buffer/record-batch
>> level compression would be the Feather file format, for decompressing only
>> a subset of columns/rows. If this use case isn't compelling, I'd be happy
>> to hold off adding compression to sparse batches until we have benchmarks
>> showing the trade-off between channel-level and buffer-level compression.
>
> I was proposing that type-specific buffer encodings be done at the Flight
> level, not message-level encodings. Just want to make sure the formats
> don't leak into the core spec until we're ready.
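
[Editor's sketch] To make the buffer-level vs. channel-level distinction
discussed above concrete, here is a minimal Python sketch using pyarrow's
generic compress/decompress helpers. This is purely illustrative and assumed
for this note; it is not part of the proposal, the IPC format, or the Flight
API. Buffer-level compression compresses each column's buffers independently,
so a reader can decompress only the columns it needs (the Feather-style use
case); channel-level compression compresses the whole serialized stream once,
as a transport might.

    import pyarrow as pa

    # A small record batch with two columns.
    batch = pa.RecordBatch.from_arrays(
        [pa.array(range(1000)), pa.array(["x"] * 1000)],
        names=["id", "tag"],
    )

    # Buffer-level: compress each buffer of each column independently,
    # recording the original size so it can be decompressed later.
    compressed_columns = {}
    for name, column in zip(batch.schema.names, batch.columns):
        compressed_columns[name] = [
            (buf.size, pa.compress(buf, codec="lz4")) if buf is not None else None
            for buf in column.buffers()
        ]

    # A reader interested only in "id" decompresses just that column's buffers.
    id_buffers = [
        None if entry is None
        else pa.decompress(entry[1], decompressed_size=entry[0], codec="lz4")
        for entry in compressed_columns["id"]
    ]

    # Channel-level: serialize the whole batch with the IPC stream writer
    # and compress the resulting payload once.
    sink = pa.BufferOutputStream()
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    compressed_stream = pa.compress(sink.getvalue(), codec="lz4")

The only point of the sketch is granularity: buffer-level compression keeps
per-column selectivity at the cost of per-buffer framing metadata, while
channel-level compression is simpler and can live entirely in the transport
without touching the record batch format.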