Hello Micah, thanks for the valuable input, replies inline.

On Fri, 13 Sept 2024 at 01:06, Micah Kornfield <emkornfi...@gmail.com> wrote:
> IIUC you are advocating for something which to some extent is a
> generalization of Iceberg's REST catalog planning API [1]. But instead
> of just Iceberg, it would encompass other formats, and also be more
> flexible on negotiating which parts of the computation client vs server
> are responsible for?

Yes, that is a fair rough description of table transfer protocols.
However, I de-emphasize "formats" and instead focus on column-level
encodings.

> The main downside to this approach is you end up requiring client side
> libraries for any potential scan operation (e.g. Iceberg, Parquet,
> Nimble, etc) or you need to negotiate with the server on supported
> tasks. This is solvable but adds complexity to deployments and
> operational debuggability.

Yes. The server tells the client which transfer encodings it supports for
each column, and Arrow encodings are always among them (at least if the
column type itself is known to Arrow) as a fallback. The client may decide
whether it can perform the scan operations on its side for the given
at-rest encoding/compression. If the client doesn't support the scan
operator for the at-rest encoding, it can either request the column in
Arrow format (that is, require the server to decode the column on its
side) or, to trade off network traffic against its own memory usage and
CPU, still request the server to transfer the column in the at-rest
format, decode it without applying the scan operation, and then apply the
scan operation to each decoded stripe/block in a generic way. (A rough
sketch of this client-side decision logic is at the end of this message.)

I agree this is complicated and re-introduces sub-optimality if the
processing engine (client) doesn't support the at-rest file format. But in
the absence of table transfer protocols, the situation is not any better.

However, it seems that a combination of Velox *or* DataFusion (libraries
specialised in scan operators on top of different file formats) *with*
Apache xTable that you referenced below (metadata abstraction) can go a
pretty long way towards my vision. Still, not *all* the way:
1) it doesn't enable closed-source file formats to join the party:
Firebolt, Oxla, SingleStore, Snowflake, etc.,
2) it doesn't leverage SSD/memory caches of column data and/or indexes, and
3) it doesn't cover alternative, non-object-storage architectures à la
Vastdata, BigQuery Managed storage, Meta's Tectonic, etc.

> > 5) *Reading compressed data* from object storage to external
> > processing engines -- Arrow Flight requires converting everything into
> > Arrow before sending. If the processing job needs to pull a lot of
> > Parquet files/columns (even after filtering partitions), converting
> > them all to Arrow before sending will, at a minimum, increase the job
> > latency quite a bit and, at a maximum, cost much more money for egress.
>
> The Dissociated IPC protocol [2] was a start in this direction, and I
> believe the original protocol mentioned referencing parquet files (which
> was removed).

Interesting, I didn't know that. I suggested basing table transfer
protocols on an (unnamed) streaming protocol for columnar data:
https://github.com/apache/arrow/issues/43762, whose main differences from
Dissociated IPC are 1) no hard requirement to use Arrow column encodings
only, and 2) reactive streams-style async flow control (a minimal sketch
follows below).
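To make "reactive streams-style async flow control" a bit more concrete,
here is a minimal, purely illustrative asyncio sketch of the idea. The
names (ChunkSubscription, request, acquire) are mine and are not part of
the proposal in apache/arrow#43762 nor of Dissociated IPC; the only point
is that the receiver grants demand with request(n) and the sender never
has more column chunks in flight than the outstanding demand:

import asyncio


class ChunkSubscription:
    """Receiver-driven demand ("credits"), as in Reactive Streams request(n)."""

    def __init__(self) -> None:
        self._credits = 0
        self._more = asyncio.Event()

    def request(self, n: int) -> None:
        # Receiver signals readiness for n more chunks.
        self._credits += n
        self._more.set()

    async def acquire(self) -> None:
        # Sender waits until at least one credit is outstanding.
        while self._credits == 0:
            self._more.clear()
            await self._more.wait()
        self._credits -= 1


async def sender(sub: ChunkSubscription, chunks, out: asyncio.Queue) -> None:
    for chunk in chunks:
        await sub.acquire()   # never exceed the receiver's demand
        await out.put(chunk)
    await out.put(None)       # end-of-stream marker


async def receiver(sub: ChunkSubscription, out: asyncio.Queue) -> None:
    sub.request(2)            # initial demand: two chunks in flight
    while (chunk := await out.get()) is not None:
        # ... decode the chunk and apply the scan operator here ...
        sub.request(1)        # replenish demand one chunk at a time


async def main() -> None:
    sub, queue = ChunkSubscription(), asyncio.Queue()
    chunks = [b"column-chunk-%d" % i for i in range(5)]
    await asyncio.gather(sender(sub, chunks, queue), receiver(sub, queue))


asyncio.run(main())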
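And, as promised above, a rough sketch of the per-column encoding
negotiation and the client's decision logic. This is hypothetical Python,
not a proposed wire format: the names (ColumnOffer, ColumnRequest,
server_side_scan, etc.) are made up for illustration. The only assumptions
taken from the discussion are that the server advertises per-column
transfer encodings with Arrow always offered as a fallback, and that the
client chooses between (a) scanning the at-rest encoding itself,
(b) pulling the at-rest bytes and scanning generically after decoding, or
(c) asking the server to decode (and scan) into Arrow:

from dataclasses import dataclass
from enum import Enum, auto


class Encoding(Enum):
    ARROW = auto()           # server decodes the column to Arrow before sending
    PARQUET_NATIVE = auto()  # column sent in its at-rest Parquet encoding
    NIMBLE_NATIVE = auto()   # column sent in its at-rest Nimble encoding


@dataclass
class ColumnOffer:
    name: str
    # Encodings the server is willing to send this column in; ARROW is
    # always present as a fallback when the type is representable in Arrow.
    offered: list[Encoding]


@dataclass
class ColumnRequest:
    name: str
    encoding: Encoding
    # True: the server applies the scan operator (filter/projection) before
    # sending. False: the client applies it after decoding stripes/blocks.
    server_side_scan: bool


def choose(offer: ColumnOffer,
           scannable_by_client: set[Encoding],
           prefer_low_egress: bool) -> ColumnRequest:
    """Pick the transfer encoding and scan placement for one column."""
    for enc in offer.offered:
        if enc is not Encoding.ARROW and enc in scannable_by_client:
            # Best case: the client can scan the at-rest encoding directly,
            # so the encoded/compressed bytes cross the wire untouched.
            return ColumnRequest(offer.name, enc, server_side_scan=False)
    at_rest = [e for e in offer.offered if e is not Encoding.ARROW]
    if prefer_low_egress and at_rest:
        # The client cannot scan the at-rest encoding, but still pulls it
        # compressed to save egress, decodes locally, and applies the scan
        # generically to each decoded stripe/block (more client CPU/memory).
        return ColumnRequest(offer.name, at_rest[0], server_side_scan=False)
    # Fallback: ask the server to decode to Arrow (and scan) on its side.
    return ColumnRequest(offer.name, Encoding.ARROW, server_side_scan=True)


# Example: a client that can only scan Arrow, but prefers cheap egress.
offer = ColumnOffer("user_id", [Encoding.PARQUET_NATIVE, Encoding.ARROW])
print(choose(offer, scannable_by_client={Encoding.ARROW}, prefer_low_egress=True))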