Hello,
I'm perplexed by this discussion. If you want to send highly-compressed files over the network, that is already possible: just send Parquet files via HTTP(S) (or another protocol of your choice). Arrow Flight is simply a *streaming* protocol that allows sending/requesting the Arrow format over gRPC. The Arrow format was never meant to be high-compression, as it is primarily an in-memory format and only secondarily a transfer/storage format. I don't think it's reasonable to enlarge the scope of Arrow to something else.

Regards

Antoine.

On Tue, 10 Sep 2024 01:29:05 +0800 Roman Leventov <leven...@apache.org> wrote:

> Hello Arrow community,
>
> I've recently proposed a family of "table transfer protocols": Table Read,
> Table Write, and Table Replication protocols (provisional names) that I
> think can become a full-standing alternative to Iceberg for interop between
> diverse databases/data warehouses and processing and ML engines:
> https://engineeringideas.substack.com/p/table-transfer-protocols-improved.
>
> It appears to me that the "convergence on Iceberg" doesn't really serve
> users, because competition in the storage layer is inhibited and everyone
> uses the "lowest common denominator" format, which has limitations. See
> more details about this argument at the link above.
>
> But why, then, is everyone so excited about Iceberg, pushing databases to
> add Iceberg catalogs (essentially turning themselves into processing
> engines) in addition to their native storage formats?
> 1) Cheap storage -- not due to Iceberg per se, but due to object storage.
> Snowflake, Redshift, Firebolt, Databend, and other DBs have their own
> object-storage-first formats, too, and many other DBs can be configured
> with aggressive tiering to object storage.
> 2) Interop with different processing engines -- Arrow Flight provides
> this, too.
> 3) Ability to lift and move data to another DB or cloud -- this is fair,
> but the mere ability of the DB to write Iceberg permits the data team to
> convert and move *if* they decide to do it; until then, why wouldn't they
> use the DB's native format, which is better optimised for that DB, its
> ingestion pattern, etc.?
> 4) *Parallel writing* -- a processing engine's nodes can write Parquet
> files completely in parallel within an Iceberg transaction; Arrow Flight
> doesn't permit this.
> 5) *Reading compressed data* from object storage into external processing
> engines -- Arrow Flight requires converting everything into Arrow before
> sending. If a processing job needs to pull many Parquet files/columns
> (even after filtering partitions), converting them all to Arrow before
> sending will, at a minimum, increase the job latency quite a bit, and, at
> a maximum, cost much more money in egress.
>
> I cannot think of any other substantial reason (when compared with DBs
> rather than with Hive tables). If this is the case, Arrow Flight is not
> yet a full-standing alternative to Iceberg catalogs, for these three
> reasons:
> 1) *No parallel writes*;
> 2) *Inability to send a table's column data in its on-disk compressed
> format* (rather than Arrow); and
> 3) A few other provisions in the protocol design that don't make it easy
> to control/balance the load of DB nodes, fail over to replica nodes,
> re-request a partition's endpoint from the server "in the same query"
> (that is, at the same consistency snapshot), etc.
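To make reason 2 just above (the same issue as point 5 earlier) concrete: with pyarrow's Flight API, do_get returns a stream of Arrow record batches, so a server fronting Parquet files must decode and decompress them server-side before anything crosses the wire, whereas plain HTTP(S) would ship the compressed Parquet bytes untouched. A minimal sketch; the path-in-ticket scheme and the port are purely illustrative:

    import pyarrow.flight as flight
    import pyarrow.parquet as pq

    class ParquetFlightServer(flight.FlightServerBase):
        """Serves Parquet files as streams of Arrow record batches."""

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)

        def do_get(self, context, ticket):
            # The ticket carries a file path (illustrative scheme only).
            path = ticket.ticket.decode("utf-8")
            # Parquet is decoded into Arrow *here*, on the server; the
            # wire then carries Arrow batches, not the compressed file.
            table = pq.read_table(path)
            return flight.RecordBatchStream(table)

    if __name__ == "__main__":
        ParquetFlightServer().serve()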
> These are the things that I tried to address in the Table Read
> <https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough>
> and Table Write
> <https://engineeringideas.substack.com/i/148485754/use-cases-for-table-write-protocol>
> protocols.
>
> So, I think the vendors of DBs like ClickHouse, Databend, Druid, Doris,
> InfluxDB, StarRocks, SingleStore, and others might be interested in these
> new protocols as a way to gain a foothold, because they would let these
> DBs compete on storage formats and indexing/ingestion approaches, rather
> than be reduced to "processing engines for Iceberg", where there are
> already far too many other options.
>
> Specialised and "challenger" processing engine vendors might be
> interested, too, as the table transfer protocols that I propose are a
> kind of "Arrow Flight++"; maybe there is room for something like
> "DataFusion Ballista++", where table transfer protocols are used as
> source and sink protocols instead of Arrow Flight.
>
> I'm interested in developing these protocols, but it requires some
> preliminary interest from vendors to validate the idea. Other feedback is
> welcome, too.
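P.S. On point 4 in the quoted list, here is a rough sketch of what "parallel writing within an Iceberg transaction" looks like, assuming pyiceberg >= 0.6, an already-configured catalog, an existing table whose schema matches the shards, and illustrative names/paths throughout. Workers write their Parquet files with no coordination; a single metadata commit then makes all of them visible atomically:

    from concurrent.futures import ThreadPoolExecutor

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyiceberg.catalog import load_catalog

    def write_shard(i: int) -> str:
        # Each node/worker writes its own file independently.
        path = f"s3://bucket/staging/shard-{i}.parquet"  # illustrative bucket
        shard = pa.table({"id": pa.array(range(i * 1000, (i + 1) * 1000))})
        pq.write_table(shard, path)
        return path

    with ThreadPoolExecutor() as pool:
        paths = list(pool.map(write_shard, range(8)))

    # One atomic Iceberg commit registers every file: readers see
    # either all eight shards or none of them.
    catalog = load_catalog("demo")         # assumes a configured catalog
    tbl = catalog.load_table("db.events")  # assumes the table exists
    tbl.add_files(file_paths=paths)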