Thanks for starting this conversation, Roman. It seems like one underlying assumption in the blog post [1] is that every query on data stored in Iceberg (or another table format) will directly access object storage, which is inefficient and slow.

While the need to access the object store is certainly true on first access, I personally think another interesting angle is OLAP systems that accelerate access to data stored in table formats using various indexes and caches. There is no reason a system can't provide results within milliseconds of data being written (as we do in InfluxDB 3.0), even if that data eventually lands in object storage. This is achieved, unsurprisingly, by keeping metadata and data closer to the compute in some combination of node-local and cluster-local caches. I believe this is also, at a high level, how Snowflake works.
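To illustrate what I mean by "closer to the compute", here is a minimal sketch of a cache-accelerated read path (all the names here are hypothetical, not InfluxDB 3.0 or Snowflake internals):

    # Hypothetical sketch: only the first access pays the object-store round
    # trip; subsequent reads are served from a node-local cache.
    import os

    class CachedTableReader:
        def __init__(self, object_store, cache_dir):
            self.object_store = object_store  # assumed S3-like client with a .get(key) method
            self.cache_dir = cache_dir

        def read_file(self, key: str) -> bytes:
            local_path = os.path.join(self.cache_dir, key.replace("/", "_"))
            if os.path.exists(local_path):
                # Hot path: cache hit, no object-store round trip.
                with open(local_path, "rb") as f:
                    return f.read()
            # Cold path: fetch from object storage once, then cache locally.
            data = self.object_store.get(key)
            os.makedirs(self.cache_dir, exist_ok=True)
            with open(local_path, "wb") as f:
                f.write(data)
            return data

The same idea applies to table metadata (snapshots, manifests), not just data files.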
Thus I disagree with the premise that Iceberg and other table formats are not the future of OLAP. Instead, I see them as the foundational layer (mostly because they avoid data gravity / vendor lock-in) on which a new ecosystem of OLAP tools will be developed.

A more efficient / performant version of Arrow Flight (for table read / write) does sound interesting, but I think many vendors will likely implement a custom proprietary protocol for communicating between their own services if Flight isn't good enough.
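For concreteness, this is roughly what a table read over Arrow Flight looks like with pyarrow today (the server address and ticket contents below are made up). The stream always arrives as Arrow record batches, which is the conversion cost that point 5 in your mail refers to:

    import pyarrow.flight as flight

    # Made-up endpoint; a real client would typically call get_flight_info()
    # first and follow the endpoints/tickets it returns.
    client = flight.connect("grpc://flight.example.com:8815")
    reader = client.do_get(flight.Ticket(b"SELECT * FROM my_table"))

    # The server has to convert its on-disk format (e.g. Parquet) into Arrow
    # record batches before sending; read_all() collects them into a table.
    table = reader.read_all()
    print(table.num_rows)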
Just my thoughts,
Andrew

[1] https://engineeringideas.substack.com/p/the-future-of-olap-table-storage

On Mon, Sep 9, 2024 at 1:29 PM Roman Leventov <leven...@apache.org> wrote:

> Hello Arrow community,
>
> I've recently proposed a family of "table transfer protocols": Table Read,
> Table Write, and Table Replication protocols (provisional names) that I
> think can become a full-standing alternative to Iceberg for interop between
> diverse databases/data warehouses and processing and ML engines:
> https://engineeringideas.substack.com/p/table-transfer-protocols-improved.
>
> It appears to me that the "convergence on Iceberg" doesn't really serve
> users, because competition in the storage layer is inhibited and everyone
> is using the "lowest common denominator" format, which has limitations.
> See more details about this argument at the link above.
>
> But why then is everyone so excited about Iceberg, pushing databases to
> add Iceberg catalogs (essentially turning themselves into processing
> engines) in addition to their native storage formats?
> 1) Cheap storage -- not due to Iceberg per se, but due to object storage.
> Snowflake, Redshift, Firebolt, Databend, and other DBs have their own
> object storage-first storage formats, too, and many other DBs can be
> configured with aggressive tiering to object storage.
> 2) Interop with different processing engines -- Arrow Flight provides
> this, too.
> 3) Ability to lift and move data to another DB or cloud -- this is fair,
> but the mere ability of the DB to write Iceberg permits the data team to
> convert and move *if* they decide to do it; until then, why wouldn't they
> use the DB's native format, which is more optimised for this DB, the
> ingestion pattern, etc.?
> 4) *Parallel writing* -- a processing engine's nodes can write Parquet
> files completely in parallel within an Iceberg transaction; Arrow Flight
> doesn't permit this.
> 5) *Reading compressed data* from object storage into external processing
> engines -- Arrow Flight requires converting everything into Arrow before
> sending. If the processing job needs to pull a lot of Parquet
> files/columns (even after filtering partitions), converting them all to
> Arrow before sending will, at a minimum, increase the job latency quite a
> bit, and at a maximum, cost much more money in egress.
>
> I cannot think of any other substantial reason (when compared with DBs
> rather than with Hive tables).
>
> If this is the case, Arrow Flight is not yet a full-standing alternative
> to Iceberg catalogs, for three reasons:
> 1) *No parallel writes*,
> 2) *Inability to send table column data in the on-disk compressed formats*
> (rather than Arrow), and
> 3) A few other provisions in the protocol design that don't make it easy
> to control/balance the load of DB nodes, fail over to replica nodes,
> re-request a partition's endpoint from the server "in the same query"
> (that is, at the same consistency snapshot), etc.
>
> These are the things that I tried to address in the Table Read
> <https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough>
> and Table Write
> <https://engineeringideas.substack.com/i/148485754/use-cases-for-table-write-protocol>
> protocols.
>
> So, I think the vendors of DBs like ClickHouse, Databend, Druid, Doris,
> InfluxDB, StarRocks, SingleStore, and others might be interested in these
> new protocols to gain a foothold, because they would enable these DBs to
> compete on storage formats and indexing/ingestion approaches, rather than
> be reduced to "processing engines for Iceberg", where there are already
> far too many other options.
>
> Specialised and "challenger" processing engine vendors might be
> interested, too, as the table transfer protocols I proposed are a kind of
> "Arrow Flight++"; maybe there is room for something like "DataFusion
> Ballista++" where table transfer protocols are used as source and sink
> protocols instead of Arrow Flight.
>
> I'm interested in developing these protocols, but it requires some
> preliminary interest from vendors to validate the idea. Other feedback is
> welcome, too.
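To make point 4 above concrete, the pattern being described looks roughly like the sketch below: workers write Parquet files with no coordination, and a single atomic commit publishes them. The `commit_files` call is a made-up stand-in for an Iceberg append/commit; the rest is plain pyarrow:

    from concurrent.futures import ThreadPoolExecutor

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_partition(i):
        # Each worker writes its own Parquet file fully in parallel,
        # with no coordination needed until commit time.
        path = f"/tmp/part-{i}.parquet"
        pq.write_table(pa.table({"id": list(range(i * 10, i * 10 + 10))}), path)
        return path

    with ThreadPoolExecutor(max_workers=4) as pool:
        paths = list(pool.map(write_partition, range(4)))

    # A single atomic commit then makes all the files visible at once.
    # `commit_files` is hypothetical, standing in for e.g. an Iceberg
    # AppendFiles operation against the catalog:
    # catalog.load_table("db.events").commit_files(paths)
    print("files to commit in one transaction:", paths)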