Hello,
I'm perplexed by this discussion. If you want to send highly-compressed files over the network, that is already possible: just send Parquet files via HTTP(S) (or another protocol of your choice). Arrow Flight is simply a *streaming* protocol that allows sending/requesting the Arrow format over gRPC. The Arrow format was never meant to be high-compression, as it is primarily an in-memory format and only secondarily a transfer/storage format. I don't think it's reasonable to enlarge the scope of Arrow to something else.

Regards

Antoine.

On Tue, 10 Sep 2024 01:29:05 +0800 Roman Leventov <leven...@apache.org> wrote:

> Hello Arrow community,
>
> I've recently proposed a family of "table transfer protocols": Table Read,
> Table Write, and Table Replication protocols (provisional names) that I
> think can become a full-standing alternative to Iceberg for interop between
> diverse databases/data warehouses and processing and ML engines:
> https://engineeringideas.substack.com/p/table-transfer-protocols-improved.
>
> It appears to me that the "convergence on Iceberg" doesn't really serve
> users, because competition in the storage layer is inhibited and everyone
> uses the "lowest common denominator" format, which has limitations. See
> more details about this argument at the link above.
>
> But why, then, is everyone so excited about Iceberg, pushing databases to
> add Iceberg catalogs (essentially turning themselves into processing
> engines) in addition to their native storage formats?
> 1) Cheap storage -- not due to Iceberg per se, but due to object storage.
> Snowflake, Redshift, Firebolt, Databend, and other DBs have their own
> object-storage-first formats, too, and many other DBs can be configured
> with aggressive tiering to object storage.
> 2) Interop with different processing engines -- Arrow Flight provides
> this, too.
> 3) Ability to lift and move data to another DB or cloud -- this is fair,
> but the mere ability of the DB to write Iceberg permits the data team to
> convert and move *if* they decide to do it; until then, why wouldn't they
> use the DB's native format, which is better optimised for that DB, its
> ingestion pattern, etc.?
> 4) *Parallel writing* -- a processing engine's nodes can write Parquet
> files completely in parallel within an Iceberg transaction; Arrow Flight
> doesn't permit this.
> 5) *Reading compressed data* from object storage into external processing
> engines -- Arrow Flight requires converting everything into Arrow before
> sending. If a processing job needs to pull many Parquet files/columns
> (even after filtering partitions), converting them all to Arrow before
> sending will, at a minimum, increase the job latency quite a bit, and, at
> a maximum, cost much more money in egress.
>
> I cannot think of any other substantial reason (when compared with DBs
> rather than with Hive tables). If this is the case, Arrow Flight is not
> yet a full-standing alternative to Iceberg catalogs, for these three
> reasons:
> 1) *No parallel writes*;
> 2) *Inability to send a table's column data in its on-disk compressed
> format* (rather than Arrow); and
> 3) A few other provisions in the protocol design that don't make it easy
> to control/balance the load of DB nodes, fail over to replica nodes,
> re-request a partition's endpoint from the server "in the same query"
> (that is, at the same consistency snapshot), etc.
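To make reason 2 just above (the same issue as point 5 earlier) concrete: with pyarrow's Flight API, do_get returns a stream of Arrow record batches, so a server fronting Parquet files must decode and decompress them server-side before anything crosses the wire, whereas plain HTTP(S) would ship the compressed Parquet bytes untouched. A minimal sketch; the path-in-ticket scheme and the port are purely illustrative:

    import pyarrow.flight as flight
    import pyarrow.parquet as pq

    class ParquetFlightServer(flight.FlightServerBase):
        """Serves Parquet files as streams of Arrow record batches."""

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)

        def do_get(self, context, ticket):
            # The ticket carries a file path (illustrative scheme only).
            path = ticket.ticket.decode("utf-8")
            # Parquet is decoded into Arrow *here*, on the server; the
            # wire then carries Arrow batches, not the compressed file.
            table = pq.read_table(path)
            return flight.RecordBatchStream(table)

    if __name__ == "__main__":
        ParquetFlightServer().serve()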
> These are the things that I tried to address in the Table Read
> <https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough>
> and Table Write
> <https://engineeringideas.substack.com/i/148485754/use-cases-for-table-write-protocol>
> protocols.
>
> So, I think the vendors of DBs like ClickHouse, Databend, Druid, Doris,
> InfluxDB, StarRocks, SingleStore, and others might be interested in these
> new protocols as a way to gain a foothold, because they would let these
> DBs compete on storage formats and indexing/ingestion approaches, rather
> than be reduced to "processing engines for Iceberg", where there are
> already far too many other options.
>
> Specialised and "challenger" processing engine vendors might be
> interested, too, as the table transfer protocols that I propose are a
> kind of "Arrow Flight++"; maybe there is room for something like
> "DataFusion Ballista++", where table transfer protocols are used as
> source and sink protocols instead of Arrow Flight.
>
> I'm interested in developing these protocols, but it requires some
> preliminary interest from vendors to validate the idea. Other feedback is
> welcome, too.
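P.S. On point 4 in the quoted list, here is a rough sketch of what "parallel writing within an Iceberg transaction" looks like, assuming pyiceberg >= 0.6, an already-configured catalog, an existing table whose schema matches the shards, and illustrative names/paths throughout. Workers write their Parquet files with no coordination; a single metadata commit then makes all of them visible atomically:

    from concurrent.futures import ThreadPoolExecutor

    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyiceberg.catalog import load_catalog

    def write_shard(i: int) -> str:
        # Each node/worker writes its own file independently.
        path = f"s3://bucket/staging/shard-{i}.parquet"  # illustrative bucket
        shard = pa.table({"id": pa.array(range(i * 1000, (i + 1) * 1000))})
        pq.write_table(shard, path)
        return path

    with ThreadPoolExecutor() as pool:
        paths = list(pool.map(write_shard, range(8)))

    # One atomic Iceberg commit registers every file: readers see
    # either all eight shards or none of them.
    catalog = load_catalog("demo")         # assumes a configured catalog
    tbl = catalog.load_table("db.events")  # assumes the table exists
    tbl.add_files(file_paths=paths)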