Hello,
I'm perplexed by this discussion. If you want to send highly-compressed
files over the network, that is already possible: just send Parquet
files via HTTP(S) (or another protocol of your choice).
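For illustration, here is a minimal sketch of that approach in Python
with pyarrow and requests (the URL is hypothetical): the bytes stay
Parquet-compressed on the wire, and decoding happens only on the client.

    import io

    import pyarrow.parquet as pq
    import requests

    # Hypothetical endpoint serving a plain Parquet file.
    url = "https://example.com/data/part-0001.parquet"

    resp = requests.get(url)
    resp.raise_for_status()

    # The payload is Parquet end to end; conversion to Arrow
    # happens only here, on the receiving side.
    table = pq.read_table(io.BytesIO(resp.content))
    print(table.num_rows, table.schema)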
Arrow Flight is simply a *streaming* protocol that allows
sending/requesting the Arrow format over gRPC. The Arrow format was
never meant to be high-compression as it's primarily an in-memory
format, and only secondarily a transfer/storage format.
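For comparison, a minimal Flight read in Python could look roughly like
this (server location and dataset path are hypothetical; for simplicity
it assumes each endpoint points back at the same server):

    import pyarrow.flight as flight

    # Hypothetical Flight server.
    client = flight.connect("grpc://localhost:8815")

    # Ask the server how to fetch the dataset, then stream it
    # back as Arrow record batches.
    descriptor = flight.FlightDescriptor.for_path("my_table")
    info = client.get_flight_info(descriptor)

    for endpoint in info.endpoints:
        # In general one would connect to endpoint.locations;
        # reusing the same client keeps the sketch short.
        reader = client.do_get(endpoint.ticket)
        table = reader.read_all()
        print(table.num_rows)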
I don't think it's reasonable to enlarge the scope of Arrow to
something else.
Regards
Antoine.
On Tue, 10 Sep 2024 01:29:05 +0800
Roman Leventov <leven...@apache.org> wrote:
Hello Arrow community,
I've recently proposed a family of "table transfer protocols": Table Read,
Table Write, and Table Replication protocols (provisional names) that I
think can become a fully fledged alternative to Iceberg for interop between
diverse databases/data warehouses and processing and ML engines:
https://engineeringideas.substack.com/p/table-transfer-protocols-improved.
It appears to me that the "convergence on Iceberg" doesn't really serve
users, because it inhibits competition in the storage layer and pushes
everyone toward a "lowest common denominator" format with limitations.
See more details about this argument at the link above.
But why, then, is everyone so excited about Iceberg and pushing databases
to add Iceberg catalogs (essentially turning themselves into processing
engines) in addition to their native storage formats?
1) Cheap storage -- not due to Iceberg per se, but due to object storage.
Snowflake, Redshift, Firebolt, Databend, and other DBs have their own
object-storage-first storage formats, too, and many other DBs can be
configured with aggressive tiering to object storage.
2) Interop with different processing engines -- Arrow Flight provides
this, too.
3) Ability to lift and move data to another DB or cloud -- this is fair,
but the mere ability of the DB to write Iceberg permits the data team to
convert and move *if* they decide to do it; before that, why wouldn't
they use the DB's native format, which is more optimised for this DB,
the ingestion pattern, etc.?
4) *Parallel writing* -- a processing engine's nodes can write Parquet
files completely in parallel within an Iceberg transaction; Arrow Flight
doesn't permit this (see the first sketch after this list).
5) *Reading compressed data* from object storage to external processing
engines -- Arrow Flight requires converting everything into Arrow before
sending. If the processing job needs to pull a lot of Parquet
files/columns (even after filtering partitions), converting them all to
Arrow before sending will at minimum increase the job latency quite a
bit, and at maximum will cost much more money for egress (see the second
sketch after this list).
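To illustrate point 4, here is a rough sketch (shard contents and staging
paths are hypothetical) of the write pattern Iceberg enables: each node
writes its Parquet files with no coordination, and only the final
metadata commit is serialized.

    import concurrent.futures as futures

    import pyarrow as pa
    import pyarrow.parquet as pq

    def write_shard(shard_id: int) -> str:
        # Each worker writes its own Parquet file independently;
        # no coordination is needed until the commit.
        table = pa.table({"id": list(range(shard_id * 10, shard_id * 10 + 10))})
        path = f"shard-{shard_id}.parquet"  # hypothetical staging location
        pq.write_table(table, path)
        return path

    with futures.ThreadPoolExecutor() as pool:
        paths = list(pool.map(write_shard, range(4)))

    # An Iceberg transaction would now atomically register `paths`
    # in the table metadata (commit step elided here).
    print(paths)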
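And to make point 5 concrete, this sketch (file name hypothetical)
compares a Parquet file's on-disk size with the same data re-serialized
as an Arrow IPC stream, which approximates the extra bytes a Flight
transfer would put on the wire. (Arrow IPC can optionally apply LZ4/ZSTD
buffer compression, but general-purpose compression typically doesn't
match Parquet's encodings.)

    import io
    import os

    import pyarrow.ipc as ipc
    import pyarrow.parquet as pq

    path = "part-0001.parquet"  # hypothetical file pulled from object storage

    parquet_size = os.path.getsize(path)

    # Decode to Arrow and re-serialize as an uncompressed IPC stream,
    # roughly what a Flight do_get would send.
    table = pq.read_table(path)
    sink = io.BytesIO()
    with ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    arrow_size = sink.getbuffer().nbytes

    print(f"parquet: {parquet_size} bytes, arrow stream: {arrow_size} bytes")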
I cannot think of any other substantial reason (when compared with DBs
rather than with Hive tables). If this is the case, Arrow Flight is not
yet a fully fledged alternative to Iceberg catalogs, for these three
reasons:
1) *No parallel writes*
2) *Inability to send table column data in the on-disk compressed
formats* (rather than Arrow), and
3) A few other provisions in the protocol design that make it hard to
control/balance the load of DB nodes, to fail over to replica nodes, to
re-request a partition's endpoint from the server "in the same query"
(that is, at the same consistency snapshot), etc.
These are the things that I tried to address in the Table Read
<https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough>
and Table Write
<https://engineeringideas.substack.com/i/148485754/use-cases-for-table-write-protocol>
protocols.
So, I think the vendors of DBs like ClickHouse, Databend, Druid, Doris,
InfluxDB, StarRocks, SingleStore, and others might be interested in these
new protocols to gain a foothold, because the protocols would enable
these DBs to compete on storage formats and indexing/ingestion
approaches, rather than be reduced to "processing engines for Iceberg",
where there are already far too many other options.
Specialised and "challenger" processing engine vendors might be
interested, too, as the table transfer protocols that I proposed are a
kind of "Arrow Flight++"; maybe there is room for something like a
"DataFusion Ballista++" where table transfer protocols are used as
source and sink protocols instead of Arrow Flight.
I'm interested in developing these protocols, but doing so requires some
preliminary interest from vendors to validate the idea. Other feedback is
welcome, too.