Hi Roman,

I tried to skim your writing, at least on the read side. IIUC, you are
advocating for something that is, to some extent, a generalization of
Iceberg's REST catalog planning API [1], but instead of just Iceberg it
would encompass other formats, and it would also be more flexible about
negotiating which parts of the computation the client vs. the server is
responsible for? The main downside to this approach is that you end up
requiring client-side libraries for any potential scan operation (e.g.
Iceberg, Parquet, Nimble, etc.), or you need to negotiate with the
server about which tasks it supports. This is solvable, but it adds
complexity to deployments and makes operational debugging harder.
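
To make that concrete, here is a purely hypothetical sketch in Python
(the response shape, field names, and SUPPORTED_FORMATS set are invented
for illustration, not taken from any existing spec) of what a
planning-style API pushes onto the client: it has to dispatch each scan
task to a format-specific reader it ships, or fall back to negotiating
with the server.

# Hypothetical only: illustrates why the client needs a reader library
# per format (or a capability-negotiation step with the server).
import pyarrow.dataset as ds

# Imagined "plan scan" response, loosely modeled on Iceberg's REST
# planning API: the server hands back raw file-scan tasks.
plan_response = {
    "tasks": [
        {"format": "parquet", "uri": "s3://bucket/t/part-0.parquet"},
        {"format": "nimble", "uri": "s3://bucket/t/part-1.nimble"},
    ]
}

SUPPORTED_FORMATS = {"parquet"}  # formats this client can decode locally

for task in plan_response["tasks"]:
    if task["format"] in SUPPORTED_FORMATS:
        # Client-side decode: requires shipping a Parquet reader.
        table = ds.dataset(task["uri"], format="parquet").to_table()
    else:
        # Otherwise the client must negotiate for the server to do the
        # scan and ship Arrow instead -- the trade-off mentioned above.
        raise NotImplementedError(f"no local reader for {task['format']}")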


5) *Reading compressed data* from object storage to external processing
> engines -- Arrow Flight requires converting everything into Arrow before
> sending. If the processing job needs to pull a lot of Parquet
> files/columns (even after filtering partitions), converting them all to
> Arrow before sending will, at a minimum, increase the job latency quite a
> bit, and at a maximum cost much more money in egress.


The Dissociated IPC protocol [2] was a start in this direction, and I
believe the original proposal mentioned referencing Parquet files
(though that was removed).
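
For reference, the baseline that point 5 describes is the standard
Flight read path, where the server must re-encode whatever it stores on
disk into Arrow record batches before streaming. A minimal sketch (the
server address and query below are placeholders):

import pyarrow.flight as flight

# Placeholder server location and query.
client = flight.connect("grpc://warehouse.example.com:8815")
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(b"SELECT * FROM events")
)

tables = []
for endpoint in info.endpoints:
    # The server streams Arrow record batches, not the raw Parquet/Nimble
    # files, so everything is decoded and re-encoded before egress.
    reader = client.do_get(endpoint.ticket)
    tables.append(reader.read_all())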

I'm interested in developing these protocols, but it requires some
> preliminary interest from vendors to validate the idea. Other feedback is
> welcome, too


I think this is a good idea, but if you really want to build it, I would
not wait for vendors to validate it.  Arrow Flight/ADBC are still at the
vanguard of what most vendors support, and at least a few have chosen to
build something similar but more specific to their use cases.  Apache
xTable [3] is another project it might be worth discussing this approach
with.


[1]
https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L596
[2] https://arrow.apache.org/docs/format/DissociatedIPC.html
[3] https://xtable.apache.org/

On Mon, Sep 9, 2024 at 10:29 AM Roman Leventov <leven...@apache.org> wrote:

> Hello Arrow community,
>
> I've recently proposed a family of "table transfer protocols": Table Read,
> Table Write, and Table Replication protocols (provisional names) that I
> think can become a full-fledged alternative to Iceberg for interop between
> diverse databases/data warehouses and processing and ML engines:
> https://engineeringideas.substack.com/p/table-transfer-protocols-improved.
>
> It appears to me that the "convergence on Iceberg" doesn't really serve
> users, because competition in the storage layer is inhibited and everyone
> ends up using a "lowest common denominator" format that has limitations.
> See more details about this argument at the link above.
>
> But why, then, is everyone so excited about Iceberg and pushing databases
> to add Iceberg catalogs (essentially turning themselves into processing
> engines), in addition to their native storage formats?
>  1) Cheap storage -- not due to Iceberg per se, but due to object storage.
> Snowflake, Redshift, Firebolt, Databend, and other DBs have their own
> object storage-first storage formats, too, and many other DBs can be
> configured with aggressive tiering to object storage.
>  2) Interop with different processing engines -- Arrow Flight provides
> this, too
>  3) Ability to lift and move data to another DB or cloud -- this is fair,
> but the mere ability of the DB to write Iceberg permits the data team to
> convert and move *if* they decide to do it; before that, why wouldn't
> they use the DB's native format, which is more optimised for that DB, its
> ingestion pattern, etc.?
> 4) *Parallel writing* -- a processing engine's nodes can write Parquet
> files completely in parallel within an Iceberg transaction; Arrow Flight
> doesn't permit this
> 5) *Reading compressed data* from object storage to external processing
> engines -- Arrow Flight requires converting everything into Arrow before
> sending. If the processing job needs to pull a lot of Parquet
> files/columns (even after filtering partitions), converting them all to
> Arrow before sending will, at a minimum, increase the job latency quite a
> bit, and at a maximum cost much more money in egress.
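
Regarding point 4, the pattern is roughly the following (a hedged
sketch; the catalog name, table name, and paths are placeholders, and it
assumes PyIceberg's add_files API): workers write Parquet files
completely independently, and a single coordinator then makes one
Iceberg commit without any row data flowing through it.

import os
from concurrent.futures import ProcessPoolExecutor

import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog


def write_partition(i: int) -> str:
    """Each worker writes its Parquet file fully independently."""
    path = f"/tmp/events/part-{i}.parquet"  # placeholder path
    pq.write_table(pa.table({"id": [i]}), path)
    return path


if __name__ == "__main__":
    os.makedirs("/tmp/events", exist_ok=True)

    with ProcessPoolExecutor() as pool:
        paths = list(pool.map(write_partition, range(8)))

    # One coordinator registers all the files in a single Iceberg commit.
    table = load_catalog("default").load_table("db.events")  # placeholders
    table.add_files(file_paths=paths)
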
>
> I cannot think of any other substantial reason (when compared with DBs
> rather than with Hive tables). If that is the case, Arrow Flight is not
> yet a full-fledged alternative to Iceberg catalogs, for these three
> reasons:
>  1) *No parallel writes*
>  2) *Inability to send the table column data in the on-disk compressed
> formats* (rather than Arrow), and
>  3) A few other aspects of the protocol design that make it hard to
> control/balance the load on DB nodes, to fail over to replica nodes, to
> re-request a partition's endpoint from the server "in the same query"
> (that is, at the same consistency snapshot), etc.
>
> These are the things that I tried to address in Table Read
> <https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough>
> and Table Write
> <https://engineeringideas.substack.com/i/148485754/use-cases-for-table-write-protocol>
> protocols.
>
> So, I think the vendors of DBs like ClickHouse, Databend, Druid, Doris,
> InfluxDB, StarRocks, SingleStore, and others might be interested in these
> new protocols to gain a foothold, because they would enable these DBs to
> compete on storage formats and indexing/ingestion approaches, rather than
> be reduced to "processing engines for Iceberg", where there are already
> far too many other options.
>
> Specialised and "challenger" processing engine vendors might be interested,
> too, as the table transfer protocols that I proposed are a kind of "Arrow
> Flight++"; maybe there is room for something like "DataFusion Ballista++"
> where table transfer protocols are used as source and sink protocols
> instead of Arrow Flight.
>
> I'm interested in developing these protocols, but it requires some
> preliminary interest from vendors to validate the idea. Other feedback is
> welcome, too.
>
