[DISCUSS] Incubating Proposal for Paimon

Yu Li Thu, 23 Feb 2023 19:48:47 -0800

Hi All,


I would like to propose Paimon [1] as a new Apache Incubator project.
You can find more details in the proposal [2].


Paimon is a unified lake storage to build dynamic tables for both stream
and batch processing with big data compute engines (Apache Flink, Apache
Spark, Apache Hive, Trino, etc.), supporting high-speed data ingestion
and real-time data queries. With the adoption of stream processing in
production, there is an increasing demand for storage that simultaneously
supports updates, deletes and streaming reads, which existing lake
storages cannot fully satisfy. To tackle these new challenges, Paimon
natively adopts the LSM (Log-Structured Merge-tree) as its underlying
data structure, and provides enhanced performance for data with primary
keys (in addition to the common lake storage capabilities). What's more,
Paimon supports both batch and stream operations (reads and writes),
facilitating applications that pursue batch-stream-unified semantics.
Specifically:


1. Paimon provides excellent performance on update- and delete-intensive
workloads, leveraging the append-only writes of the LSM data structure.

2. Paimon uses the sorted order of the LSM to support efficient filter
pushdown, and can reduce the latency of queries with primary key
filtering to milliseconds.

3. Paimon supports various (row-based or row-columnar) file formats
including Apache Avro, Apache ORC and Apache Parquet (rows are sorted
by the primary key before being written out).

4. Tables provided by Paimon can be queried by various engines, including
Apache Flink, Apache Spark, Apache Hive, Trino, etc.

5. Paimon manages its own metadata, which is stored on the distributed
file system and can be synchronized to the Hive Metastore (HMS).

6. Besides the common batch read and write support, Paimon also supports
streaming read and change data feed.
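
To make points 1 to 3 a little more concrete, here is a toy sketch of the
general LSM idea. It is not Paimon's implementation, and every name in it
(ToyLsmTable, flush_threshold, TOMBSTONE, and so on) is hypothetical. It
only illustrates, under simplified assumptions, how updates and deletes
can be handled as appended records, and how sorting by primary key lets a
point lookup use binary search.

```python
import bisect

TOMBSTONE = object()  # hypothetical marker that records a delete


class ToyLsmTable:
    """Toy LSM-style table keyed by primary key (illustrative only)."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.memtable = {}  # in-memory buffer holding the newest writes
        self.runs = []      # immutable sorted runs (keys, values), oldest first

    def upsert(self, key, value):
        self.memtable[key] = value
        self._maybe_flush()

    def delete(self, key):
        # A delete is just another appended record, not an in-place edit.
        self.memtable[key] = TOMBSTONE
        self._maybe_flush()

    def _maybe_flush(self):
        if len(self.memtable) >= self.flush_threshold:
            # Rows are sorted by primary key before being written out.
            items = sorted(self.memtable.items(), key=lambda kv: kv[0])
            keys = [k for k, _ in items]
            values = [v for _, v in items]
            self.runs.append((keys, values))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        # The newest run wins; binary search relies on each run being sorted.
        for keys, values in reversed(self.runs):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is TOMBSTONE else values[i]
        return None

    def compact(self):
        # Merge runs oldest to newest so newer records overwrite older ones,
        # then drop tombstones and keep a single sorted run.
        merged = {}
        for keys, values in self.runs:
            merged.update(zip(keys, values))
        live = sorted(
            ((k, v) for k, v in merged.items() if v is not TOMBSTONE),
            key=lambda kv: kv[0],
        )
        self.runs = [([k for k, _ in live], [v for _, v in live])] if live else []


table = ToyLsmTable(flush_threshold=2)
table.upsert(1, "a")
table.upsert(2, "b")    # second write triggers a flush into a sorted run
table.upsert(1, "a2")   # update: appended, the older record is left untouched
table.delete(2)         # delete: appended as a tombstone, then flushed
```

In this sketch, writes never modify an existing run, so an update or a
delete costs the same as an insert; the merging work is deferred to
compact(). A real system would add persistence, multiple levels, and
range scans, none of which are modeled here.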


Paimon is used by a number of companies, including Alibaba, Bilibili
and ByteDance. Paimon is also integrated into Alibaba Cloud's
E-MapReduce and Realtime Compute products to provide cloud services.


Paimon was founded in the Flink community in 2022 under the name
"Flink Table Store". It has been developed for more than one year and
has produced four formal releases. As its adoption expands to more
computing engines, some users in the ecosystem have expressed concerns
about the neutrality of the project. This has led us to rethink the
positioning of Flink Table Store, which can be an independent lake
storage.


After thorough discussion, we have the support of the Flink community
to enter Apache incubation [3] [4], with the following expectations:

1. Expand Paimon's ecosystem, providing independent Java APIs to support
reading and writing from more big data engines such as Apache Doris,
Apache Hive, Presto, Apache Spark, Trino, etc.

2. Supplement key capabilities, especially streaming reads and intensive
updates/deletes, for creating a unified and easy-to-use streaming
data warehouse (lakehouse).

3. Grow into a more vibrant and neutral open source community.


We believe the Paimon project will provide tremendous value for the
community if it is accepted into the Apache Incubator.


I will serve as the champion of this project and mentor it together
with three other mentors (many thanks):


* Becket Qin (j...@apache.org)

* Robert Metzger (rmetz...@apache.org)

* Stephan Ewen (se...@apache.org)


Looking forward to your feedback. Thanks.


Best Regards,
Yu

[1] https://github.com/apache/flink-table-store

[2] https://cwiki.apache.org/confluence/display/INCUBATOR/PaimonProposal

[3] https://lists.apache.org/thread/2ybxfg3zrzn4l3tnq3w2w3xvkhk0f9jk

[4] https://lists.apache.org/thread/kn7c08cr4l0ynt551yfjqvzh5ns226r6
