Hi All,
I would like to propose Paimon [1] as a new Apache Incubator project. Please see the proposal [2] for more details.

Paimon is a unified lake storage for building dynamic tables for both stream and batch processing with big data compute engines (Apache Flink, Apache Spark, Apache Hive, Trino, etc.), supporting high-speed data ingestion and real-time data queries.

With the adoption of stream processing in production, there is an increasing demand for storage that simultaneously supports updates, deletes and streaming reads, which existing lake storages cannot fully satisfy. To tackle these new challenges, Paimon natively adopts the LSM (Log-Structured Merge-tree) as its underlying data structure and provides enhanced performance for data with primary keys (in addition to the common lake storage capabilities). What's more, Paimon supports both batch and stream operations (reads and writes), facilitating applications that pursue unified batch-stream semantics.

Specifically:

1. Paimon provides excellent performance on intensive update / delete workloads, leveraging the append-write feature of the LSM data structure.
2. Paimon utilizes the ordered feature of LSM to support effective filter pushdown, and can reduce the latency of queries with primary key filtering to milliseconds.
3. Paimon supports various (row-based or row-columnar) file formats including Apache Avro, Apache ORC and Apache Parquet (rows are sorted by the primary key before being written out).
4. Tables provided by Paimon can be queried by various engines, including Apache Flink, Apache Spark, Apache Hive, Trino, etc.
5. Paimon's metadata is self-managed, stored on the distributed file system, and can be synchronized to the Hive metastore (HMS).
6. Besides the common batch read and write support, Paimon also supports streaming reads and change data feeds.

Paimon has been used by various users and companies, including Alibaba, Bilibili, ByteDance and so on.
Paimon is also integrated into Alibaba Cloud's E-MapReduce and Realtime Compute products to provide cloud services.

Paimon was founded in the Flink community in 2022 under the name "Flink Table Store". It has been developed for more than one year and has produced 4 formal releases. As its adoption expands to more computing engines, some users in the ecosystem have expressed concerns about the neutrality of the project. This made us rethink the positioning of Flink Table Store, which can be an independent lake storage. After adequate discussion, we have received support from the Flink community to enter Apache incubation [3] [4], with the following expectations:

1. Expand Paimon's ecosystem, providing independent Java APIs to support reading and writing from more big data engines such as Apache Doris, Apache Hive, Presto, Apache Spark, Trino, etc.
2. Supplement key capabilities, especially streaming reads and intensive updates/deletes, for creating a unified and easy-to-use streaming data warehouse (lakehouse).
3. Grow into a more vibrant and neutral open source community.

We believe the Paimon project will provide tremendous value for the community if it is introduced into the Apache Incubator.

I will help this project as the champion and mentor the project together with three other mentors (many thanks):

* Becket Qin (j...@apache.org)
* Robert Metzger (rmetz...@apache.org)
* Stephan Ewen (se...@apache.org)

Looking forward to your feedback. Thanks.

Best Regards,
Yu

[1] https://github.com/apache/flink-table-store
[2] https://cwiki.apache.org/confluence/display/INCUBATOR/PaimonProposal
[3] https://lists.apache.org/thread/2ybxfg3zrzn4l3tnq3w2w3xvkhk0f9jk
[4] https://lists.apache.org/thread/kn7c08cr4l0ynt551yfjqvzh5ns226r6