Hi Nathan,

Thanks for the detailed information. Much appreciated.

I now have a better understanding of the goals. It looks interesting.

Happy to help as a mentor if you need.

Thanks!
Regards
JB

On Sat, Feb 24, 2024 at 6:24 AM nathan ma <majin1...@gmail.com> wrote:
>
> Hi JB,
>
> As co-creator of this project, I’d love to explain more about the
> positioning of a lakehouse management system.
>
> When discussing databases or traditional data warehouses, we often use
> the term DBMS (Database Management System) to describe them. Traditional
> databases, including MPP databases, are typically considered
> “out-of-the-box” solutions. Unlike big data systems, they don’t require
> separate components such as compute engines, data lake formats, or
> metadata stores. When we need a database management tool, lightweight
> options like Navicat are commonly used.
>
> If we further abstract the capabilities of a DBMS and map them to the
> modern data stack, we find that the data read/write part of a DBMS is now
> shared among different compute engines such as Spark, Flink, Trino, and
> cloud-native services like Athena. Another part of a DBMS deals with the
> maintenance of data files, index files, and metadata (also known as the
> information schema). Currently, there are successful open-source and
> commercial projects dedicated to managing metadata, such as Hive
> Metastore, Unity Catalog, and more recently, Gravitino. In practice,
> developers often combine these projects with compute engines to optimize
> data files. For example, many commercial compute engines include an
> optimize command.
>
> Amoro, as a lakehouse management system, aims to encapsulate the
> maintenance and management of data lake files, index files, and metadata
> in a way that is transparent and easy to use. The richness of diverse
> computing engines is a distinctive feature of the modern data stack,
> opening up a multitude of possibilities for various application
> scenarios.
> Additionally, concerning the part analogous to a DBMS, we aspire to have
> a mature system in place, one that seamlessly accommodates data written
> to the lakehouse by any engine, in any manner, ensuring high data
> availability across all other engines. For instance, when Flink writes to
> Iceberg, Amoro’s self-optimizing mechanism ensures efficient data analysis
> performance by Trino or other engines while controlling compaction costs.
> Amoro also handles historical data, snapshots, and orphan file cleanup in
> the background.
>
> By positioning Amoro in this way, we aim to provide an ‘out-of-the-box’
> experience that feels as straightforward as a traditional DBMS while
> remaining open to various computing engines. At the same time, Amoro
> hopes to empower data product builders with a lightweight solution that
> integrates seamlessly into their modern data workflows.
>
> Thanks.
> Jin Ma
>
> On 2024/02/23 14:16:43 Jean-Baptiste Onofré wrote:
> > Hi Justin
> >
> > Even if it looks interesting, I'm not sure I understand exactly the
> > purpose of the proposal.
> >
> > What does lakehouse management system mean exactly? Is it an
> > abstraction layer on top of Iceberg, Paimon + query engine powered by
> > Flink, Spark, Trino?
> >
> > Please let me know if you want an additional mentor, I would be happy
> > to help.
> >
> > Thanks!
> > Regards
> > JB
> >
> > On Fri, Feb 23, 2024 at 9:44 AM Justin Mclean <ju...@classsoftware.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a new project to the ASF incubator - Apache
> > > Amoro. I’m one of the mentors, but there are a lot of other people
> > > involved who have done all of the hard work.
> > >
> > > Amoro is a Lakehouse management system built on open data lake
> > > formats like Apache Iceberg and Apache Paimon (Incubating).
> > > Working with compute engines including Apache Flink, Apache Spark,
> > > and Trino, Amoro brings pluggable and self-managed features for
> > > Lakehouse to provide an out-of-the-box data warehouse experience, and
> > > helps data platforms or products easily build infra-decoupled,
> > > stream-and-batch-fused and lake-native architecture. You can find the
> > > proposal here [1].
> > >
> > > We are looking forward to anyone's feedback or questions.
> > >
> > > Thanks,
> > > Justin
> > >
> > > [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > For additional commands, e-mail: general-h...@incubator.apache.org