Re: Re: [DISCUSS] Apache Amoro proposal

Jean-Baptiste Onofré Fri, 01 Mar 2024 07:49:44 -0800

Hi Nathan,

Thanks for the detailed information. Much appreciated.


I now have a better understanding of the goals. It looks interesting.

Happy to help as a mentor if needed.

Thanks !
Regards
JB

On Sat, Feb 24, 2024 at 6:24 AM nathan ma <majin1...@gmail.com> wrote:
>
> Hi JB,
>
> As a co-creator of this project, I’d love to explain more about the
> positioning of a lakehouse management system.
>
> When discussing databases or traditional data warehouses, we often use the
> term DBMS (Database Management System) to describe them. Traditional
> databases, including MPP databases, are typically considered “out-of-the-box”
> solutions. Unlike big data systems, they don’t require separate components
> such as compute engines, data lake formats, or metadata stores. When we need a
> database management tool, lightweight options like Navicat are commonly
> used.
>
> If we further abstract the capabilities of a DBMS and map them to the
> modern data stack, we find that the data read/write part of a DBMS is now
> shared among different compute engines such as Spark, Flink, Trino, and
> cloud-native services like Athena. Another part of a DBMS deals with data
> files, index files, and metadata (also known as the information schema)
> maintenance. Currently, there are successful open-source and commercial
> projects dedicated to managing metadata, such as Hive Metastore,
> Unity Catalog, and more recently, Gravitino. In practice, developers often
> combine these projects with compute engines to optimize data files. For
> example, many commercial compute engines include an optimize command.
>
> Amoro, as a lakehouse management system, aims to encapsulate the
> maintenance and management of data lake files, index files, and metadata in
> a way that is transparent and easy to use. The richness of
> diverse computing engines is a distinctive feature of the modern data
> stack, opening up a multitude of possibilities for various application
> scenarios. For the part analogous to a DBMS, we aspire
> to have a mature system in place, one that seamlessly accommodates data
> written to the lakehouse by any engine, in any manner, and keeps that data
> highly available to all other engines. For instance, when Flink writes to
> Iceberg, Amoro’s self-optimizing mechanism keeps analysis by Trino or
> other engines efficient while controlling compaction costs.
> In addition, Amoro handles historical data, snapshots, and orphan file
> cleanup in the background.
>
> By positioning Amoro in this way, we aim to provide an ‘out-of-the-box’
> experience that feels as straightforward as a traditional DBMS while staying
> open to various computing engines. At the same time, Amoro hopes to
> empower data product builders with a lightweight solution that integrates
> seamlessly into their modern data workflows.
>
>
>
> Thanks.
> Jin Ma
>
>
> On 2024/02/23 14:16:43 Jean-Baptiste Onofré wrote:
> > Hi Justin
> >
> > Even if it looks interesting, I'm not sure I understand exactly the
> > purpose of the proposal.
> >
> > What does lakehouse management system mean exactly? Is it an abstraction
> > layer on top of Iceberg, Paimon + query engine powered by Flink,
> > Spark, Trino?
> >
> > Please let me know if you want an additional mentor; I would be happy to
> > help.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On Fri, Feb 23, 2024 at 9:44 AM Justin Mclean <ju...@classsoftware.com> wrote:
> > >
> > > Hi,
> > >
> > > I would like to propose a new project to the ASF incubator - Apache
> > > Amoro. I’m one of the mentors, but there are a lot of other people involved
> > > who have done all of the hard work.
> > >
> > > Amoro is a Lakehouse management system built on open data lake formats
> > > like Apache Iceberg and Apache Paimon (Incubating). Working with compute
> > > engines including Apache Flink, Apache Spark, and Trino, Amoro brings
> > > pluggable and self-managed features to the Lakehouse to provide an
> > > out-of-the-box data warehouse experience, and helps data platforms or
> > > products easily build infra-decoupled, stream-and-batch-fused, and
> > > lake-native architectures. You can find the proposal here. [1]
> > >
> > > We are looking forward to anyone's feedback or questions.
> > >
> > > Thanks,
> > > Justin
> > >
> > > [1] https://cwiki.apache.org/confluence/display/INCUBATOR/AmoroProposal
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > For additional commands, e-mail: general-h...@incubator.apache.org
> > >
> >
> >
> >

