Thanks Alex, please find my in-line answers:

1. When you say having both catalogs side by side, are you planning to have the same tables in both catalogs? This could be tricky consistency wise as writes would need only update one catalog.

*-- My understanding is that Nessie primarily integrates with Apache Iceberg, offering versioned metadata management specifically for Iceberg tables. Therefore, if we need to maintain non-Iceberg tables alongside Iceberg tables, the Hive Metastore would still be necessary. Please correct me if I am mistaken.*

2. The main reason to have a Nessie catalog would be for its catalog versioning features. If you do want these features why not just use Nessie in Spark with Hadoop as the store vs using Hive?

*-- Currently, there is limited documentation on using Nessie as the primary catalog (spark_catalog) in place of the Hive Metastore. While Nessie offers robust catalog versioning features, its integration with Spark and Hadoop as the primary store is not well documented.*

3. Nessie has OAuth/token and AWS authentication available, you can find more details in the docs at projectnessie.org. How would you want authentication to be handled?

*-- While projectnessie.org documents authentication methods such as OAuth/token and AWS, I didn't find specific information on authorization. I need to restrict user access to certain databases and tables, which requires fine-grained authorization enforced for every engine (Spark/Trino/Dremio). However, I'm uncertain whether Nessie supports integration with Apache Ranger for fine-grained access control.*

Are there any other tools you plan on accessing your tables with? Then catalog access of these tools should be considered as well.

*-- Our primary compute framework for data ingestion and semantic processing will be Spark. Additionally, we'll use Dremio/Trino for interactive data analytics.*

Another concern is that for Iceberg table maintenance with a Nessie catalog, I need to use both the Nessie-GC tool and Spark. This means maintaining two different frameworks for table optimization instead of relying solely on Spark.
Additionally, even with Spark, I have faced a lot of issues optimizing really large Iceberg tables. With only the Nessie-GC tool, I am really concerned about performance. Is there any performance benchmark that you can share?

Your insights would be greatly appreciated. Thanks in advance for your assistance.

Pani

On Mon, May 6, 2024 at 6:13 PM Alex Merced <alex.mer...@dremio.com.invalid> wrote:

> A few questions/thoughts:
>
> 1. When you say having both catalogs side by side, are you planning to
> have the same tables in both catalogs? This could be tricky consistency
> wise as writes would need only update one catalog.
>
> 2. The main reason to have a Nessie catalog would be for its catalog
> versioning features. If you do want these features why not just use Nessie
> in Spark with Hadoop as the store vs using Hive?
>
> 3. Nessie has OAuth/token and AWS authentication available, you can find
> more details in the docs at projectnessie.org. How would you want
> authentication to be handled?
>
> Are there any other tools you plan on accessing your tables with? Then
> catalog access of these tools should be considered as well.
>
> *Alex Merced <https://bio.alexmerced.com/data>*
> *Senior Tech Evangelist, Dremio*
>
>
> On Mon, May 6, 2024 at 6:40 PM Pani Dhakshnamurthy <dpa...@gmail.com>
> wrote:
>
>> Hello Everyone,
>> I'm currently planning to build a Lakehouse solution, leveraging Apache
>> Spark and Hive Metastore in my Hadoop workflows. I'm also exploring the
>> potential benefits of integrating Nessie as my catalog solution.
>>
>> With this setup, we'll have two catalogs, and we anticipate working with
>> non-Iceberg tables alongside Iceberg tables. I'm keen to understand if
>> Nessie offers any advantages over Hive Metastore for Iceberg tables and if
>> anyone has experience using Nessie alongside Hive Metastore in production
>> environments.
>>
>> Additionally, I'd like to know if there are any known issues or
>> challenges associated with having both solutions in place, especially
>> regarding Iceberg table maintenance. Also, could you provide insights into
>> the authentication and authorization mechanisms available for Nessie?
>>
>> Your insights and guidance on these points would be highly valuable.
>>
>> Thank you for your time and assistance!
>>
>> Best regards,
>> Pani
>>
>
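
For reference, the side-by-side setup discussed in this thread (Hive Metastore kept as the Spark session catalog for non-Iceberg tables, Nessie added as a second catalog for Iceberg tables) could be sketched roughly as below. This is only a minimal sketch under assumptions: the helper name `side_by_side_catalog_conf`, the catalog name `nessie`, the endpoint URL, and the warehouse path are all hypothetical placeholders, and the property keys should be checked against the Iceberg and Nessie versions actually deployed.

```python
from typing import Dict, Optional


def side_by_side_catalog_conf(
    nessie_uri: str,
    warehouse: str,
    ref: str = "main",
    bearer_token: Optional[str] = None,
) -> Dict[str, str]:
    """Build Spark properties for a Hive session catalog plus a Nessie catalog.

    Hypothetical helper: it only assembles a dict of settings and does not
    start Spark or contact Nessie.
    """
    conf = {
        # Keep the built-in session catalog (spark_catalog) on the Hive Metastore,
        # so existing non-Iceberg tables remain reachable.
        "spark.sql.catalogImplementation": "hive",
        # SQL extensions for Iceberg and for Nessie branch/tag statements.
        "spark.sql.extensions": ",".join(
            [
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
                "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
            ]
        ),
        # Second catalog, named "nessie" here (the name is an arbitrary choice).
        "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        "spark.sql.catalog.nessie.uri": nessie_uri,
        "spark.sql.catalog.nessie.ref": ref,  # branch or tag to read and write
        "spark.sql.catalog.nessie.warehouse": warehouse,
    }
    if bearer_token is None:
        conf["spark.sql.catalog.nessie.authentication.type"] = "NONE"
    else:
        # Token-based authentication; other types need different properties.
        conf["spark.sql.catalog.nessie.authentication.type"] = "BEARER"
        conf["spark.sql.catalog.nessie.authentication.token"] = bearer_token
    return conf


if __name__ == "__main__":
    # Placeholder endpoint and warehouse path, for illustration only.
    example = side_by_side_catalog_conf(
        "http://nessie.example.internal:19120/api/v1",
        "hdfs:///warehouse/iceberg",
    )
    for key in sorted(example):
        print(f"{key}={example[key]}")
```

Each pair can be passed to `SparkSession.builder.config(key, value)` or set with `--conf` on `spark-submit`, provided the Iceberg Spark runtime and the Nessie Spark extensions jars are on the classpath. Tables would then be addressed as `nessie.<namespace>.<table>` for Iceberg tables and through `spark_catalog` for everything else. Note that these properties cover authentication only; the table-level authorization question raised above is not addressed by them.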