Re: Exploring Nessie Integration for Iceberg Tables: Insights Needed

Pani Dhakshnamurthy Mon, 06 May 2024 19:49:53 -0700

Thanks Alex, please find my in-line answers:

1. When you say having both catalogs side by side, are you planning to have
the same tables in both catalogs? This could be tricky consistency wise as
writes would need only update one catalog.

*--  My understanding is that Nessie primarily integrates with Apache
Iceberg, offering versioned metadata management specifically for Iceberg
tables. Therefore, if there is a need to maintain non-Iceberg tables
alongside Iceberg tables, the Hive Metastore would still be necessary.
Please correct me if I am mistaken.*

2. The main reason to have a Nessie catalog would be for its catalog
versioning features. If you do want these features why not just use Nessie
in Spark with Hadoop as the store vs using Hive?

*-- Currently, there is limited documentation on using Nessie as the
primary catalog (spark_catalog) in place of the Hive Metastore. Nessie
offers robust catalog versioning features, but its integration with Spark,
with Hadoop as the primary store, is not well documented.*
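
For reference, a minimal sketch of what a side-by-side Spark configuration
could look like: the session catalog stays backed by the Hive Metastore and
a separate named catalog points at Nessie. The property keys follow the
Iceberg and Nessie Spark integration as commonly documented; the catalog
name "nessie", both URIs, and the warehouse path are placeholder
assumptions, not values from this thread.

```python
# Sketch only: catalog name, URIs and warehouse path are placeholder assumptions.

def side_by_side_catalog_conf(
    nessie_uri="http://localhost:19120/api/v1",
    nessie_ref="main",
    warehouse="hdfs:///warehouse/iceberg",
    hive_uri="thrift://localhost:9083",
):
    """Return Spark properties for a Hive-backed session catalog plus a Nessie catalog."""
    return {
        "spark.sql.extensions": ",".join([
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
        ]),
        # spark_catalog keeps serving the existing Hive (non-Iceberg) tables.
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
        "spark.sql.catalog.spark_catalog.uri": hive_uri,
        # A separate catalog named "nessie" holds the versioned Iceberg tables.
        "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.nessie.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        "spark.sql.catalog.nessie.uri": nessie_uri,
        "spark.sql.catalog.nessie.ref": nessie_ref,
        "spark.sql.catalog.nessie.warehouse": warehouse,
    }


conf = side_by_side_catalog_conf()
for key, value in sorted(conf.items()):
    # Each pair can be passed to spark-submit or to SparkSession.builder.config(key, value).
    print(f"--conf {key}={value}")
```

With a layout like this, Iceberg tables in Nessie would be addressed as
nessie.<namespace>.<table>, while unqualified names keep resolving through
spark_catalog and the Hive Metastore.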

3. Nessie has OAuth/token and AWS authentication available, you can find
more details in the docs at projectnessie.org. How would you want
authentication to be handled?

*-- While projectnessie.org provides documentation on authentication
methods such as OAuth/token and AWS, I didn't find specific information on
authorization. As for my requirements, I need to restrict user access to
certain databases and tables, which necessitates robust authorization
mechanisms (from Spark/Trino/Dremio). However, I'm uncertain whether Nessie
supports integration with Apache Ranger for fine-grained access control.*
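
On the authentication half only, a hedged sketch of attaching a bearer
token to a Nessie catalog from Spark. The authentication.type and
authentication.token keys are the ones the Nessie client configuration is
understood to accept; the catalog name "nessie" and the NESSIE_TOKEN
environment variable are hypothetical names chosen for this example.

```python
import os


def nessie_bearer_auth_conf(catalog="nessie", token=None):
    """Return Spark properties that attach a bearer token to a Nessie catalog."""
    # NESSIE_TOKEN is a hypothetical environment variable name used for this sketch.
    if token is None:
        token = os.environ.get("NESSIE_TOKEN", "")
    if not token:
        raise ValueError("no bearer token supplied")
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        f"{prefix}.authentication.type": "BEARER",
        f"{prefix}.authentication.token": token,
    }


auth_conf = nessie_bearer_auth_conf(token="example-token")
for key in sorted(auth_conf):
    print(key)
```

This only identifies the caller to the Nessie server. It does not by itself
restrict access to particular databases or tables, which is the open
authorization question above.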

Are there any other tools you plan on accessing your tables with? Then
catalog access of these tools should be considered as well.
*-- Our primary compute framework for data ingestion and semantic
processing will be Spark. Additionally, we'll leverage Dremio/Trino for
interactive data analytics.*

Another concern is that for Iceberg table maintenance with a Nessie
catalog, I need to use both the Nessie-GC tool and Spark. This means
maintaining two different frameworks for table optimization instead of
relying solely on Spark. Additionally, even with Spark, I have faced a lot
of issues optimizing really large Iceberg tables, so I am really concerned
about the performance of the Nessie-GC tool on tables of that size. Is
there any performance benchmark that you can share?
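
To make the Spark half of that split concrete, here is a small sketch that
builds the Iceberg stored-procedure calls for compacting one table. The
rewrite_data_files and rewrite_manifests procedures are part of Iceberg's
Spark extensions; the catalog name "nessie", the table name "db.events",
and the 512 MiB target size are illustrative assumptions. File expiry would
still be left to the Nessie-GC tool, as described above.

```python
def compaction_statements(catalog, table, target_file_size_bytes=512 * 1024 * 1024):
    """Build Iceberg Spark procedure calls that compact one table."""
    return [
        # Rewrite small data files into larger ones of roughly the target size.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"options => map('target-file-size-bytes', '{target_file_size_bytes}'))",
        # Rewrite manifests so that scan planning reads fewer metadata files.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


statements = compaction_statements("nessie", "db.events")
for statement in statements:
    # In a real job each statement would be run with spark.sql(statement).
    print(statement)
```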

Your insights would be greatly appreciated.

Thanks in advance for your assistance.
Pani

On Mon, May 6, 2024 at 6:13 PM Alex Merced <alex.mer...@dremio.com.invalid>
wrote:

> A few questions/thoughts:
>
> 1. When you say having both catalogs side by side, are you planning to
> have the same tables in both catalogs? This could be tricky consistency
> wise as writes would need only update one catalog.
>
> 2. The main reason to have a Nessie catalog would be for its catalog
> versioning features. If you do want these features why not just use Nessie
> in Spark with Hadoop as the store vs using Hive?
>
> 3. Nessie has OAuth/token and AWS authentication available, you can find
> more details in the docs at projectnessie.org. How would you want
> authentication to be handled?
>
> Are there any other tools you plan on accessing your tables with? Then
> catalog access of these tools should be considered as well.
>
> *Alex Merced <https://bio.alexmerced.com/data> *
> *Senior Tech Evangelist, Dremio **Dremio.com*
> <https://www.dremio.com/?utm_medium=email&utm_source=signature&utm_term=na&utm_content=email-signature&utm_campaign=email-signature>*/
> **Follow Us on LinkedIn!* <https://www.linkedin.com/company/dremio>
> *Resources for Getting Hands-on with Apache Iceberg/Dremio*
> <https://medium.com/data-engineering-with-dremio/a-deep-intro-to-apache-iceberg-and-resources-for-learning-more-be51535cff74>
>
>
> On Mon, May 6, 2024 at 6:40 PM Pani Dhakshnamurthy <dpa...@gmail.com>
> wrote:
>
>> Hello Everyone,
>> I'm currently planning to build a Lakehouse solution, leveraging Apache
>> Spark and Hive Metastore in my Hadoop workflows. I'm also exploring the
>> potential benefits of integrating Nessie as my catalog solution.
>>
>> With this setup, we'll have two catalogs, and we anticipate working with
>> non-Iceberg tables alongside Iceberg tables. I'm keen to understand if
>> Nessie offers any advantages over Hive Metastore for Iceberg tables and if
>> anyone has experience using Nessie alongside Hive Metastore in production
>> environments.
>>
>> Additionally, I'd like to know if there are any known issues or
>> challenges associated with having both solutions in place, especially
>> regarding Iceberg table maintenance. Also, could you provide insights into
>> the authentication and authorization mechanisms available for Nessie?
>>
>> Your insights and guidance on these points would be highly valuable.
>>
>> Thank you for your time and assistance!
>>
>> Best regards,
>> Pani
>>
>
