Hi Pani,

> My understanding is that Nessie primarily integrates with Apache Iceberg,
> offering versioned metadata management specifically for Iceberg tables

ATM - yes.

> if there is a need to maintain non-Iceberg tables alongside Iceberg
> tables, the Hive Metastore would still be necessary

Using the Hive Metastore for the non-Iceberg tables should work fine.
However, the same (Iceberg) table can only be "owned" by one catalog,
otherwise there are no consistency guarantees.
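
For illustration, a rough sketch of what that side-by-side setup could look
like in Spark configuration (the catalog name "nessie", host, bucket and
warehouse path are placeholders; the exact uri (/api/v1 vs /api/v2) and the
--packages you need depend on your Spark/Iceberg/Nessie versions):

  # built-in spark_catalog backed by the Hive Metastore for non-Iceberg tables
  spark.sql.catalogImplementation=hive
  # separate catalog so that Nessie "owns" the Iceberg tables
  spark.sql.catalog.nessie=org.apache.iceberg.spark.SparkCatalog
  spark.sql.catalog.nessie.catalog-impl=org.apache.iceberg.nessie.NessieCatalog
  spark.sql.catalog.nessie.uri=http://nessie-host:19120/api/v2
  spark.sql.catalog.nessie.ref=main
  spark.sql.catalog.nessie.warehouse=s3a://my-bucket/warehouse
  spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions

Note that the warehouse path and the object-store / HDFS credentials are
Spark/Iceberg settings here, not Nessie configuration.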

> While Nessie offers robust catalog versioning features, its integration
> with Spark and Hadoop as the primary store is not well-documented

https://projectnessie.org/iceberg/spark/

Does this page help? What would you like to be added / explained in more
details?
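
For a quick flavour of what that page covers: with the Nessie Spark SQL
extensions enabled, the usual Iceberg SQL gains branch/tag commands, roughly
like this (catalog alias "nessie", branch, namespace and table names are
made up):

  CREATE BRANCH IF NOT EXISTS etl IN nessie;
  USE REFERENCE etl IN nessie;
  CREATE NAMESPACE IF NOT EXISTS nessie.db;
  CREATE TABLE nessie.db.events (id BIGINT, kind STRING) USING iceberg;
  INSERT INTO nessie.db.events VALUES (1, 'click');
  -- make the changes visible on main once they look good
  MERGE BRANCH etl INTO main IN nessie;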

Hadoop itself is an opaque concept to Nessie; storage is an engine-side
concern. Nessie itself does not interact with the storage layer.

> I didn't find specific information on authorization.

The version of Nessie available out-of-the-box in OSS supports AuthZ based
on CEL expressions.

https://projectnessie.org/nessie-latest/authorization/#example-authorization-rules
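
As a rough illustration, the rules are just server configuration properties
holding CEL expressions, something along these lines (the rule names, role
and namespace are made up; please double-check the property prefix and
operation identifiers against the page above):

  nessie.server.authorization.enabled=true
  # let the 'analyst' role see references and list what is on them
  nessie.server.authorization.rules.analyst_can_browse=\
    op in ['VIEW_REFERENCE','READ_ENTRIES'] && role.startsWith('analyst')
  # ...but only read table metadata under the 'sales' namespace
  nessie.server.authorization.rules.analyst_reads_sales=\
    op=='READ_ENTITY_VALUE' && role.startsWith('analyst') && path.startsWith('sales.')

Note that these rules protect the catalog/metadata level; access to the
underlying data files still has to be controlled on the storage / engine
side.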

A custom AuthZ implementation is possible, but requires building a custom
Nessie server with AuthZ logic plugged in via the BatchAccessChecker
<https://github.com/projectnessie/nessie/blob/main/servers/services/src/main/java/org/projectnessie/services/authz/BatchAccessChecker.java>
SPI.

> I need to use both the Nessie-GC tool and Spark. This means maintaining
> two different frameworks for table optimization instead of relying solely
> on Spark

Nessie GC is not about table optimization; it reclaims unused files. The
Iceberg optimization Spark procedures are quite usable with Nessie.
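
To make the split concrete: compaction and similar optimizations stay in
Spark via the regular Iceberg procedures, e.g. (catalog alias "nessie" and
the table name are placeholders):

  CALL nessie.system.rewrite_data_files(table => 'db.events');
  CALL nessie.system.rewrite_manifests(table => 'db.events');

Nessie GC only covers the "expire old snapshots and delete files that
nothing references any more" part, because only Nessie knows which snapshots
are still reachable from some branch or tag.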

> I have faced a lot of issues optimizing really large Iceberg tables

It might be more convenient to use our Zulip chat for a more interactive
discussion of issues with using Nessie in practice. The join link is on the
main https://projectnessie.org/ page.

Cheers,
Dmitri.

On Mon, May 6, 2024 at 10:49 PM Pani Dhakshnamurthy <dpa...@gmail.com>
wrote:

> Thanks Alex, please find my in-line answers:
>
> 1. When you say having both catalogs side by side, are you planning to
> have the same tables in both catalogs? This could be tricky
> consistency-wise, as writes would only update one catalog.
>
>
> *--  My understanding is that Nessie primarily integrates with Apache
> Iceberg, offering versioned metadata management specifically for Iceberg
> tables. Therefore, if there is a need to maintain non-Iceberg tables
> alongside Iceberg tables, the Hive Metastore would still be necessary.
> Please correct me if I am mistaken.*
>
>
> 2. The main reason to have a Nessie catalog would be for its catalog
> versioning features. If you do want these features why not just use Nessie
> in Spark with Hadoop as the store vs using Hive?
>
>
> *-- Currently, there is limited documentation on utilizing Nessie as the
> primary catalog (spark_catalog) to replace the Hive metastore. While Nessie
> offers robust catalog versioning features, its integration with Spark and
> Hadoop as the primary store is not well-documented*
>
> 3. Nessie has OAuth/token and AWS authentication available, you can find
> more details in the docs at projectnessie.org. How would you want
> authentication to be handled?
>
>
> *-- While projectnessie.org <http://projectnessie.org> provides
> documentation on authentication methods such as OAuth/token and AWS, I
> didn't find specific information on authorization. As for my requirements,
> I need to restrict user access to certain databases and tables, which
> necessitates robust authorization mechanisms (from spark/trino/dremio).
> However, I'm uncertain whether Nessie supports integration with Apache
> Ranger for fine-grained access control.*
>
>
> Are there any other tools you plan on accessing your tables with? Then
> catalog access of these tools should be considered as well.
>  * -- Our primary compute framework for data ingestion and semantic
> processing will be Spark. Additionally, we'll leverage Dremio/Trino for
> interactive data analytics. *
>
>
> Another concern is that for Iceberg table maintenance with a Nessie
> catalog, I need to use both the Nessie-GC tool and Spark. This means
> maintaining two different frameworks for table optimization instead of
> relying solely on Spark. Additionally, even with Spark, I have faced a lot
> of issues optimizing really large Iceberg tables.
> With only the Nessie-GC tool, I am really concerned about performance. Is
> there any performance benchmark that you can share?
>
> Your insights would be greatly appreciated.
>
> Thanks in advance for your assistance.
> Pani
>
>
>
> On Mon, May 6, 2024 at 6:13 PM Alex Merced <alex.mer...@dremio.com.invalid>
> wrote:
>
>> A few questions/thoughts:
>>
>> 1. When you say having both catalogs side by side, are you planning to
>> have the same tables in both catalogs? This could be tricky
>> consistency-wise, as writes would only update one catalog.
>>
>> 2. The main reason to have a Nessie catalog would be for its catalog
>> versioning features. If you do want these features why not just use Nessie
>> in Spark with Hadoop as the store vs using Hive?
>>
>> 3. Nessie has OAuth/token and AWS authentication available, you can find
>> more details in the docs at projectnessie.org. How would you want
>> authentication to be handled?
>>
>> Are there any other tools you plan on accessing your tables with? Then
>> catalog access of these tools should be considered as well.
>>
>> *Alex Merced* <https://bio.alexmerced.com/data>
>> *Senior Tech Evangelist, Dremio* *Dremio.com*
>> <https://www.dremio.com/>
>> *Follow Us on LinkedIn!* <https://www.linkedin.com/company/dremio>
>> *Resources for Getting Hands-on with Apache Iceberg/Dremio*
>> <https://medium.com/data-engineering-with-dremio/a-deep-intro-to-apache-iceberg-and-resources-for-learning-more-be51535cff74>
>>
>>
>> On Mon, May 6, 2024 at 6:40 PM Pani Dhakshnamurthy <dpa...@gmail.com>
>> wrote:
>>
>>> Hello Everyone,
>>> I'm currently planning to build a Lakehouse solution, leveraging Apache
>>> Spark and Hive Metastore in my Hadoop workflows. I'm also exploring the
>>> potential benefits of integrating Nessie as my catalog solution.
>>>
>>> With this setup, we'll have two catalogs, and we anticipate working with
>>> non-Iceberg tables alongside Iceberg tables. I'm keen to understand if
>>> Nessie offers any advantages over Hive Metastore for Iceberg tables and if
>>> anyone has experience using Nessie alongside Hive Metastore in production
>>> environments.
>>>
>>> Additionally, I'd like to know if there are any known issues or
>>> challenges associated with having both solutions in place, especially
>>> regarding Iceberg table maintenance. Also, could you provide insights into
>>> the authentication and authorization mechanisms available for Nessie?
>>>
>>> Your insights and guidance on these points would be highly valuable.
>>>
>>> Thank you for your time and assistance!
>>>
>>> Best regards,
>>> Pani
>>>
>>
