Re: [Discuss] Replace Hadoop Catalog Examples with JDBC Catalog in Documentation

Kevin Liu Wed, 16 Oct 2024 11:41:32 -0700

Hey folks,

Thanks for the discussions.

It seems everyone is in favor of replacing the Hadoop catalog example, and
the question now is whether to replace it with the JDBC catalog or the REST
catalog.

I originally proposed the JDBC catalog as a replacement primarily due to
its ease of use. Users can quickly set up a JDBC catalog backed by an
in-memory or file-based datastore without needing additional
infrastructure. It also aligns with the quick-start ethos of "it just
works." That said, I agree that an example of setting up the REST catalog
should be part of the getting-started guide since it’s the catalog the
community has aligned on.

Here's what I propose as a middle ground:

   1. We replace the Hadoop catalog example with a JDBC catalog backed by
   an in-memory datastore. This allows users to get started without needing
   additional infrastructure, which was one of the main benefits of the Hadoop
   catalog.
   2. We add a new section describing the REST catalog, its benefits, and
   how to set one up. We can use the REST catalog adapter [1], with the
   adapter using the JDBC catalog as its internal catalog.

This approach gives users a way to quickly prototype while also guiding
them toward the REST catalog for production use cases.
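
To make the two setups concrete, here is a minimal sketch (not the final docs wording) of the Spark catalog properties each one would need. The helper names, the catalog names `local` and `rest`, the SQLite file path, the warehouse path, and the `http://localhost:8181` endpoint are all assumptions for illustration; the JDBC example also assumes a SQLite JDBC driver is on the Spark classpath.

```python
# Sketch only: builds the Spark properties for an Iceberg catalog.
# Helper names, paths, and the REST endpoint are illustrative assumptions.

SPARK_CATALOG_IMPL = "org.apache.iceberg.spark.SparkCatalog"


def jdbc_catalog_conf(name="local",
                      uri="jdbc:sqlite:/tmp/iceberg_catalog.db",
                      warehouse="/tmp/iceberg_warehouse"):
    """Properties for a JDBC catalog backed by a local SQLite file."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: SPARK_CATALOG_IMPL,
        f"{prefix}.type": "jdbc",
        f"{prefix}.uri": uri,              # JDBC connection string
        f"{prefix}.warehouse": warehouse,  # where table data and metadata live
    }


def rest_catalog_conf(name="rest", uri="http://localhost:8181"):
    """Properties for a REST catalog; the endpoint is a placeholder."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: SPARK_CATALOG_IMPL,
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,              # base URL of the REST catalog server
    }


def to_spark_flags(conf):
    """Render properties as `--conf key=value` arguments for spark-shell/spark-sql."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


if __name__ == "__main__":
    print(" \\\n  ".join(to_spark_flags(jdbc_catalog_conf())))
    print(" \\\n  ".join(to_spark_flags(rest_catalog_conf())))
```

The two configurations differ only in `type` and `uri` (plus the warehouse location for JDBC), so moving a reader from the quick-start catalog to the REST catalog would be a small change on the Spark side.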

Looking forward to hearing more from you all.

Best,

Kevin Liu

[1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq

On Thu, Oct 10, 2024 at 3:44 AM Eduard Tudenhöfner <etudenhoef...@apache.org>
wrote:

> I would prefer to advocate for the REST catalog in those examples/docs
> (similar to how the Spark quickstart example
> <https://iceberg.apache.org/spark-quickstart/> uses the REST catalog).
> The docs could then refer to the quickstart example to indicate what's
> required in terms of services to be started before a user can spawn a spark
> shell.
>
> On Thu, Oct 10, 2024 at 12:15 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
>> Hi
>>
>> As we are talking about "documentation" (quick start/readme), I would
>> rather propose to use the REST catalog here instead of JDBC.
>>
>> As it's the catalog we "promote", I think it would be valuable for
>> users to start with the "right thing".
>>
>> The JDBC catalog is interesting for quick tests and getting-started
>> guides, but we know how it goes: it will be heavily used (see what happened
>> with the HadoopCatalog, used in production whereas it should not be :) ).
>>
>> Regards
>> JB
>>
>> On Tue, Oct 8, 2024 at 12:18 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I wanted to bring up a suggestion regarding our current documentation.
>> The existing examples for Iceberg often use the Hadoop catalog, as seen in:
>> >
>> > Adding a Catalog - Spark Quickstart [1]
>> > Adding Catalogs - Spark Getting Started [2]
>> >
>> > Since we generally advise against using Hadoop catalogs in production
>> environments, I believe it would be beneficial to replace these examples
>> with ones that use the JDBC catalog. The JDBC catalog, configured with a
>> local SQLite database file, offers similar convenience but aligns better
>> with production best practices.
>> >
>> > I've created an issue [3] and a PR [4] to address this. Please take a
>> look, and I'd love to hear your thoughts on whether this is a direction we
>> want to pursue.
>> >
>> > Best,
>> > Kevin Liu
>> >
>> > [1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
>> > [2]
>> https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
>> > [3] https://github.com/apache/iceberg/issues/11284
>> > [4] https://github.com/apache/iceberg/pull/11285
>> >
>>
>
