Hey folks,

Happy New Year! I want to bump this thread with the refreshed PR #11845 <https://github.com/apache/iceberg/pull/11845>. I've applied the recommendations from this thread. The PR replaces the Hadoop catalog examples in the Getting Started pages with the JDBC catalog, and adds an example of configuring the REST catalog.
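For anyone who wants to try it locally before reviewing, the setup is roughly along these lines. This is only a sketch of the kind of configuration involved, not copied from the PR; the artifact versions, catalog names, URIs, and paths are placeholders:

from pyspark.sql import SparkSession

# Sketch only: versions, catalog names, and paths below are placeholders.
spark = (
    SparkSession.builder
    .appName("iceberg-catalog-demo")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1,"
            "org.xerial:sqlite-jdbc:3.46.0.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # JDBC catalog backed by a local SQLite file: no extra services needed,
    # and (unlike an in-memory database) tables survive a restart.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "jdbc")
    .config("spark.sql.catalog.local.uri", "jdbc:sqlite:/tmp/iceberg_catalog.db")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg_warehouse")
    # REST catalog: assumes a REST catalog service is already running, e.g. the
    # docker-compose setup from the Spark quickstart listening on port 8181.
    .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest.type", "rest")
    .config("spark.sql.catalog.rest.uri", "http://localhost:8181")
    .getOrCreate()
)

# Smoke test against the JDBC catalog; queries against "rest" work the same
# way once the REST service is up.
spark.sql("CREATE NAMESPACE IF NOT EXISTS local.db")
spark.sql("CREATE TABLE IF NOT EXISTS local.db.demo (id BIGINT, data STRING) USING iceberg")
spark.sql("SHOW TABLES IN local.db").show()

Using a file-backed SQLite database rather than an in-memory one follows Marc's earlier suggestion below, so tables don't silently disappear when the session restarts.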
Please take a look and let me know what you think.

Best,
Kevin Liu

On Thu, Oct 17, 2024 at 6:10 AM Marc Cenac <marc.ce...@datadoghq.com.invalid> wrote:

> Hey Kevin,
>
> This approach sounds good to me, and thanks for your work to improve
> the getting started docs! I would consider using the file-based SQLite
> rather than in-memory, since I've seen some users surprised when they
> realize their tables disappear from the catalog upon restart, but
> either way is a welcome change from the Hadoop catalog.
>
> Thanks!
> -Marc
>
> On Wed, Oct 16, 2024 at 1:42 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>
>> Hey folks,
>>
>> Thanks for the discussions.
>>
>> It seems everyone is in favor of replacing the Hadoop catalog example,
>> and the question now is whether to replace it with the JDBC catalog or the
>> REST catalog.
>>
>> I originally proposed the JDBC catalog as a replacement primarily due to
>> its ease of use. Users can quickly set up a JDBC catalog backed by an
>> in-memory or file-based datastore without needing additional
>> infrastructure. It also aligns with the quick-start ethos of "it just
>> works." That said, I agree that an example of setting up the REST catalog
>> should be part of the getting-started guide since it's the catalog the
>> community has aligned on.
>>
>> Here's what I propose as a middle ground:
>>
>> 1. We replace the Hadoop catalog example with a JDBC catalog backed
>> by an in-memory datastore. This allows users to get started without
>> needing additional infrastructure, which was one of the main benefits
>> of the Hadoop catalog.
>> 2. We add a new section describing the REST catalog, its benefits,
>> and how to set one up. We can use the REST catalog adapter [1], with the
>> adapter using the JDBC catalog as its internal catalog.
>>
>> This approach gives users a way to quickly prototype while also guiding
>> them toward the REST catalog for production use cases.
>>
>> Looking forward to hearing more from you all.
>>
>> Best,
>> Kevin Liu
>>
>> [1] https://lists.apache.org/thread/xl1cwq7vmnh6zgfd2vck2nq7dfd33ncq
>>
>> On Thu, Oct 10, 2024 at 3:44 AM Eduard Tudenhöfner <etudenhoef...@apache.org> wrote:
>>
>>> I would prefer to advocate for the REST catalog in those examples/docs
>>> (similar to how the Spark quickstart example
>>> <https://iceberg.apache.org/spark-quickstart/> uses the REST catalog).
>>> The docs could then refer to the quickstart example to indicate what's
>>> required in terms of services to be started before a user can spawn a
>>> Spark shell.
>>>
>>> On Thu, Oct 10, 2024 at 12:15 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>
>>>> Hi,
>>>>
>>>> As we are talking about "documentation" (quick start/readme), I would
>>>> rather propose to use the REST catalog here instead of JDBC.
>>>>
>>>> As it's the catalog we "promote", I think it would be valuable for
>>>> users to start with the "right thing".
>>>>
>>>> The JDBC catalog is interesting for a quick test/getting started guide,
>>>> but we know how it goes: it will be heavily used (see what happened with
>>>> the HadoopCatalog used in production whereas it should not be :) ).
>>>>
>>>> Regards,
>>>> JB
>>>>
>>>> On Tue, Oct 8, 2024 at 12:18 PM Kevin Liu <kevin.jq....@gmail.com> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > I wanted to bring up a suggestion regarding our current documentation.
>>>> > The existing examples for Iceberg often use the Hadoop catalog, as seen in:
>>>> >
>>>> > Adding a Catalog - Spark Quickstart [1]
>>>> > Adding Catalogs - Spark Getting Started [2]
>>>> >
>>>> > Since we generally advise against using Hadoop catalogs in production
>>>> > environments, I believe it would be beneficial to replace these examples
>>>> > with ones that use the JDBC catalog. The JDBC catalog, configured with a
>>>> > local SQLite database file, offers similar convenience but aligns better
>>>> > with production best practices.
>>>> >
>>>> > I've created an issue [3] and a PR [4] to address this. Please take a
>>>> > look, and I'd love to hear your thoughts on whether this is a direction
>>>> > we want to pursue.
>>>> >
>>>> > Best,
>>>> > Kevin Liu
>>>> >
>>>> > [1] https://iceberg.apache.org/spark-quickstart/#adding-a-catalog
>>>> > [2] https://iceberg.apache.org/docs/nightly/spark-getting-started/#adding-catalogs
>>>> > [3] https://github.com/apache/iceberg/issues/11284
>>>> > [4] https://github.com/apache/iceberg/pull/11285