Re: Storing catalog directly on object store

[email protected] Wed, 27 Nov 2024 10:52:09 -0800

> We deprecated this recently and we don't have to deprecate it if object
> stores support atomic operations like this.

I disagree because this misses many of the reasons for deprecation. It
isn't just that S3 didn't support a `putIfAbsent` operation. Other object
stores did and there are still several problems with this approach. The
fundamental issue is that it is attempting to solve problems at the wrong
level.

One of the reasons why Iceberg exists is that we saw people doing the same
thing with Parquet. People were trying to solve problems with their tables
by attempting to modify Parquet in wacky ways, like wanting to replace
the footer to make schema changes. Schema evolution needed to be solved at
the table level and in this community we've always tried to solve problems
more directly and elegantly by addressing them at the right layer of the
stack.

Iceberg tables scale up existing atomic operations to make transactional
guarantees on very large tables. Object stores and file systems aren't well
suited for this task. Just like they were not sufficient to provide
transactional guarantees across files and partitions, the primitives you
can use aren't sufficient for a database. Storage capabilities are also not
the right place to deliver other catalog features, like basic CRUD
operations.

The addition of `putIfAbsent` to S3 doesn't support transactions where you
need to modify multiple tables and it also doesn't address cases like the
need to atomically rename and delete tables. Schemes that use `putIfAbsent`
also rely either on consistent listing of a large prefix or on maintaining a
version-hint file. That version-hint file can be out of date, so even with
one you still need to list or iteratively attempt to read metadata files to
determine the latest.
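
To make the version-hint problem concrete, here is a minimal sketch in
Python. Everything in it is an assumption for illustration: `InMemoryStore`,
`put_if_absent`, `metadata_key`, and the key layout are hypothetical
stand-ins, not Iceberg or S3 APIs. It shows a writer that wins the
conditional write but never updates the hint, so a reader has to probe
forward from the hint to find the latest metadata file.

```python
class InMemoryStore:
    """Hypothetical stand-in for an object store with a put-if-absent primitive."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, value):
        # Models a conditional write: succeeds only if the key does not exist yet.
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

    def put(self, key, value):
        # Unconditional overwrite, used here only for the hint file.
        self._objects[key] = value

    def get(self, key):
        return self._objects.get(key)


HINT_KEY = "metadata/version-hint.text"  # assumed layout, for illustration only


def metadata_key(version):
    return f"metadata/v{version}.metadata.json"


def commit(store, base_version, metadata, update_hint=True):
    """Try to commit version base_version + 1; return False if another writer won."""
    new_version = base_version + 1
    if not store.put_if_absent(metadata_key(new_version), metadata):
        return False
    # The hint update is a separate, non-atomic step. A writer that stops
    # between the two steps leaves the hint stale.
    if update_hint:
        store.put(HINT_KEY, str(new_version))
    return True


def latest_version(store):
    """Read the hint, then probe forward because the hint may be out of date."""
    hint = store.get(HINT_KEY)
    version = int(hint) if hint is not None else 0
    while store.get(metadata_key(version + 1)) is not None:
        version += 1
    return version
```

Even in this toy version the reader cannot trust the hint, and nothing here
covers more than one table.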

Getting a file-only scheme right is complicated and is specific to a
particular store (both commits and version-hint handling). Local file
systems would use an exclusive create operation to commit, Hadoop uses
atomic rename, and object stores use different `putIfAbsent` operations.
Making this work across languages and engines requires a lot of work to
specify requirements and document issues, only to get to single-table
functionality that doesn't deliver the catalog-level primitives like atomic
rename that are commonly used.
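
As a rough illustration of how store-specific the commit step is, here is a
sketch of the local file system case only, using an exclusive create via
Python's `os.open` with `O_CREAT | O_EXCL`. The function name
`commit_exclusive` and the file naming are hypothetical. Hadoop and object
stores would each need a different primitive in place of the `os.open` call.

```python
import os


def commit_exclusive(directory, version, payload):
    """Commit a metadata file by exclusive create; return False if it already exists."""
    path = os.path.join(directory, f"v{version}.metadata.json")
    try:
        # O_EXCL together with O_CREAT makes the create fail if the path exists,
        # so only one writer can claim a given version.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    # Caveat of this sketch: the file exists before its contents are written,
    # so a concurrent reader could observe an empty or partial file.
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True
```

Note that even this small sketch has a caveat a real specification would have
to address, and it still only covers single-table commits.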

In the end, catalog problems are best solved at the catalog layer, not
through an elaborate scheme that uses storage-layer primitives, just as it
was not a good idea to deliver table behaviors using similar storage-layer
schemes. Adding `putIfAbsent` to S3 doesn't change that design principle.

I sympathize with the idea that it would be great if you didn't need a
catalog. Simpler infrastructure is generally better.

But trying to avoid a catalog limits the capabilities of this
infrastructure, while setting people up for later failure. When I talk with
people that have been trying to avoid having a catalog, they tend to have
tables scattered across buckets that they need to track down, they lack
observability to know what is being used, don't know if they are
deleting data in compliance with regulations, and they often lack simple
and usable access controls.

I think that the solution is to make it easier to run or use a catalog, not
to try to build without one.

And I'm also looking forward to what Jack is alluding to.

On Tue, Nov 26, 2024 at 11:05 PM Ajantha Bhat <[email protected]> wrote:

> Interesting.
>
> We already have file system tables [1] in Iceberg (HadoopCatalog
> implements this spec).
> We deprecated this recently and we don't have to deprecate it if object
> stores support atomic operations like this.
>
> [1] https://iceberg.apache.org/spec/#file-system-tables
>
> - Ajantha
>
> On Wed, Nov 27, 2024 at 2:53 AM Nikhil Benesch <[email protected]>
> wrote:
>
>> Ah, fascinating. Thanks very much for the pointer.
>>
>> Here's the thread introducing the proposal [0], for anyone else curious.
>>
>> [0]: https://lists.apache.org/thread/kh4n98w4z22sc8h2vot4q8n44vdtnltg
>>
>> On Tue, Nov 26, 2024 at 3:27 PM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>> >
>> > Hi Vignesh
>> >
>> > Thanks for the reminder, I remember we quickly discussed this during a
>> > community meeting.
>> >
>> > I will take a new look at the doc.
>> >
>> > Regards
>> > JB
>> >
>> > On Tue, Nov 26, 2024 at 9:19 PM Vignesh <[email protected]> wrote:
>> > >
>> > > Hi,
>> > > There was a proposal along the same lines, for the read portion few
>> weeks back by Ashvin.
>> > >
>> > >
>> https://docs.google.com/document/d/1yzLXSOtzBXyaWHfeVsWsMu4xmOH8rV6QyM5ZAnJZjMQ/edit?usp=drivesdk
>> > >
>> > > Thanks,
>> > > Vignesh.
>> > >
>> > >
>> > > On Tue, Nov 26, 2024, 11:59 AM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>> > >>
>> > >> Hi Nikhil
>> > >>
>> > >> Thanks for your message, very interesting.
>> > >>
>> > >> I think it would be great to involve the Polaris project here as
>> well,
>> > >> as a REST Catalog implementation.
>> > >> The Polaris community is discussing storage/backend right now, so it
>> > >> would be the perfect timing to consider leveraging S3 conditional
>> > >> writes (as a plugin for instance first).
>> > >>
>> > >> I would be happy to connect and know more about your perspective
>> about that.
>> > >>
>> > >> Thanks,
>> > >> Regards
>> > >> JB
>> > >>
>> > >> PS: I will be at AWS re:Invent next week, so maybe we can connect
>> there.
>> > >>
>> > >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <
>> [email protected]> wrote:
>> > >> >
>> > >> > Hi all,
>> > >> >
>> > >> > With Amazon S3 announcing support for the If-Match header
>> yesterday [0], all the
>> > >> > major object store implementations now support a compare-and-swap
>> operation.
>> > >> >
>> > >> > As far as I can tell, this opens up the possibility of storing
>> Iceberg
>> > >> > catalogs directly on object storage, without the need for a
>> separate metastore,
>> > >> > and without violating any of Iceberg's ACID guarantees.
>> > >> >
>> > >> > It seems the immediate next step is to build an independent Java
>> or REST catalog
>> > >> > backend to prove this concept out. Long term, though, the ideal
>> would be to
>> > >> > have such a catalog backend be a first class citizen in the
>> Iceberg project.
>> > >> >
>> > >> > Is anyone else in the Iceberg community barking up this tree? I'm
>> a long term
>> > >> > Iceberg enthusiast, but new to the community. I'd very much
>> appreciate any
>> > >> > pointers to current or past discussions on the topic. So far all
>> I've been
>> > >> > able to turn up is some light chatter from myself and others on
>> Bluesky and
>> > >> > Hacker News ([1][2][3]).
>> > >> >
>> > >> > Cheers,
>> > >> > Nikhil
>> > >> >
>> > >> > [0]:
>> https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
>> > >> > [1]:
>> https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
>> > >> > [2]:
>> https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
>> > >> > [3]: https://news.ycombinator.com/item?id=42240370
>>
>
