Re: Storing catalog directly on object store

Steve Loughran Fri, 06 Dec 2024 05:22:32 -0800

I am not expressing any opinion on the product whatsoever.

What I will note is that I have spent 8 weeks full time this year dealing
with AWS Java SDK problems in the more foundational parts of the SDK.


https://github.com/steveloughran/engineering-proposals/blob/trunk/refactoring-s3a.md#aws-sdk-v2-upgrade-is-a-continuous-source-of-pain-and-a-time-sink

Some of these are obscure, and are most likely to be experienced over
long-haul connections, especially when there is a proxy in the way. Even
then, they are so rare that you are never going to find them during
testing, only in production, and then only because many petabytes of data go
through the s3a code base most days.

* HADOOP-19221 A broken connection during a PUT request of a file while
awaiting a 100 CONTINUE response isn't detected and recovered from. Hard to
find, easy to fix.
* HADOOP-19221 S3A: Unable to recover from failure of multipart block
upload attempt "Status Code: 400; Error Code: RequestTimeout". Somewhat
easier to find, very hard to fix.

Then there are also some which are trivial to identify and affect many
customers, such as HADOOP-19181,
"S3A: IAMCredentialsProvider throttling results in AWS auth failures".
Impala testing found it, because they run many services on a single EC2
instance. Essentially the credential provider only refreshes credentials
off IAM one second before they expire, but if a 503 comes back (as
triggered by those multiple services), it does a jitter of 1-10 seconds
before retrying, by which time the credentials have already expired.

For that I filed a full AWS SDK bug report,
<https://github.com/aws/aws-sdk-java-v2/issues/5247>, with detail, back in
May. It is now December and it has not been fixed, even though the fix,
"refresh credentials more than ten seconds before they expire", is obvious.
This is exactly what the k8s container does. (*)
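
The timing problem can be sketched as simple arithmetic. The Python below is only a model of the behaviour described in this message, not the SDK's code: the function and parameter names are hypothetical, and the one-second prefetch and 1-10 second backoff are taken from the description above.

```python
def survives_one_throttle(prefetch_s: float, backoff_s: float) -> bool:
    """Return True if the retry after one throttled refresh lands before expiry.

    prefetch_s: how long before expiry the first refresh attempt starts.
    backoff_s:  how long the provider sleeps after a 503 before retrying.
    Times are measured relative to the expiry instant (expiry = 0).
    """
    first_attempt = -prefetch_s           # seconds relative to expiry
    retry_at = first_attempt + backoff_s  # when the second attempt happens
    return retry_at < 0


# Behaviour as described: refresh 1 s before expiry, back off 1-10 s on a 503.
described = [survives_one_throttle(1, b) for b in range(1, 11)]

# Suggested fix: start refreshing more than 10 s before expiry.
fixed = [survives_one_throttle(11, b) for b in range(1, 11)]
```

Under these assumptions every entry in `described` is False (a single 503 is enough to outlive the credentials), while every entry in `fixed` is True.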

*The new S3 Table is a completely new store with a completely new API, so a
completely new module in the SDK.*

If foundational things like IAM authentication don't get fixed six months
after a detailed report, do you really want to be fielding S3
Table issues? It is probably quite expensive to test, too. Whoever
implements this is going to be left trying to work around problems. This is
best delegated to the AWS S3Table team, as they may actually get some
support.

-Steve

(*) It would be really good if other people commented on that issue to make
it clear it affects more than just me and my colleagues. Anyone who uses the
Java SDK to get IAM credentials is at risk of this.

On Thu, 5 Dec 2024 at 22:01, Nikhil Benesch <[email protected]>
wrote:

> > - Whether we should build S3 Tables catalog support similar to what we
> do for
> > AWS Glue.
>
> Yes, happy to have someone start that discussion separately, if it makes
> sense
> to do so. Amazon has already provided such a catalog implementation in
> a separate Apache 2.0-licensed project called Amazon S3 Tables Catalog for
> Apache Iceberg [0].
>
> I'm not familiar enough with the way the Iceberg project operates to know
> whether it would make sense to package that implementation as part of the
> official Iceberg distribution.
>
> - Continuing the discussion about the object storage-based catalog.
>
> I'm happy to report that I got pointed at a project that is planning to
> build
> exactly this. [1]
>
> The use case I was interested in is actually entirely solved by S3 Tables,
> so I no longer plan to pursue this. But if someone else is interested in
> picking
> this up, I'm sure Jan Kaul would be eager to collaborate.
>
> [0]: https://github.com/awslabs/s3-tables-catalog
> [1]: https://bsky.app/profile/jankaul.bsky.social/post/3lbutx7ju4k2c
>
> On Wed, Dec 4, 2024 at 12:11 AM Xuanwo <[email protected]> wrote:
> >
> > Hi, Nikhil
> >
> > Thank you very much for bringing S3 tables discussion here.
> >
> > However, I would like to point out that the S3 Table is not the same
> concept we are discussing here. It is not an object storage-based catalog;
> instead, it is a stateful service that provides dedicated APIs. It’s better
> to think of it as another AWS Glue, but internally backed by an S3 bucket.
> >
> > Therefore, I believe we should split this into two separate discussion
> threads:
> >
> > - Whether we should build S3 Tables catalog support similar to what we
> do for AWS Glue.
> > - Continuing the discussion about the object storage-based catalog.
> >
> >
> > On Wed, Dec 4, 2024, at 03:17, Nikhil Benesch wrote:
> >
> > > And I'm also looking forward to what Jack is alluding to.
> >
> > AWS just announced *native* S3 support for Iceberg buckets! [0] This is
> almost surely what Jack was alluding to.
> >
> > This is very cool. It's a much deeper integration than I was expecting
> but nonetheless one that fully satisfies my use case [1].
> >
> > In classic AWS fashion the documentation for the feature has not yet
> been published. I also can't find the "Amazon S3 Tables Catalog for
> Apache Iceberg" package that Jeff Barr references in his announcement post.
> I'll circle back with details once these materials are made available.
> >
> > [0]:
> https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/
> > [1]: We're looking to add a native Iceberg-on-S3 export feature to
> Materialize (https://materialize.com), but without requiring users to
> manage a catalog.
> >
> > On Wed, Nov 27, 2024 at 1:52 PM [email protected] <[email protected]>
> wrote:
> >
> > > We deprecated this recently and we don't have to deprecate it if
> object stores support atomic operations like this.
> >
> > I disagree because this misses many of the reasons for deprecation. It
> isn't just that S3 didn't support a `putIfAbsent` operation. Other object
> stores did and there are still several problems with this approach. The
> fundamental issue is that it is attempting to solve problems at the wrong
> level.
> >
> > One of the reasons why Iceberg exists is that we saw people doing the
> same thing with Parquet. People were trying to solve problems with their
> tables by attempting to modify Parquet in wacky ways, like wanting to
> replace the footer to make schema changes. Schema evolution needed to be
> solved at the table level and in this community we've always tried to solve
> problems more directly and elegantly by addressing them at the right layer
> of the stack.
> >
> > Iceberg tables scale up existing atomic operations to make transactional
> guarantees on very large tables. Object stores and file systems aren't well
> suited for this task. Just like they were not sufficient to provide
> transactional guarantees across files and partitions, the primitives you
> can use aren't sufficient for a database. Storage capabilities are also not
> the right place to deliver other catalog features, like basic CRUD
> operations.
> >
> > The addition of `putIfAbsent` to S3 doesn't support transactions where
> you need to modify multiple tables and it also doesn't address cases like
> the need to atomically rename and delete tables. Schemes that use
> `putIfAbsent` also rely either on consistent listing of a large prefix or on
> maintaining a version-hint file. That version-hint file can be out of date,
> so even with one you still need to list or iteratively attempt to read
> metadata files to determine the latest.
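
As a rough illustration of the stale version-hint problem described above, here is a minimal Python sketch on a local directory. The file names `version-hint.text` and `vN.metadata.json` and the function `latest_version` are assumptions made for this example, not a statement of what any particular catalog does.

```python
import os
import tempfile


def latest_version(table_dir: str) -> int:
    """Find the newest committed version, treating the hint as a lower bound."""
    hint_path = os.path.join(table_dir, "version-hint.text")
    try:
        with open(hint_path) as f:
            version = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        version = 0  # no usable hint: start probing from the beginning

    # The hint may be stale, so keep probing for the next metadata file.
    while os.path.exists(
        os.path.join(table_dir, f"v{version + 1}.metadata.json")
    ):
        version += 1
    return version


table_dir = tempfile.mkdtemp()
for v in (1, 2, 3):
    with open(os.path.join(table_dir, f"v{v}.metadata.json"), "w") as f:
        f.write("{}")
with open(os.path.join(table_dir, "version-hint.text"), "w") as f:
    f.write("1")  # stale: versions 2 and 3 were committed afterwards

resolved = latest_version(table_dir)
```

Even with a hint present, the reader has to issue extra existence checks on every lookup, which is the cost the paragraph above points at.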
> >
> > Getting a file-only scheme right is complicated and is specific to a
> particular store (both commits and version-hint handling). Local file
> systems would use an exclusive create operation to commit, Hadoop uses
> atomic rename, and object stores use different `putIfAbsent` operations.
> Making this work across languages and engines requires a lot of work to
> specify requirements and document issues, only to get to single-table
> functionality that doesn't deliver the catalog-level primitives like atomic
> rename that are commonly used.
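
The local file system case mentioned above can be sketched with Python's exclusive-create mode. This is a simplified illustration only: the `try_commit` helper and the `vN.metadata.json` naming are hypothetical, and it shows just the mutual exclusion, not how a real implementation would make the file contents visible atomically.

```python
import os
import tempfile


def try_commit(table_dir: str, version: int, metadata: str) -> bool:
    """Attempt to publish `metadata` as the given version.

    Mode "x" is an exclusive create: it fails if the file already exists,
    so at most one writer can claim a given version number.
    """
    path = os.path.join(table_dir, f"v{version}.metadata.json")
    try:
        with open(path, "x") as f:
            f.write(metadata)
        return True
    except FileExistsError:
        return False  # another writer won; the caller must re-read and retry


table_dir = tempfile.mkdtemp()
first = try_commit(table_dir, 1, '{"writer": "a"}')
second = try_commit(table_dir, 1, '{"writer": "b"}')  # same version: loses
third = try_commit(table_dir, 2, '{"writer": "b"}')   # retried on next version
```

Note that this covers a single table only. Nothing here helps with renaming a table or committing to several tables at once, which is the gap the paragraph above describes.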
> >
> > In the end, catalog problems are best solved at the catalog layer, not
> through an elaborate scheme that uses storage-layer primitives, just as it
> was not a good idea to deliver table behaviors using similar storage-layer
> schemes. Adding `putIfAbsent` to S3 doesn't change that design principle.
> >
> > I sympathize with the idea that it would be great if you didn't need a
> catalog. Simpler infrastructure is generally better.
> >
> > But trying to avoid a catalog limits the capabilities of this
> infrastructure, while setting people up for later failure. When I talk with
> people that have been trying to avoid having a catalog, they tend to have
> tables scattered across buckets that they need to track down, they lack
> observability to know what is being used, don't know if they are
> deleting data in compliance with regulations, and they often lack simple
> and usable access controls.
> >
> > I think that the solution is to make it easier to run or use a catalog,
> not to try to build without one.
> >
> > And I'm also looking forward to what Jack is alluding to.
> >
> > On Tue, Nov 26, 2024 at 11:05 PM Ajantha Bhat <[email protected]>
> wrote:
> >
> > Interesting.
> >
> > We already have file system tables [1] in Iceberg (HadoopCatalog
> implements this spec).
> > We deprecated this recently and we don't have to deprecate it if object
> stores support atomic operations like this.
> >
> > [1] https://iceberg.apache.org/spec/#file-system-tables
> >
> > - Ajantha
> >
> > On Wed, Nov 27, 2024 at 2:53 AM Nikhil Benesch <[email protected]>
> wrote:
> >
> > Ah, fascinating. Thanks very much for the pointer.
> >
> > Here's the thread introducing the proposal [0], for anyone else curious.
> >
> > [0]: https://lists.apache.org/thread/kh4n98w4z22sc8h2vot4q8n44vdtnltg
> >
> > On Tue, Nov 26, 2024 at 3:27 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
> > >
> > > Hi Vignesh
> > >
> > > Thanks for the reminder, I remember we quickly discussed this during a
> > > community meeting.
> > >
> > > I will take a new look at the doc.
> > >
> > > Regards
> > > JB
> > >
> > > On Tue, Nov 26, 2024 at 9:19 PM Vignesh <[email protected]>
> wrote:
> > > >
> > > > Hi,
> > > > There was a proposal along the same lines, for the read portion, a few
> weeks back by Ashvin.
> > > >
> > > >
> https://docs.google.com/document/d/1yzLXSOtzBXyaWHfeVsWsMu4xmOH8rV6QyM5ZAnJZjMQ/edit?usp=drivesdk
> > > >
> > > > Thanks,
> > > > Vignesh.
> > > >
> > > >
> > > > On Tue, Nov 26, 2024, 11:59 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
> > > >>
> > > >> Hi Nikhil
> > > >>
> > > >> Thanks for your message, very interesting.
> > > >>
> > > >> I think it would be great to involve the Polaris project here as
> well,
> > > >> as a REST Catalog implementation.
> > > >> The Polaris community is discussing storage/backend right now, so it
> > > >> would be the perfect timing to consider leveraging S3 conditional
> > > >> writes (as a plugin for instance first).
> > > >>
> > > >> I would be happy to connect and know more about your perspective
> about that.
> > > >>
> > > >> Thanks,
> > > >> Regards
> > > >> JB
> > > >>
> > > >> PS: I will be at AWS re:Invent next week, so maybe we can connect
> there.
> > > >>
> > > >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <
> [email protected]> wrote:
> > > >> >
> > > >> > Hi all,
> > > >> >
> > > >> > With Amazon S3 announcing support for the If-Match header
> yesterday [0], all the
> > > >> > major object store implementations now support a compare-and-swap
> operation.
> > > >> >
> > > >> > As far as I can tell, this opens up the possibility of storing
> Iceberg
> > > >> > catalogs directly on object storage, without the need for a
> separate metastore,
> > > >> > and without violating any of Iceberg's ACID guarantees.
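
To make the compare-and-swap idea above concrete, here is a toy in-memory model in Python. It is not the S3 API: `InMemoryStore`, `put_if_match`, and the pointer key are hypothetical names, the ETag is simply a content hash, and passing `None` as the expected ETag is this sketch's way of saying "create only if absent". A real store performs the check and the write atomically on the server side; this single-threaded model only shows the control flow.

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when the caller's expected ETag does not match the stored one."""


class InMemoryStore:
    """Toy stand-in for an object store with conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def get(self, key):
        return self._objects[key]

    def put_if_match(self, key, body: bytes, expected_etag):
        current = self._objects.get(key)
        current_etag = current[0] if current else None
        if current_etag != expected_etag:
            raise PreconditionFailed(key)
        etag = hashlib.sha256(body).hexdigest()
        self._objects[key] = (etag, body)
        return etag


store = InMemoryStore()
key = "catalog/table.pointer"
etag_v1 = store.put_if_match(key, b"v1.metadata.json", None)

# Two writers both read etag_v1, then race to swap the pointer.
etag_v2 = store.put_if_match(key, b"v2-a.metadata.json", etag_v1)
try:
    store.put_if_match(key, b"v2-b.metadata.json", etag_v1)
    loser_committed = True
except PreconditionFailed:
    loser_committed = False  # the loser must re-read and retry
```

The second writer's swap is rejected because the pointer no longer carries the ETag it read, so only one of the two concurrent commits takes effect.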
> > > >> >
> > > >> > It seems the immediate next step is to build an independent Java
> or REST catalog
> > > >> > backend to prove this concept out. Long term, though, the ideal
> would be to
> > > >> > have such a catalog backend be a first class citizen in the
> Iceberg project.
> > > >> >
> > > >> > Is anyone else in the Iceberg community barking up this tree? I'm
> a long term
> > > >> > Iceberg enthusiast, but new to the community. I'd very much
> appreciate any
> > > >> > pointers to current or past discussions on the topic. So far all
> I've been
> > > >> > able to turn up is some light chatter from myself and others on
> Bluesky and
> > > >> > Hacker News ([1][2][3]).
> > > >> >
> > > >> > Cheers,
> > > >> > Nikhil
> > > >> >
> > > >> > [0]:
> https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
> > > >> > [1]:
> https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
> > > >> > [2]:
> https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
> > > >> > [3]: https://news.ycombinator.com/item?id=42240370
> >
> > Xuanwo
> >
> > https://xuanwo.io/
> >
>
