Re: Storing catalog directly on object store

2024-12-06 Thread Steve Loughran
I am not expressing any opinion on the product whatsoever. What I will note is that I have spent 8 weeks full time this year dealing with AWS Java SDK problems in the more foundational parts of the SDK. https://github.com/steveloughran/engineering-proposals/blob/trunk/refactoring-s3a.md#aws-sdk-v

Re: Storing catalog directly on object store

2024-12-05 Thread Nikhil Benesch
> - Whether we should build S3 Tables catalog support similar to what we do for > AWS Glue. Yes, happy to have someone start that discussion separately, if it makes sense to do so. Amazon has already provided such a catalog implementation in a separate Apache 2.0-licensed project called Amazon S3

Re: Storing catalog directly on object store

2024-12-03 Thread Vladimir Ozerov
I second Ryan’s opinion that a production-grade catalog is a much broader concept than just CAS-ing the pointer. What we observe in practice at our company is that users want to work with large schemas (sometimes with literally thousands of schemas and millions of tables), have support for common DDL o

Re: Storing catalog directly on object store

2024-12-03 Thread Xuanwo
Hi Nikhil, thank you very much for bringing the S3 Tables discussion here. However, I would like to point out that S3 Tables is not the same concept we are discussing here. It is not an object storage-based catalog; instead, it is a stateful service that provides dedicated APIs. It’s better to

Re: Storing catalog directly on object store

2024-12-03 Thread Nikhil Benesch
> And I'm also looking forward to what Jack is alluding to. AWS just announced *native* S3 support for Iceberg buckets! [0] This is almost surely what Jack was alluding to. This is very cool. It's a much deeper integration than I was expecting but nonetheless one that fully satisfies my use case

Re: Storing catalog directly on object store

2024-11-27 Thread rdb...@gmail.com
> We deprecated this recently and we don't have to deprecate it if object stores support atomic operations like this. I disagree because this misses many of the reasons for deprecation. It isn't just that S3 didn't support a `putIfAbsent` operation. Other object stores did and there are still seve

Re: Storing catalog directly on object store

2024-11-27 Thread Steve Loughran
There's a PR up from Amazon to add this to the s3a connector, https://github.com/apache/hadoop/pull/7011, targeting a 3.4.2 release early next year, though they've not updated the PR as requested yet. 1. It doesn't give you the same semantics as a POSIX create-no-overwrite call - you only get t

Re: Storing catalog directly on object store

2024-11-27 Thread Alex Merced
Ignore the last email; I just re-read the proposal earlier in the email chain. On Wed, Nov 27, 2024 at 11:37 AM Alex Merced wrote: > This is just a quick thought to put out there: If there will be a new > reimagining of a file system catalog, would it be worth adding a > multi-table layer on top? >

Re: Storing catalog directly on object store

2024-11-27 Thread Alex Merced
This is just a quick thought to put out there: If there will be a new reimagining of a file system catalog, would it be worth adding a multi-table layer on top? *As a rough example:* - At the TOP is a JSON file that is just a mapping of the table name to the directory where VERSION-HINT would be

Re: Storing catalog directly on object store

2024-11-27 Thread Manu Zhang
I think one major issue with the current HadoopCatalog is that there's no way to manage tables by name. If we add a metadata layer on top of it, we need to handle more consistency challenges. Manu On Wed, Nov 27, 2024 at 8:03 PM Gabor Kaszab wrote: > Hi All, > > Xuanwo, I recall the reasoning aga

Re: Storing catalog directly on object store

2024-11-27 Thread Gabor Kaszab
Hi All, Xuanwo, I recall the reasoning against HadoopCatalog was the other way around: even though it is safe to use on HDFS, it is unsafe on object storage. I believe this functionality gap in object stores is going away, so for me HadoopCatalog would make even more sense now than be

Re: Storing catalog directly on object store

2024-11-27 Thread Xuanwo
Hi I believe we still need to deprecate HadoopCatalog since the operation is still not safe on Hadoop. As raised by Jack Ye before, I suggest we consider having a StorageCatalog or ObjectStorageCatalog that can only be used with storage services supporting conditional writes. That would be a go
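
Editor's sketch of the conditional-write idea raised in this message: a minimal, in-memory illustration, not code from the thread or from Iceberg. It assumes a store that exposes a put-if-absent primitive and a commit that creates the next numbered metadata file. The names InMemoryObjectStore, put_if_absent, commit, current_version, and the key layout are all hypothetical.

```python
import threading


class InMemoryObjectStore:
    """Toy stand-in for an object store with a conditional write (hypothetical)."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key, data):
        """Store data under key only if key does not exist yet.

        Returns True if the write happened, False if the key was already taken.
        """
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = data
            return True

    def list_keys(self, prefix):
        """Return all keys that start with prefix."""
        with self._lock:
            return [k for k in self._objects if k.startswith(prefix)]


def commit(store, table, base_version, metadata):
    """Try to publish metadata as version base_version + 1.

    Of several writers racing from the same base version, only one can
    create the next file; the others get False and must re-read and retry.
    The key layout here is an assumption made for illustration.
    """
    key = f"{table}/metadata/v{base_version + 1}.metadata.json"
    return store.put_if_absent(key, metadata)


def current_version(store, table):
    """Return the highest committed version number, or 0 if there is none."""
    prefix = f"{table}/metadata/v"
    versions = [int(k[len(prefix):].split(".")[0]) for k in store.list_keys(prefix)]
    return max(versions, default=0)
```

In this sketch a writer that loses the race learns about it from the False return value rather than silently overwriting the winner, which is the property a conditional write is meant to provide.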

Re: Storing catalog directly on object store

2024-11-26 Thread Nikhil Benesch
Makes sense! I'd be eager to chat more about this but I'm afraid I won't be at re:Invent. Maybe we plan to circle back after re:Invent, once we see what AWS announces? On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré wrote: > > Hi Nikhil > > Thanks for your message, very interesting. > > I th

Re: Storing catalog directly on object store

2024-11-26 Thread Nikhil Benesch
Indeed, I got pointed at that feature on Bluesky earlier today [0]. I dredged up the mailing list discussion that occurred around its deprecation, and this exact point actually came up. There was some concern from Ryan that the complexity of keeping the file system tables around just wasn't worth i

Re: Storing catalog directly on object store

2024-11-26 Thread Ajantha Bhat
Interesting. We already have file system tables [1] in Iceberg (HadoopCatalog implements this spec). We deprecated this recently and we don't have to deprecate it if object stores support atomic operations like this. [1] https://iceberg.apache.org/spec/#file-system-tables - Ajantha On Wed, Nov

Re: Storing catalog directly on object store

2024-11-26 Thread Nikhil Benesch
Ah, fascinating. Thanks very much for the pointer. Here's the thread introducing the proposal [0], for anyone else curious. [0]: https://lists.apache.org/thread/kh4n98w4z22sc8h2vot4q8n44vdtnltg On Tue, Nov 26, 2024 at 3:27 PM Jean-Baptiste Onofré wrote: > > Hi Vignesh > > Thanks for the reminde

Re: Storing catalog directly on object store

2024-11-26 Thread Jean-Baptiste Onofré
Hi Vignesh Thanks for the reminder, I remember we quickly discussed this during a community meeting. I will take a new look at the doc. Regards JB On Tue, Nov 26, 2024 at 9:19 PM Vignesh wrote: > > Hi, > There was a proposal along the same lines, for the read portion few weeks > back by Ashvi

Re: Storing catalog directly on object store

2024-11-26 Thread Vignesh
Hi, There was a proposal along the same lines, for the read portion, a few weeks back by Ashvin. https://docs.google.com/document/d/1yzLXSOtzBXyaWHfeVsWsMu4xmOH8rV6QyM5ZAnJZjMQ/edit?usp=drivesdk Thanks, Vignesh. On Tue, Nov 26, 2024, 11:59 AM Jean-Baptiste Onofré wrote: > Hi Nikhil > > Thanks for

Re: Storing catalog directly on object store

2024-11-26 Thread Jean-Baptiste Onofré
Hi Nikhil Thanks for your message, very interesting. I think it would be great to involve the Polaris project here as well, as a REST Catalog implementation. The Polaris community is discussing storage/backend right now, so this would be perfect timing to consider leveraging S3 conditional writ

Re: Storing catalog directly on object store

2024-11-26 Thread Nikhil Benesch
Talk about tenterhooks! But okay, I take your hint. :) On Tue, Nov 26, 2024 at 2:17 PM Jack Ye wrote: > > Hi Nikhil, > > I am also personally very excited about S3 adding this support! > > I would suggest we discuss this after the AWS re:invent 2024 event that is > coming right next week, as the

Re: Storing catalog directly on object store

2024-11-26 Thread Jack Ye
Hi Nikhil, I am also personally very excited about S3 adding this support! I would suggest we discuss this after the AWS re:Invent 2024 event coming up next week, as there are going to be more S3 feature announcements during that week, and the community can have a more comprehensive di