Re: Storing catalog directly on object store

Alex Merced Wed, 27 Nov 2024 08:37:59 -0800

Just a quick thought to put out there: if the file system catalog is going
to be reimagined, would it be worth adding a multi-table layer on top?


*As a rough example:*

- At the top is a JSON file that simply maps each table name to the
directory where its VERSION-HINT is found (so this file is only updated
when tables are created or dropped)
- The engine then finds the directory and uses the VERSION-HINT as usual to
discover the metadata and plan the scan

This way, you have a listing of all your tables, so you don't have to
re-register each table with each tool, but you can still avoid running a
full service on top for basic use cases.
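
To make the lookup above concrete, here is a minimal sketch in Python. It is
only an illustration under assumptions, not a proposed implementation: the
index file name (catalog.json), the function names, and the use of a local
file system are all hypothetical, and the table layout (a
metadata/version-hint.text file holding an integer N that points at
metadata/vN.metadata.json) is assumed to follow the HadoopCatalog convention.

```python
import json
from pathlib import Path


def load_table_index(index_path):
    """Read the top-level JSON file: {table name: table directory}."""
    return json.loads(Path(index_path).read_text())


def current_metadata_path(index_path, table_name):
    """Resolve a table name to its current metadata file via the version hint."""
    index = load_table_index(index_path)
    # Raises KeyError if the table is not registered in the index.
    table_dir = Path(index[table_name])
    # Assumed layout: <table_dir>/metadata/version-hint.text holds an integer N,
    # and the current metadata is <table_dir>/metadata/vN.metadata.json.
    hint = (table_dir / "metadata" / "version-hint.text").read_text().strip()
    return table_dir / "metadata" / f"v{int(hint)}.metadata.json"
```

The index file is only read here; in this sketch it would be rewritten only
when a table is created or dropped, while commits to a table touch just that
table's own version hint.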

*Governance in this Type of Catalog:*

- You can group different tables into different JSON files/catalogs
- Then file access controls on the JSON file can be used as a simple way to
control user access to groups of tables


On Wed, Nov 27, 2024 at 8:27 AM Manu Zhang <[email protected]> wrote:

> I think one major issue with current HadoopCatalog is that there's no way
> to manage tables by name. If adding one metadata layer on top of it, we
> need to handle more consistency challenges.
>
> Manu
>
> On Wed, Nov 27, 2024 at 8:03 PM Gabor Kaszab <[email protected]>
> wrote:
>
>> Hi All,
>>
>> Xuanwo, I recall the reasoning against HadoopCatalog was the other way
>> around: even though it is safe to use on HDFS, it is unsafe on object
>> storage. I believe that this gap of functionalities of object stores seems
>> to go away, so for me HadoopCatalog would even make more sense now than
>> before. The name might not be straightforward as it's not just for Hadoop.
>>
>> Regards,
>> Gabor
>>
>>
>> On Wed, Nov 27, 2024 at 9:02 AM Xuanwo <[email protected]> wrote:
>>
>>> Hi
>>>
>>> I believe we still need to deprecate HadoopCatalog since the operation
>>> is still not safe on Hadoop. As raised by Jack Ye before, I suggest we
>>> consider having a StorageCatalog or ObjectStorageCatalog that can only be
>>> used with storage services supporting conditional writes. That would be a
>>> good approach.
>>>
>>> On Wed, Nov 27, 2024, at 15:47, Nikhil Benesch wrote:
>>> > Makes sense! I'd be eager to chat more about this but I'm afraid I won't
>>> > be at re:Invent. Maybe we plan to circle back after re:Invent, once we
>>> > see what AWS announces?
>>> >
>>> > On Tue, Nov 26, 2024 at 2:58 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>> >>
>>> >> Hi Nikhil
>>> >>
>>> >> Thanks for your message, very interesting.
>>> >>
>>> >> I think it would be great to involve the Polaris project here as well,
>>> >> as a REST Catalog implementation.
>>> >> The Polaris community is discussing storage/backend right now, so it
>>> >> would be the perfect timing to consider leveraging S3 conditional
>>> >> writes (as a plugin for instance first).
>>> >>
>>> >> I would be happy to connect and know more about your perspective
>>> >> about that.
>>> >>
>>> >> Thanks,
>>> >> Regards
>>> >> JB
>>> >>
>>> >> PS: I will be at AWS re:Invent next week, so maybe we can connect there.
>>> >>
>>> >> On Tue, Nov 26, 2024 at 6:35 PM Nikhil Benesch <[email protected]> wrote:
>>> >> >
>>> >> > Hi all,
>>> >> >
>>> >> > With Amazon S3 announcing support for the If-Match header yesterday [0],
>>> >> > all the major object store implementations now support a compare-and-swap
>>> >> > operation.
>>> >> >
>>> >> > As far as I can tell, this opens up the possibility of storing Iceberg
>>> >> > catalogs directly on object storage, without the need for a separate
>>> >> > metastore, and without violating any of Iceberg's ACID guarantees.
>>> >> >
>>> >> > It seems the immediate next step is to build an independent Java or REST
>>> >> > catalog backend to prove this concept out. Long term, though, the ideal
>>> >> > would be to have such a catalog backend be a first class citizen in the
>>> >> > Iceberg project.
>>> >> >
>>> >> > Is anyone else in the Iceberg community barking up this tree? I'm a long
>>> >> > term Iceberg enthusiast, but new to the community. I'd very much
>>> >> > appreciate any pointers to current or past discussions on the topic. So
>>> >> > far all I've been able to turn up is some light chatter from myself and
>>> >> > others on Bluesky and Hacker News ([1][2][3]).
>>> >> >
>>> >> > Cheers,
>>> >> > Nikhil
>>> >> >
>>> >> > [0]: https://aws.amazon.com/about-aws/whats-new/2024/11/amazon-s3-functionality-conditional-writes/
>>> >> > [1]: https://bsky.app/profile/benesch.bsky.social/post/3lauesxg3ic2c
>>> >> > [2]: https://bsky.app/profile/eatonphil.bsky.social/post/3lbskq3jwk22e
>>> >> > [3]: https://news.ycombinator.com/item?id=42240370
>>>
>>> --
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>

-- 

*Alex Merced* <https://bio.alexmerced.com/data>
*Senior Tech Evangelist, Dremio*
*Dremio.com*
<https://www.dremio.com/?utm_medium=email&utm_source=signature&utm_term=na&utm_content=email-signature&utm_campaign=email-signature>
*Follow Us on LinkedIn!* <https://www.linkedin.com/company/dremio>
*Resources for Getting Hands-on with Apache Iceberg/Dremio*
<https://medium.com/data-engineering-with-dremio/a-deep-intro-to-apache-iceberg-and-resources-for-learning-more-be51535cff74>
