No worries about the name. It is a possible alternative spelling :)

On Wed, May 7, 2025 at 8:04 PM yun zou <yunzou.colost...@gmail.com> wrote:

> Hi Dmitri,
>
> Sorry, I accidentally typed your name wrong in the previous reply!
> Apologies for this!
>
> For the S3 issue, I think we will need to deal with it regardless.
> Especially with the federation work going on, we will eventually need to
> handle entities coming from different catalogs, and the URI format seems
> to be the standard format used by various catalog services.
>
> Best Regards,
> Yun
>
> On Wed, May 7, 2025 at 4:55 PM yun zou <yunzou.colost...@gmail.com> wrote:
>
> > Hi Dimitri and Eric,
> >
> > Thanks a lot for the feedback!
> >
> > For the questions:
> > - Is it one value or many?
> > It will be one value, similar to the location in Iceberg and the
> > storage_location in unity catalog.
> >
> > Regarding the point about having new data in new locations and keeping
> > old data in old locations, do we support that for Iceberg
> > today?
> > Most Spark tables seem to have only one location. Also, I
> > think it is better to start restricted and then extend it to
> > allow multiple locations when the use case arises.
> >
> > Ref:
> > Iceberg location:
> > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> > Storage location in Unity Catalog:
> > https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml#L3451
> >
> > - Is it a URI?
> > Yes, it will be a URI, which seems to be the standard across catalog
> > implementations.
> > Regarding the point about s3 vs. s3a, I assume that is a common
> > problem that every catalog implementation needs to address, and we will
> > stay the same on this part. At least from the load table point of view,
> > the Spark engine knows how to deal with such cases.
> >
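
For illustration only: the thread does not specify how s3 vs. s3a should be reconciled. Below is a minimal sketch of one possible client-side approach, assuming that the `s3`, `s3a`, and `s3n` schemes all address the same bucket and key space. The helper names `canonical_location` and `same_location` are hypothetical and not part of any Polaris or Iceberg API.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: these Hadoop-style schemes refer to the same S3 bucket/key space.
S3_ALIASES = {"s3", "s3a", "s3n"}


def canonical_location(location: str) -> str:
    """Fold S3 scheme aliases to "s3" and drop any trailing slash."""
    parts = urlsplit(location)
    scheme = parts.scheme.lower()
    if scheme in S3_ALIASES:
        scheme = "s3"
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, parts.netloc, path, parts.query, parts.fragment))


def same_location(a: str, b: str) -> bool:
    """Compare two locations after scheme-alias normalization."""
    return canonical_location(a) == canonical_location(b)
```

With this sketch, `same_location("s3://bucket/db/t1", "s3a://bucket/db/t1/")` is true, while locations on different storage systems stay distinct.
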
> > - Does it point to any particular file?
> > No, it doesn't point to a particular file. It is the base table location.
> >
> > - Is it a common prefix of all files within a table?
> > It is supposed to be the base table location, which in theory should
> > be the common prefix of all files within a table, I believe.
> >
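
To make the "common prefix" expectation concrete, here is a minimal sketch of how a client might check that a file sits under the base table location. The function name `is_under_base` is hypothetical, and the thread states that the catalog itself will not perform such a check.

```python
def is_under_base(base_location: str, file_location: str) -> bool:
    """True if file_location lies below base_location.

    The base is compared with a trailing slash so that a sibling such as
    ".../t10" is not mistaken for a child of ".../t1".
    """
    base = base_location.rstrip("/") + "/"
    return file_location.startswith(base)
```

Comparing on a path-segment boundary, rather than with a bare string prefix, avoids false matches between tables whose names share a prefix.
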
> > - What happens when a value does not match these expectations?
> > Whether it is one value or many is restricted by the spec already.
> > For the URI format, I think we can do a format check and fail the request
> > if the check does not pass.
> > Other than that, we will not do any other special checks, and we rely on
> > the client to provide the correct value; otherwise, other engines will
> > not be able to read the table successfully.
> >
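
As a rough idea of what such a format check could look like, here is a minimal sketch. It assumes that "valid" means an absolute URI with a scheme and a non-empty remainder; the actual rule is not defined in this thread. The function name `validate_location` is hypothetical.

```python
from urllib.parse import urlsplit


def validate_location(location: str) -> str:
    """Return the location unchanged if it looks like an absolute URI.

    Raises ValueError otherwise, so the caller can fail the request.
    """
    parts = urlsplit(location)
    if not parts.scheme:
        raise ValueError(f"location has no URI scheme: {location!r}")
    if not parts.netloc and not parts.path:
        raise ValueError(f"location has a scheme but no path: {location!r}")
    return location
```

Under these assumptions, `s3://bucket/db/t1` and `file:///tmp/t1` pass, while a bare path such as `/db/t1` is rejected.
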
> > For the location keyword, as Eric has pointed out, we can potentially have
> > a reserved key for the properties. However, location is a common
> > enough key among various table formats that it is worth a dedicated key to
> > help store and load the information in a more straightforward
> > way. For things that are specific to one or two formats, I think it makes
> > more sense to use a reserved property key.
> >
> > As a reference, in Iceberg, the CreateTable request and TableMetadata do
> > have an explicit location key in the spec. write.data.path
> > and write.metadata.path are passed as properties today.
> >
> > Best Regards,
> > Yun
> >
> >
> > On Wed, May 7, 2025 at 3:54 PM Dmitri Bourlatchkov <di...@apache.org>
> > wrote:
> >
> >> Another point: I'm pretty sure sooner or later users will want to move
> >> their data to some other location. As an option, users may want to write new
> >> files into another location but keep old files in place.
> >>
> >> Also: if the location is a URI, how do we deal with s3 vs. s3a, for example?
> >>
> >> In Iceberg it is quite common for different engines to use different access
> >> tools, which often leads to different URI schemes.
> >>
> >> Cheers,
> >> Dmitri.
> >>
> >> On Wed, May 7, 2025 at 6:46 PM Eric Maynard <eric.w.mayn...@gmail.com>
> >> wrote:
> >>
> >> > All good questions, Dmitri. I’m especially interested in the first one, as
> >> > from what I understand Iceberg tables can have metadata and data at two
> >> > different paths that we need to vend credentials for.
> >> >
> >> > For iceberg tables, we just use special properties to track these
> >> > locations. I wonder if we couldn’t do the same for generic tables.
> >> >
> >> > On Wed, May 7, 2025 at 3:42 PM Dmitri Bourlatchkov <di...@apache.org>
> >> > wrote:
> >> >
> >> > > Hi Yun,
> >> > >
> >> > > Please clarify the meaning of the value of the new location attribute.
> >> > >
> >> > > - Is it one value or many?
> >> > > - Is it a URI?
> >> > > - Does it point to any particular file?
> >> > > - Is it a common prefix of all files within a table?
> >> > > - What happens when a value does not match these expectations?
> >> > >
> >> > > Thanks,
> >> > > Dmitri.
> >> > >
> >> > > On 2025/05/07 21:50:19 yun zou wrote:
> >> > > > Hi folks,
> >> > > >
> >> > > > I would like to propose adding an optional `location` field to the
> >> > > > CreateGenericTable request and LoadGenericTable response.
> >> > > >
> >> > > > The `location` is the location of the table, which is common to most table
> >> > > > formats, including Iceberg, Delta, Hudi, CSV, Parquet, etc. The location
> >> > > > information is critical for loading the table on the engine side. Having a
> >> > > > dedicated keyword could help improve the robustness of cross-engine
> >> > > > sharing, instead of relying on the properties passed by the client side.
> >> > > >
> >> > > > Furthermore, this information is also required to provide credential
> >> > > > vending capabilities later.
> >> > > >
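
To picture the shape of the proposal, here is a minimal sketch of a create request that carries an optional `location`. Only the optional `location` field comes from the proposal; the class layout and the other field names (`name`, `format`, `properties`) are illustrative assumptions, and the authoritative schema is the one in the linked PR.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CreateGenericTableRequest:
    # Hypothetical fields for illustration; see the PR for the real schema.
    name: str
    format: str
    location: Optional[str] = None  # the proposed optional field
    properties: dict = field(default_factory=dict)

    def to_payload(self) -> dict:
        """Build a JSON-ready dict, omitting location when it is not set."""
        payload = asdict(self)
        if payload["location"] is None:
            del payload["location"]
        return payload
```

Because the field is optional, a request without a location serializes exactly as it would have before the change.
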
> >> > > > Here is the PR for adding the spec:
> >> > > > https://github.com/apache/polaris/pull/1543
> >> > > >
> >> > > > Looking forward to your reply and feedback!
> >> > > >
> >> > > > Best Regards,
> >> > > > Yun
> >> > > >
> >> > >
> >> >
> >>
> >
>
