Re: Current Status of View Specification

Ryan Blue Thu, 04 May 2023 16:52:43 -0700

> The assumption helps if all tables are in the same catalog, and the view
is also in catalog but this assumption can potentially break down


Yeah, I think we have to go with option 2. Otherwise you'd be embedding
catalog config in views and it starts getting really ugly.

I don't really see a path where we don't assume that you've used consistent
naming across engines. Not using consistent names would be weird, and using
names that actually conflict would be even worse. If you want reliability
you have to use the same names for views, as well as people. It would be
bonkers to see catalogs mean different things in different places.

Not only that, this is also how columns work. I don't know of anything that
tracks view columns by their context and ID. If your view is "select a, b
from table" and "a" is dropped and re-created, you get the new one right?

On Tue, May 2, 2023 at 4:40 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Sorry for the late reply.
>
> I put together https://github.com/apache/iceberg/pull/7504/files to cover
> some of the answers here.
>
> To follow-up on:
>
>> > 1.  If `default-catalog`, `default-namespace` are not specified I would
>> guess it is assumed that any non-fully qualified tables should be resolved
>> against the same catalog and namespace that the view is registered in?
>> I think that names should be fully qualified if these are not set. I
>> don't think that we want to have the engine guess based on the location of
>> the view, but I'd be open to discussion to clarify the spec on this.
>
>
> To clarify here, the use-case I'm considering here is federation across
> catalogs systems.  The assumption helps if all tables are in the same
> catalog, and the view is also in catalog but this assumption can
> potentially break down.  Off the top of my head it seems like there are two
> options to not rely on implicit behavior:
> 1.  Include configuration in the view definitions on what system a catalog
> in a table name refers to (e.g. catalog xyz=HMS endpoint, catalog
> abc=Iceberg REST endpoint, etc).
> 2.  Declare this problem out of scope for the view specification and
> require users to ensure all query engines have their catalogs configured in
> a uniform way (this seems somewhat painful of an end-user experience).
>
> Thanks,
> Micah
>
>
>
>
>
>
>
>
> On Wed, Mar 15, 2023 at 10:56 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Thanks Ryan for the quick response.
>>
>> I think that names should be fully qualified if these are not set. I
>>> don't think that we want to have the engine guess based on the location of
>>> the view, but I'd be open to discussion to clarify the spec on this.
>>
>> One issue I think with supporting multiple dialects is that there are
>> varying supports for how many namespaces they support.  And if someone is
>> designing a workflow to generate SQL in multiple dialects they might not
>> know apriori how to map from one set of namespaces to the other set.  Being
>> able to late bind here, to know that all tables references are under the
>> same namespace could be a work-around for this issue but.
>>
>> We have a "dialect" field on the SQL view representation, and we should
>>> document what goes into that field. I think it makes sense to have a
>>> version of the dialect as well.
>>
>>
>> Right, the situation I would like to avoid is having different people
>> choosing slightly different naming schemas for the same engine (i.e.
>> "Spark", "spark" and "spark-sql").  It might pay to have namespacing scheme
>> that allows for people freedom to choose whatever they want, but reserving
>> a prefix for well understood engines in the Iceberg community (e.g. a
>> prefix of "iceberg-engine." could be reserved as denoting engines that the
>> community has officially agreed on naming for, so "iceberg-engine.spark" to
>> continue the analogy)
>>
>> I'll try to put together a pull request on the other items as they don't
>> seem like they warrant more conversation.
>>
>> Thanks,
>> Micah
>>
>> On Wed, Mar 15, 2023 at 1:56 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> > 1.  If `default-catalog`, `default-namespace` are not specified I
>>> would guess it is assumed that any non-fully qualified tables should be
>>> resolved against the same catalog and namespace that the view is registered
>>> in?
>>>
>>> I think that names should be fully qualified if these are not set. I
>>> don't think that we want to have the engine guess based on the location of
>>> the view, but I'd be open to discussion to clarify the spec on this.
>>>
>>> > 2.  It seems like there are two optional methods for specifying
>>> documentation for columns in the view, one is field-docs, and one is the
>>> doc field on schema.  Is the assumption that only one or the other would be
>>> set?
>>>
>>> The field-docs take precedence. The format we're using follows the SQL
>>> syntax, where you can create a view with top-level column names and field
>>> docs. We store everything that comes through to avoid losing information.
>>>
>>> > 3.  Are field-aliases intended to be further documentation or consumed
>>> programmatically (i.e. can they be referenced in queries that use the view)?
>>>
>>> These are also from the SQL syntax and should be applied like field
>>> comments to rename fields.
>>>
>>> > 4.  Are there any thoughts on standardization of dialects, at least
>>> for common OSS query engines?  Should there be an option to encode dialect
>>> version encoded at least informationally in case a view uses a SQL feature
>>> that was only introduced in later versions?
>>>
>>> We have a "dialect" field on the SQL view representation, and we should
>>> document what goes into that field. I think it makes sense to have a
>>> version of the dialect as well.
>>>
>>> On Tue, Mar 14, 2023 at 11:48 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> I apologize if some of these items are already discussed or too obvious
>>>> but a few questions after reading the spec in a little more detail:
>>>> 1.  If `default-catalog`, `default-namespace` are not specified I would
>>>> guess it is assumed that any non-fully qualified tables should be resolved
>>>> against the same catalog and namespace that the view is registered in?
>>>> 2.  It seems like there are two optional methods for specifying
>>>> documentation for columns in the view, one is field-docs, and one is the
>>>> doc field on schema.  Is the assumption that only one or the other would be
>>>> set?
>>>> 3.  Are field-aliases intended to be further documentation or consumed
>>>> programmatically (i.e. can they be referenced in queries that use the 
>>>> view)?
>>>> 4.  Are there any thoughts on standardization of dialects, at least for
>>>> common OSS query engines?  Should there be an option to encode
>>>> dialect version encoded at least informationally in case a view uses a SQL
>>>> feature that was only introduced in later versions?
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Tue, Mar 14, 2023 at 8:02 PM John Zhuge <jzh...@apache.org> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Mar 14, 2023 at 7:31 PM Walaa Eldin Moustafa <
>>>>> wa.moust...@gmail.com> wrote:
>>>>>
>>>>>> +1 to get a basic implementation in. Some of the discussions/feedback
>>>>>> on the API PR slightly changed the API from the initial proposed API that
>>>>>> probably more closely resembled Netflix's implementation. Getting an
>>>>>> implementation going on the finalized APIs could give some good feedback 
>>>>>> to
>>>>>> the spec or the finalized APIs.
>>>>>>
>>>>>> On Tue, Mar 14, 2023 at 7:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>
>>>>>>> Thanks for the detailed update, Amogh!
>>>>>>>
>>>>>>> I think that the view spec is close, but it will likely change as we
>>>>>>> get the implementation done. Netflix has a working implementation that
>>>>>>> we're basing this off of, so I think it won't change significantly, but 
>>>>>>> we
>>>>>>> don't want to vote to adopt the spec before it's ready.
>>>>>>>
>>>>>>> The next steps are to get the implementation done, which I think is
>>>>>>> a high priority now that branches and tags are being released in 1.2.0.
>>>>>>> We'd love more help implementing this, and especially help implementing 
>>>>>>> the
>>>>>>> spec in other languages because that will definitely catch things we
>>>>>>> haven't seen yet.
>>>>>>>
>>>>>>> Let's talk about this at the sync tomorrow.
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Tue, Mar 14, 2023 at 9:41 AM Jahagirdar, Amogh
>>>>>>> <jaham...@amazon.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Micah,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks for starting this discussion. I also am not aware of a
>>>>>>>> formal vote thread for the View Specification. I’m open to starting a
>>>>>>>> formal vote if one wasn’t already done to ensure we have made all the 
>>>>>>>> right
>>>>>>>> considerations across the community.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Currently, we are working on the view metadata parser
>>>>>>>> implementations based on the spec as it’s defined today
>>>>>>>> <https://iceberg.apache.org/view-spec/>. A few components have
>>>>>>>> already been merged but the root level metadata parser is still WIP. 
>>>>>>>> So if
>>>>>>>> there’s more discussion on aspects of the spec which we want to 
>>>>>>>> revisit, we
>>>>>>>> should do that before exposing the root level metadata parser.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> For blockers to adoption, there’s a few implementations we need to
>>>>>>>> complete after we conclude on the spec:
>>>>>>>>
>>>>>>>> 1.) Completing the parser implementations. Right now, the view version
>>>>>>>> parser implementation <https://github.com/apache/iceberg/pull/6861>
>>>>>>>> is still in progress, and after this we’d do the overall view metadata
>>>>>>>> parser, and expose that in the library.
>>>>>>>> 2.) A “BaseMetastoreViewCatalog” or something of that sort where
>>>>>>>> common view operation implementations can be performed for metastore 
>>>>>>>> based
>>>>>>>> catalogs.
>>>>>>>>
>>>>>>>> Then I think the community can progress on the different view
>>>>>>>> catalog implementations and engine integration to further help 
>>>>>>>> adoption.
>>>>>>>>
>>>>>>>>
>>>>>>>> Open to suggestions and ideas here!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Amogh Jahagirdar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From: *Micah Kornfield <emkornfi...@gmail.com>
>>>>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>>>>>> *Date: *Monday, March 13, 2023 at 6:17 PM
>>>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>>>>> *Subject: *[EXTERNAL] Current Status of View Specification
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *CAUTION*: This email originated from outside of the organization.
>>>>>>>> Do not click links or open attachments unless you can confirm the 
>>>>>>>> sender
>>>>>>>> and know the content is safe.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Iceberg Dev,
>>>>>>>>
>>>>>>>> I see the spec has been checked in but I couldn't find a vote
>>>>>>>> thread ratifying it as a final V1 version  (I might have been using the
>>>>>>>> wrong search terms) but for other additions of things like puffin it 
>>>>>>>> seemed
>>>>>>>> like there was an official vote.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Should the spec be considered finalized as a V1 version now?  Was
>>>>>>>> there a vote held?  Will there be one?  Are there any blockers to 
>>>>>>>> adoption?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Micah
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Tabular
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> John Zhuge
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

-- 
Ryan Blue
Tabular

Re: Current Status of View Specification

Reply via email to