Re: Current Status of View Specification

Micah Kornfield Tue, 09 May 2023 10:55:07 -0700

>
> Yeah, I think we have to go with option 2. Otherwise you'd be embedding
> catalog config in views and it starts getting really ugly.
>
> I don't really see a path where we don't assume that you've used
> consistent naming across engines. Not using consistent names would be
> weird, and using names that actually conflict would be even worse. If you
> want reliability you have to use the same names for views, as well as
> people.
>


I think this is reasonable (I put one last argument below, but that is
mostly to ask a follow-up question on the intent of what is being
captured).

It would be bonkers to see catalogs mean different things in different
> places.


I agree it wouldn't be ideal, but given that this is configuration I
wouldn't be surprised if it gets muddled by at least one user.  The more
likely case is a user for legacy reasons has catalog/namespaces that aren't
compatible across all engines (e.g. multiple namespaces for engines that
only support 1).  But these problems can be solved in other ways by the
user.

Not only that, this is also how columns work. I don't know of anything that
> tracks view columns by their context and ID. If your view is "select a, b
> from table" and "a" is dropped and re-created, you get the new one right?


In general yes the new column would be picked up, however, I believe some
DBs (mostly OLTP DB's like DB2) can mark the view invalid on any schema
change, and depending on configuration require manual intervention to fix
it (so the view would could fail even if "a" was added back).

Also it is interesting to note that the example "select a, b from table",
relies on a SQL engine's inference abilities to avoid fully qualified names
and infer that the columns are from a specific table.  It doesn't seem
completely unreasonable to allow for a similar type of inference that
allows rebinding when the catalog/namespace scope is not specifically
captured if the aim is to capture symbolic representation.  The main
argument I see against this is it allows the semantics of the view to
potentially drift from the originally intended SQL at view creation time.

This raises another question, it seems the focus of the specification is
only on capturing the original SQL and not a SQL representation that
requires less effort to interpret?  For instance, Hive (and some other DBMS
systems) will rewrite view queries to better capture semantics (e.g.
"select * from table", assuming table has 2 columns: "A" and "B", this
would get rewritten as "select table.A, table.B from
<catalog>.<schema>.table"[1].  The existing view specification doesn't seem
to capture enough information to limit the query if the schema on "table"
changes after view creation in a backwards compatible way).

Should there be an "original SQL" and a "normalized SQL" field in the
specification to allow for these rewrites?  Or is the current SQL field
intended to capture the rewritten/normalized version?

Thanks,
Micah

[1] I'm not sure if Hive does full table qualification

On Thu, May 4, 2023 at 4:53 PM Ryan Blue <b...@tabular.io> wrote:

> > The assumption helps if all tables are in the same catalog, and the view
> is also in catalog but this assumption can potentially break down
>
> Yeah, I think we have to go with option 2. Otherwise you'd be embedding
> catalog config in views and it starts getting really ugly.
>
> I don't really see a path where we don't assume that you've used
> consistent naming across engines. Not using consistent names would be
> weird, and using names that actually conflict would be even worse. If you
> want reliability you have to use the same names for views, as well as
> people. It would be bonkers to see catalogs mean different things in
> different places.
>
> Not only that, this is also how columns work. I don't know of anything
> that tracks view columns by their context and ID. If your view is "select
> a, b from table" and "a" is dropped and re-created, you get the new one
> right?
>
> On Tue, May 2, 2023 at 4:40 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Sorry for the late reply.
>>
>> I put together https://github.com/apache/iceberg/pull/7504/files to
>> cover some of the answers here.
>>
>> To follow-up on:
>>
>>> > 1.  If `default-catalog`, `default-namespace` are not specified I
>>> would guess it is assumed that any non-fully qualified tables should be
>>> resolved against the same catalog and namespace that the view is
>>> registered in?
>>> I think that names should be fully qualified if these are not set. I
>>> don't think that we want to have the engine guess based on the location of
>>> the view, but I'd be open to discussion to clarify the spec on this.
>>
>>
>> To clarify here, the use-case I'm considering here is federation across
>> catalogs systems.  The assumption helps if all tables are in the same
>> catalog, and the view is also in catalog but this assumption can
>> potentially break down.  Off the top of my head it seems like there are two
>> options to not rely on implicit behavior:
>> 1.  Include configuration in the view definitions on what system a
>> catalog in a table name refers to (e.g. catalog xyz=HMS endpoint, catalog
>> abc=Iceberg REST endpoint, etc).
>> 2.  Declare this problem out of scope for the view specification and
>> require users to ensure all query engines have their catalogs configured in
>> a uniform way (this seems somewhat painful of an end-user experience).
>>
>> Thanks,
>> Micah
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Mar 15, 2023 at 10:56 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Thanks Ryan for the quick response.
>>>
>>> I think that names should be fully qualified if these are not set. I
>>>> don't think that we want to have the engine guess based on the location of
>>>> the view, but I'd be open to discussion to clarify the spec on this.
>>>
>>> One issue I think with supporting multiple dialects is that there are
>>> varying supports for how many namespaces they support.  And if someone is
>>> designing a workflow to generate SQL in multiple dialects they might not
>>> know apriori how to map from one set of namespaces to the other set.  Being
>>> able to late bind here, to know that all tables references are under the
>>> same namespace could be a work-around for this issue but.
>>>
>>> We have a "dialect" field on the SQL view representation, and we should
>>>> document what goes into that field. I think it makes sense to have a
>>>> version of the dialect as well.
>>>
>>>
>>> Right, the situation I would like to avoid is having different people
>>> choosing slightly different naming schemas for the same engine (i.e.
>>> "Spark", "spark" and "spark-sql").  It might pay to have namespacing scheme
>>> that allows for people freedom to choose whatever they want, but reserving
>>> a prefix for well understood engines in the Iceberg community (e.g. a
>>> prefix of "iceberg-engine." could be reserved as denoting engines that the
>>> community has officially agreed on naming for, so "iceberg-engine.spark" to
>>> continue the analogy)
>>>
>>> I'll try to put together a pull request on the other items as they don't
>>> seem like they warrant more conversation.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Wed, Mar 15, 2023 at 1:56 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> > 1.  If `default-catalog`, `default-namespace` are not specified I
>>>> would guess it is assumed that any non-fully qualified tables should be
>>>> resolved against the same catalog and namespace that the view is registered
>>>> in?
>>>>
>>>> I think that names should be fully qualified if these are not set. I
>>>> don't think that we want to have the engine guess based on the location of
>>>> the view, but I'd be open to discussion to clarify the spec on this.
>>>>
>>>> > 2.  It seems like there are two optional methods for specifying
>>>> documentation for columns in the view, one is field-docs, and one is the
>>>> doc field on schema.  Is the assumption that only one or the other would be
>>>> set?
>>>>
>>>> The field-docs take precedence. The format we're using follows the SQL
>>>> syntax, where you can create a view with top-level column names and field
>>>> docs. We store everything that comes through to avoid losing information.
>>>>
>>>> > 3.  Are field-aliases intended to be further documentation or
>>>> consumed programmatically (i.e. can they be referenced in queries that use
>>>> the view)?
>>>>
>>>> These are also from the SQL syntax and should be applied like field
>>>> comments to rename fields.
>>>>
>>>> > 4.  Are there any thoughts on standardization of dialects, at least
>>>> for common OSS query engines?  Should there be an option to encode dialect
>>>> version encoded at least informationally in case a view uses a SQL feature
>>>> that was only introduced in later versions?
>>>>
>>>> We have a "dialect" field on the SQL view representation, and we should
>>>> document what goes into that field. I think it makes sense to have a
>>>> version of the dialect as well.
>>>>
>>>> On Tue, Mar 14, 2023 at 11:48 PM Micah Kornfield <emkornfi...@gmail.com>
>>>> wrote:
>>>>
>>>>> I apologize if some of these items are already discussed or too
>>>>> obvious but a few questions after reading the spec in a little more 
>>>>> detail:
>>>>> 1.  If `default-catalog`, `default-namespace` are not specified I
>>>>> would guess it is assumed that any non-fully qualified tables should be
>>>>> resolved against the same catalog and namespace that the view is 
>>>>> registered
>>>>> in?
>>>>> 2.  It seems like there are two optional methods for specifying
>>>>> documentation for columns in the view, one is field-docs, and one is the
>>>>> doc field on schema.  Is the assumption that only one or the other would 
>>>>> be
>>>>> set?
>>>>> 3.  Are field-aliases intended to be further documentation or consumed
>>>>> programmatically (i.e. can they be referenced in queries that use the 
>>>>> view)?
>>>>> 4.  Are there any thoughts on standardization of dialects, at least
>>>>> for common OSS query engines?  Should there be an option to encode
>>>>> dialect version encoded at least informationally in case a view uses a SQL
>>>>> feature that was only introduced in later versions?
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> On Tue, Mar 14, 2023 at 8:02 PM John Zhuge <jzh...@apache.org> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Mar 14, 2023 at 7:31 PM Walaa Eldin Moustafa <
>>>>>> wa.moust...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 to get a basic implementation in. Some of the
>>>>>>> discussions/feedback on the API PR slightly changed the API from the
>>>>>>> initial proposed API that probably more closely resembled Netflix's
>>>>>>> implementation. Getting an implementation going on the finalized APIs 
>>>>>>> could
>>>>>>> give some good feedback to the spec or the finalized APIs.
>>>>>>>
>>>>>>> On Tue, Mar 14, 2023 at 7:10 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Thanks for the detailed update, Amogh!
>>>>>>>>
>>>>>>>> I think that the view spec is close, but it will likely change as
>>>>>>>> we get the implementation done. Netflix has a working implementation 
>>>>>>>> that
>>>>>>>> we're basing this off of, so I think it won't change significantly, 
>>>>>>>> but we
>>>>>>>> don't want to vote to adopt the spec before it's ready.
>>>>>>>>
>>>>>>>> The next steps are to get the implementation done, which I think is
>>>>>>>> a high priority now that branches and tags are being released in 1.2.0.
>>>>>>>> We'd love more help implementing this, and especially help 
>>>>>>>> implementing the
>>>>>>>> spec in other languages because that will definitely catch things we
>>>>>>>> haven't seen yet.
>>>>>>>>
>>>>>>>> Let's talk about this at the sync tomorrow.
>>>>>>>>
>>>>>>>> Ryan
>>>>>>>>
>>>>>>>> On Tue, Mar 14, 2023 at 9:41 AM Jahagirdar, Amogh
>>>>>>>> <jaham...@amazon.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hi Micah,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for starting this discussion. I also am not aware of a
>>>>>>>>> formal vote thread for the View Specification. I’m open to starting a
>>>>>>>>> formal vote if one wasn’t already done to ensure we have made all the 
>>>>>>>>> right
>>>>>>>>> considerations across the community.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Currently, we are working on the view metadata parser
>>>>>>>>> implementations based on the spec as it’s defined today
>>>>>>>>> <https://iceberg.apache.org/view-spec/>. A few components have
>>>>>>>>> already been merged but the root level metadata parser is still WIP. 
>>>>>>>>> So if
>>>>>>>>> there’s more discussion on aspects of the spec which we want to 
>>>>>>>>> revisit, we
>>>>>>>>> should do that before exposing the root level metadata parser.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For blockers to adoption, there’s a few implementations we need to
>>>>>>>>> complete after we conclude on the spec:
>>>>>>>>>
>>>>>>>>> 1.) Completing the parser implementations. Right now, the view version
>>>>>>>>> parser implementation
>>>>>>>>> <https://github.com/apache/iceberg/pull/6861> is still in
>>>>>>>>> progress, and after this we’d do the overall view metadata parser, and
>>>>>>>>> expose that in the library.
>>>>>>>>> 2.) A “BaseMetastoreViewCatalog” or something of that sort where
>>>>>>>>> common view operation implementations can be performed for metastore 
>>>>>>>>> based
>>>>>>>>> catalogs.
>>>>>>>>>
>>>>>>>>> Then I think the community can progress on the different view
>>>>>>>>> catalog implementations and engine integration to further help 
>>>>>>>>> adoption.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Open to suggestions and ideas here!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *Micah Kornfield <emkornfi...@gmail.com>
>>>>>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>>>>>>> *Date: *Monday, March 13, 2023 at 6:17 PM
>>>>>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>>>>>> *Subject: *[EXTERNAL] Current Status of View Specification
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *CAUTION*: This email originated from outside of the
>>>>>>>>> organization. Do not click links or open attachments unless you can 
>>>>>>>>> confirm
>>>>>>>>> the sender and know the content is safe.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Iceberg Dev,
>>>>>>>>>
>>>>>>>>> I see the spec has been checked in but I couldn't find a vote
>>>>>>>>> thread ratifying it as a final V1 version  (I might have been using 
>>>>>>>>> the
>>>>>>>>> wrong search terms) but for other additions of things like puffin it 
>>>>>>>>> seemed
>>>>>>>>> like there was an official vote.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Should the spec be considered finalized as a V1 version now?  Was
>>>>>>>>> there a vote held?  Will there be one?  Are there any blockers to 
>>>>>>>>> adoption?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Micah
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> John Zhuge
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>
> --
> Ryan Blue
> Tabular
>

Re: Current Status of View Specification

Reply via email to