Looks like it wasn’t clear what I meant for the 3 categories, so I’ll
be more specific:
* /Separate table and view/: this option is to have the objects that
we have today, with extra metadata. Commit processes are separate:
committing to the table doesn’t alter the view and committing to
the view doesn’t change the table. However, changing the view can
make it so the table is no longer useful as a materialization.
* /A combination of a view and a table/: in this option, the table
metadata and view metadata are the same as in the first option. The
difference is that the commit process combines them, either by
embedding a table metadata location in view metadata or by
tracking both in the same catalog reference.
* /A new metadata type/: this option is where we define a new
metadata object that has view attributes, like SQL
representations, along with table attributes, like partition specs
and snapshots.
Hopefully this is clear because I think much of the confusion is
caused by different definitions.
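To make the distinction concrete, here is a rough sketch of what a
catalog would track in each case (the names are illustrative only, not
spec):

// 1. Separate table and view: two independent catalog entries with
//    independent commit processes.
record ViewEntry(String viewMetadataLocation) {}
record TableEntry(String tableMetadataLocation) {}

// 2. A combination of a view and a table: the same two metadata files,
//    but a single catalog entry (or the view metadata embedding the
//    table metadata location), so one commit swaps both pointers.
record CombinedEntry(String viewMetadataLocation, String tableMetadataLocation) {}

// 3. A new metadata type: one metadata file that carries both view
//    attributes (SQL representations) and table attributes (partition
//    specs, snapshots).
record MaterializedViewEntry(String mvMetadataLocation) {}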
The LoadTableResponse having an optional metadata-location field
implies that the object in the catalog no longer needs to hold a
metadata file pointer.
The REST protocol has not removed the requirement for a metadata file,
so I’m going to keep focused on the MV design options.
When we say an MV can be a “new metadata type”, it does not mean it
needs to define a completely brand-new structure for the metadata
content.
I’m making a distinction between separate metadata files for the table
and the view and a combined metadata object, as above.
We can define an “Iceberg MV” to be an object in a catalog, which
has 1 table metadata file pointer, and 1 view metadata file pointer
This is the option I am referring to as a “combination of a view and a
table”.
So to review my initial email, I don’t see a reason why a combined
view and table is advantageous, either implemented by having a catalog
reference with two metadata locations or embedding a table metadata
location in view metadata. This would cause unnecessary dependence
between the view and table in catalogs. I guess there’s an argument
that you could load both table and view metadata locations at the same
time. That hardly seems worth the trouble given the recent issues with
adding views to the JDBC catalog.
I also think that once we decide on structure, we can make it possible
for REST catalog implementations to do smart things, in a way that
doesn’t put additional requirements on the underlying catalog store.
For instance, we could specify how to send additional objects in a
LoadViewResult, in case the catalog wants to pre-fetch table metadata.
I think these optimizations are a later addition, after we define the
relationship between views and tables.
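To be concrete, a pre-fetching response could eventually look something
like this sketch (purely hypothetical; none of the extra fields exist in
the REST spec today):

// existing LoadViewResult fields plus an optional, catalog-controlled
// pre-fetch of the storage table (hypothetical fields marked below)
record LoadViewResult(
    String metadataLocation,              // exists today
    ViewMetadata metadata,                // exists today
    Map<String, String> config,           // exists today
    String storageTableMetadataLocation,  // hypothetical pre-fetch field
    TableMetadata storageTableMetadata) { // hypothetical pre-fetch field
}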
Jack, it sounds like you’re the proponent of a combined table and view
(rather than a new metadata spec for a materialized view). What is the
main motivation? It seems like you’re convinced of that approach, but
I don’t understand the advantage it brings.
Ryan
On Thu, Feb 29, 2024 at 12:26 PM Szehon Ho <szehon.apa...@gmail.com>
wrote:
Hi
Yes I mostly agree with the assessment. To clarify a few minor
points.
is a materialized view a view and a separate table, a
combination of the two (i.e. commits are combined), or a new
metadata type?
For 'new metadata type', I mostly have in mind Jack's initial proposal
of a new Catalog MV object that has two references (ViewMetadata +
TableMetadata).
The arguments that I see for a combined materialized view
object are:
* Regular views are separate, rather than being tables with
SQL and no data, so it would be inconsistent (“Iceberg view
is just a table with no data but with representations
defined. But we did not do that.”)
* Materialized views are different objects in DDL
* Tables may be a superset of functionality needed for
materialized views
* Tables are not typically exposed to end users — but this
isn’t required by the separate view and table option
For completeness, there seem to be a few additional ones
(mentioned in the Slack and above messages).
* Lack of spec change (to ViewMetadata). But as Jack says, it is
a spec change (i.e., to catalogs)
* A single call to get the View's StorageTable (versus two calls)
* A more natural API: no opportunity for the user to call
Catalog.dropTable() or renameTable() on the storage table
*Thoughts:* I think the long discussion sessions we had on Slack
were fruitful for me, as seeing the API clarified some things.
I was initially more in favor of MV being a new metadata type
(TableMetadata + ViewMetadata). But seeing most of the MV
operations end up being ViewCatalog or Catalog operations, I am
starting to think API-wise that it may not align with the new
metadata type (unless we define an MVCatalog and MV REST endpoints,
which would then be boilerplate wrappers).
Initially, one question I had for the option 'a view and a separate
table' was how to make this table reference (a metadata.json pointer or a
catalog reference). In the previous option, we had a precedent of
Catalog references to Metadata, but not pointers between
Metadatas. I initially saw the proposed Catalog's TableIdentifier
pointer as 'polluting' catalog concerns in ViewMetadata. (I saw
Catalog and ViewCatalog as a layer above TableMetadata and
ViewMetadata). But I think Dan in the Slack made a fair point
that ViewMetadata is already tightly bound with a Catalog. In
this case, I think this approach does have its merits as well in
aligning the Catalog APIs with the metadata.
Thanks
Szehon
On Thu, Feb 29, 2024 at 5:45 AM Jan Kaul
<jank...@mailbox.org.invalid> wrote:
Hi all,
I would like to provide my perspective on the question of what
a materialized view is and elaborate on Jack's recent proposal
to view a materialized view as a catalog concept.
Firstly, let's look at the role of the catalog. Every entity
in the catalog has a *unique identifier*, and the catalog
provides methods to create, load, and update these entities.
An important thing to note is that the catalog methods exhibit
two different behaviors: the *create and load methods deal
with the entire entity*, while the *update(commit) method only
deals with partial changes* to the entities.
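In API terms, a rough sketch using the existing table and view APIs:

// create/load deal with the entire entity:
Table table = catalog.loadTable(TableIdentifier.of("db1", "t1"));
View view = viewCatalog.loadView(TableIdentifier.of("db1", "v1"));

// update(commit) only carries partial changes:
table.updateProperties().set("some-key", "some-value").commit();
view.updateProperties().set("some-key", "some-value").commit();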
In the context of our current discussion, materialized view
(MV) metadata is a union of view and table metadata. The fact
that the update method deals only with partial changes enables us
to *reuse the existing methods for updating tables and
views*. For updates, we don't have to define what
constitutes an entire materialized view. Changes to a
materialized view targeting the properties related to the view
metadata could use the update(commit) view method. Similarly,
changes targeting the properties related to the table metadata
could use the update(commit) table method. This is great news
because we don't have to redefine view and table commits
(requirements, updates).
This is shown in the fact that Jack uses the same operation to
update the storage table for Options 1 and 3:
// REST: POST /namespaces/db1/tables/mv1?materializedView=true
// non-REST: update JSON files at table_metadata_location
storageTable.newAppend().appendFile(...).commit();
The open question is *whether the create and load methods
should treat the properties that constitute the MV metadata as
two entities (View + Table) or one entity (new MV object)*.
This is all part of Jack's proposal, where Option 1 proposes a
new MV object, and Option 3 proposes two separate entities.
The advantage of Option 1 is that it doesn't require two
operations to load the metadata. On the other hand, the
advantage of Option 3 is that no new operations or catalogs
have to be defined.
In my opinion, defining a new representation for materialized
views (Option 1) is generally the cleaner solution. However, I
see a path where we could first introduce Option 3 and still
have the possibility to transition to Option 1 if needed. The
great thing about Option 3 is that it only requires minor
changes to the current spec and is mostly an implementation detail.
Therefore, I would propose small additions to Jack's Option 3
that only introduce changes to the spec that are not specific
to materialized views. The idea is to introduce boolean
properties to be set on the creation of the view and the
storage table that indicate that they belong to a materialized
view. The view property "materialized" is set to "true" for a
MV and "false" for a regular view. And the table property
"storage_table" is set to "true" for a storage table and
"false" for a regular table. The absence of these properties
indicates a regular view or table.
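As a rough sketch (assuming the same builder APIs we already have for
regular views and tables, and an illustrative name for the storage
table), creating such a pair could look like:

// assuming `catalog` is an existing Catalog that also implements ViewCatalog
ViewCatalog viewCatalog = (ViewCatalog) catalog;

Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.required(2, "cnt", Types.LongType.get()));

// create the view side of the MV, flagged with the proposed property
View mv = viewCatalog.buildView(TableIdentifier.of("db1", "mv1"))
    .withSchema(schema)
    .withDefaultNamespace(Namespace.of("db1"))
    .withQuery("spark", "SELECT id, count(*) AS cnt FROM db1.base GROUP BY id")
    .withProperty("materialized", "true")
    .create();

// create the storage table, flagged with the proposed property
// (the storage table identifier here is only illustrative)
Table storageTable = catalog.createTable(
    TableIdentifier.of("db1", "mv1_storage"),
    schema,
    PartitionSpec.unpartitioned(),
    Map.of("storage_table", "true"));

Loading and updating would then reuse the existing APIs unchanged: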
ViewCatalog viewCatalog = (ViewCatalog) catalog;
// REST: GET /namespaces/db1/views/mv1
// non-REST: load JSON file at metadata_location
View mv = viewCatalog.loadView(TableIdentifier.of("db1", "mv1"));
// REST: GET /namespaces/db1/tables/mv1
// non-REST: load JSON file at table_metadata_location if present
Table storageTable = mv.storageTable();
// REST: POST /namespaces/db1/tables/mv1
// non-REST: update JSON file at table_metadata_location
storageTable.newAppend().appendFile(...).commit();
We could then introduce a new requirement for views and tables
called "AssertProperty" which could make sure to only perform
updates that are in line with materialized views. The
additional requirement can be seen as a general extension
which does not need to be changed if we decide to go with
Option 1 in the future.
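As a sketch of the idea (not tied to any existing requirement class),
the check itself would be very small:

// hypothetical "AssertProperty" requirement: fail the commit unless the
// current metadata still carries the expected property value,
// e.g. key = "materialized", expected = "true"
static void assertProperty(Map<String, String> currentProperties, String key, String expected) {
  String actual = currentProperties.get(key);
  if (!expected.equals(actual)) {
    throw new IllegalStateException(
        "Requirement failed: expected " + key + "=" + expected + ", found " + actual);
  }
}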
Let me know what you think.
Best wishes,
Jan
On 29.02.24 04:09, Walaa Eldin Moustafa wrote:
Thanks Ryan for the insights. I agree that reusing existing
metadata definitions and minimizing spec changes are very
important. This also minimizes spec drift (between
materialized views and views spec, and between materialized
views and tables spec), and simplifies the implementation.
In an effort to take the discussion forward with concrete
design options based on an end-to-end implementation, I have
prototyped the implementation (and added Spark support) in
this PR https://github.com/apache/iceberg/pull/9830. I hope
it helps us reach convergence faster. More details about some
of the design options are discussed in the description of the
PR.
Thanks,
Walaa.
On Wed, Feb 28, 2024 at 6:20 PM Ryan Blue <b...@tabular.io>
wrote:
I mean separate table and view metadata that is somehow
combined through a commit process. For instance, keeping
a pointer to a table metadata file in a view metadata
file or combining commits to reference both. I don't see
the value in either option.
On Wed, Feb 28, 2024 at 5:05 PM Jack Ye
<yezhao...@gmail.com> wrote:
Thanks Ryan for the help to trace back to the root
question! Just a clarification question regarding
your reply before I reply further: what exactly does
the option "a combination of the two (i.e. commits
are combined)" mean? How is that different from "a
new metadata type"?
-Jack
On Wed, Feb 28, 2024 at 2:10 PM Ryan Blue
<b...@tabular.io> wrote:
I’m catching up on this conversation, so
hopefully I can bring a fresh perspective.
Jack already pointed out that we need to start
from the basics and I agree with that. Let’s
remove voting at this point. Right now is the
time for discussing trade-offs, not lining up and
taking sides. I realize that wasn’t the intent
with adding a vote, but that’s almost always the
result. It’s too easy to use it as a stand-in for
consensus and move on prematurely. I get the
impression from the swirl in Slack that
discussion has moved ahead of agreement.
We’re still at the most basic question: is a
materialized view a view and a separate table, a
combination of the two (i.e. commits are
combined), or a new metadata type?
For now, I’m ignoring whether the “separate
table” is some kind of “system table” (meaning
hidden?) or if it is exposed in the catalog.
That’s a later choice (already pointed out) and,
I suspect, it should be delegated to catalog
implementations.
To simplify this a little, I think that we can
eliminate the option to combine table and view
commits. I don’t think there is a reason to
combine the two. If separate, a table would track
the view version used along with freshness
information for referenced tables. If the table
is automatically skipped when the version no
longer matches the view, then no action needs to
happen when a view definition changes. Similarly,
the table can be updated independently without
needing to also swap view metadata. This also
aligns with the idea from the original doc that
there can be multiple materialization tables for
a view. Each should operate independently unless
I’m missing something.
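As a rough sketch of that check (the property name is illustrative,
not from any spec):

// the storage table records which view version it was written against;
// readers skip it when that no longer matches the current view version
static boolean materializationIsCurrent(View view, Table storageTable) {
  String written = storageTable.properties()
      .getOrDefault("materialization.view-version-id", "-1");
  return Integer.parseInt(written) == view.currentVersion().versionId();
}
// when this returns false, plan the query from the view's SQL instead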
I don’t think the last paragraph’s conclusion is
contentious so I’ll move on, but please stop here
and reply if you disagree!
That leaves the main two options, a view and a
separate table linked by metadata, or, combined
materialized view metadata.
As the doc notes, the separate view and table
option is simpler because it reuses existing
metadata definitions and falls back to simple
views. That is a significantly smaller spec and
small is very, very important when it comes to
specs. I think that the argument for a new
definition of a materialized view needs to
overcome this disadvantage.
The arguments that I see for a combined
materialized view object are:
* Regular views are separate, rather than being
tables with SQL and no data, so it would be
inconsistent (“Iceberg view is just a table
with no data but with representations
defined. But we did not do that.”)
* Materialized views are different objects in DDL
* Tables may be a superset of functionality
needed for materialized views
* Tables are not typically exposed to end users
— but this isn’t required by the separate
view and table option
Am I missing any arguments for combined metadata?
Ryan
--
Ryan Blue
Tabular