Hi Walaa,

As you pointed out, design 1 in the GitHub issue <https://github.com/apache/iceberg/issues/6420>, with a common view and a linked storage table, seems to be the most promising way forward. I have therefore put together an initial proposal for a specification.

I realize that my proposal deviates from the design Trino is using at the moment. I want to point out that this is just a proposal and is open for discussion. The changes I suggest are twofold:

1. Change

Register only the view in the Iceberg catalog. Currently, both the common view and the storage table are registered in the catalog. To ensure atomic commits for the storage table, I propose the following commit procedure:

Writers changing the storage table state create table metadata files
optimistically, assuming that the current metadata location will not be changed
before the writer’s commit. The commit is performed in two steps. First, the
writer optimistically creates a new view metadata file and changes the storage
table pointer to the new location. Second, the new view metadata file gets
committed by swapping the view’s metadata file pointer in the metastore from the base location to its new location. The commit is only considered successful when the second
step succeeds.

2. Change

A different format for storing the refresh information. The proposed metadata captures more information and allows easier additions in the future.

I have summarized the changes in the proposal at the end of the email. I would be glad if you could have a look.

Best wishes,

Jan


 Design 1


   Overview

Materialized views (MVs) are realized as a combination of an Iceberg common view
with an underlying storage table. The definition of the materialized view is
stored in the common view. The precomputed data is stored in an Iceberg table called the storage table. The information required for refresh operations is stored
as a property in the storage table. All changes to either the view or the
storage table state create a new view metadata file and completely replace the old view metadata file using an atomic swap. As with Iceberg tables and views, the atomic swap is delegated to the metastore that tracks tables and views by name.


     Metadata Location

An atomic swap of one view metadata file for another provides the basis for
making atomic changes. Readers use the version of the view that was current when they loaded the view metadata and are not affected by changes until they refresh
and pick up a new metadata location.

Writers distinguish between changing the view state and changing the storage table state.

Writers changing the view state create view metadata files optimistically,
assuming that the current metadata location will not be changed before the
writer’s commit. Once a writer has created an update, it commits by swapping the
view’s metadata file pointer from the base location to the new location.
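
To illustrate, here is a minimal sketch of this optimistic commit in Python. The catalog calls and helper names (current_metadata_location, swap_metadata_location, write_new_metadata_file) are hypothetical placeholders for whatever interface the catalog implementation provides, not an existing Iceberg API:

    class CommitFailedError(Exception):
        """Raised when the metadata pointer changed since the writer's base version."""

    def commit_view_change(catalog, view_name, write_new_metadata_file):
        # Remember the metadata location this writer based its changes on.
        base_location = catalog.current_metadata_location(view_name)

        # Optimistically write a new view metadata file containing the update.
        new_location = write_new_metadata_file(base_location)

        # Atomic swap in the metastore: succeeds only if the pointer still points
        # at base_location, i.e. no concurrent writer committed in the meantime.
        if not catalog.swap_metadata_location(view_name, base_location, new_location):
            raise CommitFailedError("metadata location changed, retry against the new base")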

Writers changing the storage table state create table metadata files
optimistically, assuming that the current metadata location will not be changed
before the writer’s commit. The commit is performed in two steps. First, the
writer optimistically creates a new view metadata file and changes the storage
table pointer to the new location. Second, the new view metadata file gets
committed by swapping the view’s metadata file pointer in the metastore from the base location to its new location. The commit is only considered successful when the second
step succeeds.
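
A sketch of the two-step storage table commit, continuing the hypothetical interface from the sketch above. The important property is that only step two, the pointer swap on the view, makes the change visible:

    def commit_storage_table_change(catalog, view_name,
                                    write_table_metadata_file, write_view_metadata_file):
        base_view_location = catalog.current_metadata_location(view_name)

        # Optimistically write the new storage table metadata file.
        new_table_location = write_table_metadata_file()

        # Step 1: write a new view metadata file whose storage-table-location
        # points at the newly written table metadata file.
        new_view_location = write_view_metadata_file(
            base_view_location, storage_table_location=new_table_location)

        # Step 2: atomically swap the view's metadata pointer in the metastore.
        # The storage table change becomes visible only if this step succeeds.
        if not catalog.swap_metadata_location(view_name, base_view_location, new_view_location):
            raise CommitFailedError("view metadata location changed, retry the refresh")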


   Specification (DRAFT!)

The metadata of the materialized view consists of four parts. The view and the storage table metadata constitute one part each. Since not all information can be stored inside the view and storage table metadata, two additional parts are introduced in the |properties| field of the view and storage table metadata
respectively.


     Materialized view metadata stored in the common view properties

One part of the materialized view metadata is stored inside the |properties|
field of the common view. The metadata is stored in JSON format under the key “materialized_view_metadata”. The materialized view metadata stored in the view
has the following schema.

v1      Field Name      Description
/required/      *|storage-table-location|*      Path to the metadata file of the storage table.
/optional/      *|allow-stale-data|*    Boolean that defines the query engine behavior in case the base tables indicate the precomputed data isn’t fresh. If set to FALSE, a refresh operation has to be performed before the query results are returned. If set to TRUE, the data in the storage table gets returned without performing a refresh operation. If the field is not set, defaults to FALSE.
/optional/      *|refresh-strategy|*    Possible values are: |full|: full storage table refresh, |incremental|: incremental table refresh. If the field is not set, defaults to |full|.
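
For illustration, the view’s |properties| entry could then look as follows (shown as a Python dict mirroring the JSON; the metadata path is made up):

    import json

    view_properties = {
        "materialized_view_metadata": json.dumps({
            # Path to the metadata file of the storage table.
            "storage-table-location": "s3://bucket/warehouse/db/mv/storage/metadata/00002-ab.metadata.json",
            # The query engine must refresh before answering if base tables changed.
            "allow-stale-data": False,
            # Full storage table refresh.
            "refresh-strategy": "full",
        })
    }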


     Materialized view metadata stored in the storage table properties

Another part of the materialized view metadata is stored inside the |properties| field of the storage table. The metadata is stored in JSON format under the key
“materialized_view_metadata”. The materialized view metadata stored in the
storage table has the following schema.

v1      Field Name      Description
/required/      *|refreshes|*   A list of refresh operations.
/required/      *|current-refresh-id|*  ID of the last refresh operation that defines the current state of the data files.


       *Refreshes*

Refresh information is stored as a list of |refresh operation| records. Each
|refresh operation| has the following structure:

v1      Field Name      Description
/required/      *|refresh-id|*  ID of the refresh operation.
/required/      *|version-id|*  Version id of the materialized view when the refresh operation was performed.
/required/      *|base-tables|*         A list of |base-table| records.
/optional/      *|sequence-number|*     Sequence number of the snapshot that contains the refreshed data files.

Refreshes could be handled in different ways. For a normal execution, the refresh list could consist of only one entry, which gets overwritten on every refresh
operation. If “timetravel” is enabled for the materialized view, a new
|refresh operation| record gets inserted into the list on every refresh.
Together with the |sequence-number| field, this could be used to track the
evolution of data files over the refresh history.
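
Putting the pieces together, the storage table’s |properties| entry could look like this (again a Python dict mirroring the JSON; all identifiers and numbers are invented, and the |base-table| record forms are specified in the next section):

    import json

    storage_table_properties = {
        "materialized_view_metadata": json.dumps({
            # Refresh operation that defines the current state of the data files.
            "current-refresh-id": 2,
            "refreshes": [
                {
                    "refresh-id": 2,
                    # Version of the materialized view at the time of the refresh.
                    "version-id": 1,
                    # Sequence number of the snapshot with the refreshed data files.
                    "sequence-number": 7,
                    # Base table state at refresh time; record forms are defined below.
                    "base-tables": [
                        {
                            "type": "iceberg-metastore",
                            "identifier": "db.orders",
                            "snapshot-reference": 5712384738212,
                        }
                    ],
                }
            ],
        })
    }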


       *Base table*

A |base table| record can take different forms, distinguished by the common field “type”.
The remaining fields may differ between types.


       Iceberg-Metastore

v1      Field Name      Description
/required/      *|type|*        type=”iceberg-metastore”
/required/      *|identifier|*  Identifier in the SQL expression.
/required/      *|snapshot-reference|*  Snapshot id of the base table when the refresh operation was performed.
/optional/      *|properties|*  A string to string map of base table properties. Could be used to specify a different metastore.


       Iceberg-FileSystem

v1      Field Name      Description
/required/      *|type|*        type=”iceberg-filesystem”
/required/      *|identifier|*  Identifier in the SQL expression.
/required/      *|location|*    Path to the directory of the base table.
/required/      *|snapshot-reference|*  Snapshot id of the base table when the refresh operation was performed.
/optional/      *|properties|*  A string to string map of base table properties. Could be used for a different storage system.


       DeltaLake-FileSystem (optional)

v1      Field Name      Description
/required/      *|type|*        type=”deltalake-filesystem”
/required/      *|identifier|*  Identifier in the SQL expression.
/required/      *|location|*    Path to the directory of the base table.
/required/      *|snapshot-reference|*  Delta table version of the base table when the refresh operation was performed.
/optional/      *|properties|*  A string to string map of base table properties. Could be used for a different storage system.
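
For illustration, one record of each type might look as follows (Python dicts mirroring the JSON; identifiers, locations, and snapshot references are invented):

    # Base table resolved through an Iceberg catalog/metastore.
    metastore_base_table = {
        "type": "iceberg-metastore",
        "identifier": "db.orders",
        "snapshot-reference": 5712384738212,
    }

    # Iceberg base table addressed directly on the file system.
    filesystem_base_table = {
        "type": "iceberg-filesystem",
        "identifier": "db.customers",
        "location": "s3://bucket/warehouse/db/customers",
        "snapshot-reference": 8301273364911,
    }

    # Optional: Delta Lake base table, referenced by its table version.
    deltalake_base_table = {
        "type": "deltalake-filesystem",
        "identifier": "db.events",
        "location": "s3://bucket/warehouse/db/events",
        "snapshot-reference": 42,
    }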
