Hi Walaa,
As you pointed out, design 1 in the GitHub issue
<https://github.com/apache/iceberg/issues/6420>, with a common view and a
linked storage table, seems the most promising way forward. I have
therefore put together an initial proposal for a specification.
I realize that my proposal deviates from the design Trino is using at
the moment. I want to point out that this is just a proposal and is open
for discussion. The changes I suggest are twofold:
1. Change
Register only the view in the Iceberg catalog. Currently both the common
view and the storage table are registered in the catalog. To ensure
atomic transactions for the storage table I propose the following commit
procedure:
Writers changing the storage table state create table metadata files
optimistically, assuming that the current metadata location will not
change before the writer’s commit. The commit is performed in two steps.
First, the writer optimistically creates a new view metadata file and
points the storage table pointer at the new location. Second, the new
view metadata file is committed by swapping the view’s metadata file
pointer in the metastore from the base location to the new location. The
commit is considered successful only when the second step succeeds.
2. Change
A different format for storing the refresh information. The proposed
metadata captures more information and allows easier additions in the
future.
I have summarized the changes in the proposal at the end of the email. I
would be glad if you could have a look.
Best wishes,
Jan
Design 1
Overview
MVs (materialized views) are realized as a combination of an Iceberg
common view with an underlying storage table. The definition of the
materialized view is stored in the common view; the precomputed data is
stored in an Iceberg table called the storage table. The information
required for refresh operations is stored as a property in the storage
table. All changes to either the view or the storage table state create
a new view metadata file and completely replace the old view metadata
file using an atomic swap. As with Iceberg tables and views, this atomic
swap is delegated to the metastore that tracks tables and views by name.
Metadata Location
An atomic swap of one view metadata file for another provides the basis for
making atomic changes. Readers use the version of the view that was
current when
they loaded the view metadata and are not affected by changes until they
refresh
and pick up a new metadata location.
Writers distinguish between changing the view state and changing the
storage table state.
Writers changing the view state create view metadata files
optimistically, assuming that the current metadata location will not
change before the writer’s commit. Once a writer has created an update,
it commits by swapping the view’s metadata file pointer from the base
location to the new location.
Writers changing the storage table state create table metadata files
optimistically, assuming that the current metadata location will not
change before the writer’s commit. The commit is performed in two steps.
First, the writer optimistically creates a new view metadata file and
points the storage table pointer at the new location. Second, the new
view metadata file is committed by swapping the view’s metadata file
pointer in the metastore from the base location to the new location. The
commit is successful only when the second step succeeds.
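To make the two-step procedure concrete, here is a minimal sketch of the proposed storage table commit. It assumes a metastore that exposes an atomic compare-and-swap on the view’s metadata pointer; the class and function names (Metastore, commit_storage_table_change, etc.) are hypothetical and not part of any Iceberg API.

```python
class CommitFailedException(Exception):
    """Raised when a concurrent writer moved the view pointer first."""


class Metastore:
    """Toy in-memory metastore tracking a single view's metadata pointer."""

    def __init__(self, view_metadata_location):
        self.view_metadata_location = view_metadata_location

    def swap_view_pointer(self, base_location, new_location):
        # Atomic compare-and-swap: succeeds only if no other writer
        # committed between the optimistic write and this swap.
        if self.view_metadata_location != base_location:
            raise CommitFailedException("concurrent commit detected")
        self.view_metadata_location = new_location


def commit_storage_table_change(metastore, base_view_location,
                                write_table_metadata, write_view_metadata):
    # Optimistically write the new storage table metadata file.
    table_metadata_location = write_table_metadata()
    # Step 1: write a new view metadata file whose storage table
    # pointer references the new table metadata location.
    new_view_location = write_view_metadata(table_metadata_location)
    # Step 2: commit by atomically swapping the view's metadata file
    # pointer in the metastore from the base location to the new one.
    # The commit is successful only if this swap succeeds.
    metastore.swap_view_pointer(base_view_location, new_view_location)
    return new_view_location
```

A writer that loses the race in step 2 would re-read the current view metadata and retry, as with ordinary Iceberg table commits.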
Specification (DRAFT!)
The metadata of the materialized view comprises four parts. The view
metadata and the storage table metadata constitute one part each. Since
not all information can be stored inside the view and storage table
metadata, two additional parts are introduced in the |properties| field
of the view and the storage table metadata, respectively.
Materialized view metadata stored in the common view properties
One part of the materialized view metadata is stored inside the |properties|
field of the common view. The metadata is stored in JSON format under
the key
“materialized_view_metadata”. The materialized view metadata stored in
the view
has the following schema.
v1 Field Name Description
/required/ *|storage-table-location|* Path to the metadata file of the
storage table.
/optional/ *|allow-stale-data|* Boolean that defines the query engine
behavior in case the base tables indicate the precomputed data isn’t
fresh. If set to false, a refresh operation has to be performed before
the query results are returned. If set to true, the data in the storage
table is returned without performing a refresh operation. If the field
is not set, defaults to false.
/optional/ *|refresh-strategy|* Possible values are |full| (full storage
table refresh) and |incremental| (incremental table refresh). If the
field is not set, defaults to |full|.
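For illustration, the metadata stored in the view’s |properties| could look like the following (the location and values are made up). Note that since Iceberg |properties| maps are string-to-string, this JSON document would presumably be serialized as a single string value under the key.

```json
{
  "materialized_view_metadata": {
    "storage-table-location": "s3://bucket/warehouse/db/mv_storage/metadata/v3.metadata.json",
    "allow-stale-data": false,
    "refresh-strategy": "full"
  }
}
```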
Materialized view metadata stored in the storage table properties
Another part of the materialized view metadata is stored inside the
|properties|
field of the storage table. The metadata is stored in JSON format under
the key
“materialized_view_metadata”. The materialized view metadata stored in the
storage table has the following schema.
v1 Field Name Description
/required/ *|refreshes|* A list of refresh operations.
/required/ *|current-refresh-id|* ID of the last refresh operation
that defines the current state of the data files.
*Refreshes*
Refresh information is stored as a list of |refresh operation| records. Each
|refresh operation| has the following structure:
v1 Field Name Description
/required/ *|refresh-id|* ID of the refresh operation.
/required/ *|version-id|* Version id of the materialized view when the
refresh operation was performed.
/required/ *|base-tables|* A list of |base-table| records.
/optional/ *|sequence-number|* Sequence number of the snapshot that
contains the refreshed data files.
Refreshes could be handled in different ways. In the normal case the
refresh list could consist of only one entry, which gets overwritten on
every refresh operation. If “timetravel” is enabled for the materialized
view, a new |refresh operation| record gets inserted into the list on
every refresh. Together with the |sequence-number| field, this could be
used to track the evolution of the data files over the refresh history.
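As an illustration, the metadata stored in the storage table’s |properties| could look like the following (all IDs and values are made up):

```json
{
  "materialized_view_metadata": {
    "current-refresh-id": 2,
    "refreshes": [
      {
        "refresh-id": 2,
        "version-id": 7,
        "sequence-number": 14,
        "base-tables": [
          {
            "type": "iceberg-metastore",
            "identifier": "db.events",
            "snapshot-reference": 5131018120015449000
          }
        ]
      }
    ]
  }
}
```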
*Base table*
A |base table| record can have different forms based on the common field
“type”.
The other fields don’t necessarily have to be the same.
Iceberg-Metastore
v1 Field Name Description
/required/ *|type|* type="iceberg-metastore"
/required/ *|identifier|* Identifier in the SQL expression.
/required/ *|snapshot-reference|* Snapshot id of the base table when
the refresh operation was performed.
/optional/ *|properties|* A string to string map of base table
properties. Could be used to specify a different metastore.
Iceberg-FileSystem
v1 Field Name Description
/required/ *|type|* type="iceberg-filesystem"
/required/ *|identifier|* Identifier in the SQL expression.
/required/ *|location|* Path to the directory of the base table.
/required/ *|snapshot-reference|* Snapshot id of the base table when
the refresh operation was performed.
/optional/ *|properties|* A string to string map of base table
properties. Could be used for a different storage system.
DeltaLake-FileSystem (optional)
v1 Field Name Description
/required/ *|type|* type="deltalake-filesystem"
/required/ *|identifier|* Identifier in the SQL expression.
/required/ *|location|* Path to the directory of the base table.
/required/ *|snapshot-reference|* Delta table version of the base
table when the refresh operation was performed.
/optional/ *|properties|* A string to string map of base table
properties. Could be used for a different storage system.
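To show how the three forms differ, here is one example record of each type inside a |base-tables| list (identifiers, paths, and snapshot references are all made up):

```json
[
  {
    "type": "iceberg-metastore",
    "identifier": "db.orders",
    "snapshot-reference": 8627312930871839000
  },
  {
    "type": "iceberg-filesystem",
    "identifier": "db.customers",
    "location": "s3://bucket/warehouse/db/customers",
    "snapshot-reference": 1849348892873498000
  },
  {
    "type": "deltalake-filesystem",
    "identifier": "db.shipments",
    "location": "s3://bucket/delta/shipments",
    "snapshot-reference": 42
  }
]
```

The common |type| field lets a reader dispatch on the record form, while |snapshot-reference| carries an Iceberg snapshot ID for the first two types and a Delta table version for the third.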