If we go with either UUID or Table Identifier + VersionID/SnapshotId in the
refresh state, then this list is fully expanded already.  So, to validate
the freshness of a materialization, the engine doesn't even need to look at
the view lineage.  IMO, the view lineage is nice to have but not a
necessary requirement for MVs.  The view lineage makes sharing of views
between engines without common SQL dialects possible.

Benny

On Thu, Aug 15, 2024 at 12:22 AM Jan Kaul <jank...@mailbox.org.invalid>
wrote:

> Hi all,
>
> I would like to reemphasize the purpose of the refresh-state for
> materialized views. The purpose is to determine if the precomputed data is
> fresh, stale or invalid. For that the current snapshot-id of every table in
> the query tree has to be fetched from the catalog by using its full
> identifier and ref. Additionally the refresh state stores the snapshot-id
> of the last refresh.
>
> To summarize: *To determine the freshness of the precomputed data we
> require the full identifier + ref and snapshot-id of the last refresh for
> every table in the fully expanded query tree*
>
> This is a requirement from how the catalog works and independent from how
> we design the lineage/refresh state. Additionally we previously agreed that
> we should be able to obtain the full list of identifiers without needing to
> parse the SQL definition.
>
> Now we are having a discussion in how to store and obtain the fully
> expanded list of table identifiers and snapshot-ids. To move the discussion
> forward I think it would be valuable to answer the following 3 questions:
>
> 1. Should we move the identifiers out of the refresh-state into a new
> lineage record that is stored as part of the view metadata?
>
> 2. If yes, should the lineage in the view be fully expanded?
>
> 3. What should be used as an identifier in the lineage to reference
> entries in the refresh-state?
>
> 1. Question:
>
> We already agreed that this would be a good idea because we wouldn't
> introduce the identifier concept to the table metadata. However, looking at
> the complexity that comes with the alternatives, I would like to keep this
> question open.
>
> 2. Question:
>
> I'm against using a not fully expanded lineage in the view struct. To
> recall we require every identifier in the fully expanded query tree to
> determine the freshness. Not storing all identifiers in the lineage would
> mean to recursively call the catalog and expand the query tree at read
> time. This can lead to a large overhead for determining the refresh state
> compared to expanding the query tree once at creation time and then storing
> the fully expanded lineage.
>
> 3. Question:
>
> This depends on Question 2.
>
> For a not fully expanded lineage, the only options would be uuids or
> catalog identifiers.
>
> For a fully expanded lineage the question isn't all that relevant. The
> current design specifies that the lineage is a map from an identifier to an
> id and the refresh-state is a map from such id to a snapshot-id. For this
> to work we don't have to specify which kind of identifier has to be used.
> One query engine could use uuids, the other engine sequence-ids. The
> important assumption we are making is that every id that is used in the
> refresh-state has to be defined in the lineage.
> So the question about using uuids is rather, can the query engine trust
> that the id defined in the lineage is the uuid of the table.
>
>
> Regarding the complexity that comes from introducing the lineage in the
> view I would like to revisit question 1. Introducing the lineage in the
> view metadata opens up the question of when should the lineage be fully
> expanded. We see that we have 3 options:
>
> 1. Not fully expanded lineage -> Expansion at read time
>
> 2. Fully expanded lineage -> Expansion at creation time
>
> 3. No lineage (use identifiers in refresh-state) -> Expansion at refresh
> time
>
> As reading is expected to be the most frequent operation I see option 1 as
> not favorable. As the query engine has to fully expand the query tree for a
> refresh anyway, I see option 3 as the most natural. For a refresh operation
> the query engine must understand the SQL dialects of all views in the query
> tree and therefore is guaranteed to successfully expand the lineage. This
> might not be the case at creation time, which makes option 2 less favorable.
>
> As can be seen, I'm in favor of just storing the refresh-state as a map
> from identifier to snapshot-id and not using the lineage. I know that this
> introduces the concept of a catalog identifiers to the table metadata spec,
> but in my opinion it is by far the simplest option.
>
> I'm interested in your opinions.
>
> Best wishes,
>
> Jan
> On 14.08.24 22:24, Walaa Eldin Moustafa wrote:
>
> Thanks Benny. For refs, I am +1 to represent them as UUID + optional ref,
> although we can iterate ohe exact JSON structure (e.g., another option is
> splitting for (UUID) state from (UUID + ref) state into two separate
> higher-level fields).
>
> Generally agree on REFRESH VIEW strategy could be up to the engine, but it
> seems like an area where Iceberg could have an opinion/spec on. I will
> start a separate thread for that.
>
> Thanks,
> Walaa.
>
>

Reply via email to