Hi Benny,

on the staleness topic I'd recommend to check how Trino implements
materialized views in Iceberg and how it defines staleness.
In particular

- a view can have defined grace period which defines how stale the data can
be for the materialization to be considered useful (defaults to unlimited)
- staleness clock starts with the first table change after refresh
- for unmanaged (non-iceberg) tables where we don't know when the table
changed, the staleness clock starts right after refresh

Best
Piotr





On Wed, 19 Jun 2024 at 19:58, Benny Chow <btc...@gmail.com> wrote:

> Hey Guys,
>
> Great progress on the MV spec and thanks a ton to Jan and Walaa for
> driving this.  One of our latest achievements was that we finalized the
> view lineage and materialization table refresh JSON so that we can
> definitively and concisely describe what data is in the materialization
> table.
>
> Regarding the actual refresh process, I have two more suggestions:
>
> *When should a MV be refreshed?  *There could be many different refresh
> policies such as "on table data or view change", periodic, scheduled and/or
> manual with the goal of reducing staleness while minimizing cost to
> refresh.  I don't think we should try to capture this configuration as part
> of the first iteration of the MV spec.  So, I suggest we just remove the "
> *materialization.data.max-staleness*" view property for now.  There's a
> lot of comments on this in the spec and many contributors did suggest to
> not include it as well.
> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA
>
> *When refresh is done, what information is stored to help engines evaluate
> materialization freshness?*  We agreed on storing the view lineage and
> materialization refresh-tables so that engines can query for the current
> table snapshot versions and view versions and compare to what is stored in
> the refresh-tables.  However, there could be 100s of tables/views here and
> it could be prohibitively expensive to do this check.  Instead, the engine
> may just use the materialization's snapshot summary timestamp-ms to
> determine the last refresh time and assume the data is fresh as of this
> timestamp.  However, this assumption might be naive if the refresh job took
> 1 hour to run and source tables were queried at different times throughout
> the execution of the job.  So, I propose we add a "
> *refresh-start-timestamp-ms*" to the materialization snapshot summary
> which tells users that the data in the materialization is at least as fresh
> as of this date  (It might be fresher but not more stale).
>
> Thoughts?
>
> Thanks
> Benny
>
>
>
>
>

Reply via email to