Hi Benny, on the staleness topic I'd recommend to check how Trino implements materialized views in Iceberg and how it defines staleness. In particular
- a view can have defined grace period which defines how stale the data can be for the materialization to be considered useful (defaults to unlimited) - staleness clock starts with the first table change after refresh - for unmanaged (non-iceberg) tables where we don't know when the table changed, the staleness clock starts right after refresh Best Piotr On Wed, 19 Jun 2024 at 19:58, Benny Chow <btc...@gmail.com> wrote: > Hey Guys, > > Great progress on the MV spec and thanks a ton to Jan and Walaa for > driving this. One of our latest achievements was that we finalized the > view lineage and materialization table refresh JSON so that we can > definitively and concisely describe what data is in the materialization > table. > > Regarding the actual refresh process, I have two more suggestions: > > *When should a MV be refreshed? *There could be many different refresh > policies such as "on table data or view change", periodic, scheduled and/or > manual with the goal of reducing staleness while minimizing cost to > refresh. I don't think we should try to capture this configuration as part > of the first iteration of the MV spec. So, I suggest we just remove the " > *materialization.data.max-staleness*" view property for now. There's a > lot of comments on this in the spec and many contributors did suggest to > not include it as well. > https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA > > *When refresh is done, what information is stored to help engines evaluate > materialization freshness?* We agreed on storing the view lineage and > materialization refresh-tables so that engines can query for the current > table snapshot versions and view versions and compare to what is stored in > the refresh-tables. However, there could be 100s of tables/views here and > it could be prohibitively expensive to do this check. Instead, the engine > may just use the materialization's snapshot summary timestamp-ms to > determine the last refresh time and assume the data is fresh as of this > timestamp. However, this assumption might be naive if the refresh job took > 1 hour to run and source tables were queried at different times throughout > the execution of the job. So, I propose we add a " > *refresh-start-timestamp-ms*" to the materialization snapshot summary > which tells users that the data in the materialization is at least as fresh > as of this date (It might be fresher but not more stale). > > Thoughts? > > Thanks > Benny > > > > >