Hey Guys, Great progress on the MV spec and thanks a ton to Jan and Walaa for driving this. One of our latest achievements was that we finalized the view lineage and materialization table refresh JSON so that we can definitively and concisely describe what data is in the materialization table.
Regarding the actual refresh process, I have two more suggestions: *When should a MV be refreshed? *There could be many different refresh policies such as "on table data or view change", periodic, scheduled and/or manual with the goal of reducing staleness while minimizing cost to refresh. I don't think we should try to capture this configuration as part of the first iteration of the MV spec. So, I suggest we just remove the " *materialization.data.max-staleness*" view property for now. There's a lot of comments on this in the spec and many contributors did suggest to not include it as well. https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA *When refresh is done, what information is stored to help engines evaluate materialization freshness?* We agreed on storing the view lineage and materialization refresh-tables so that engines can query for the current table snapshot versions and view versions and compare to what is stored in the refresh-tables. However, there could be 100s of tables/views here and it could be prohibitively expensive to do this check. Instead, the engine may just use the materialization's snapshot summary timestamp-ms to determine the last refresh time and assume the data is fresh as of this timestamp. However, this assumption might be naive if the refresh job took 1 hour to run and source tables were queried at different times throughout the execution of the job. So, I propose we add a " *refresh-start-timestamp-ms*" to the materialization snapshot summary which tells users that the data in the materialization is at least as fresh as of this date (It might be fresher but not more stale). Thoughts? Thanks Benny