Piotr, thanks for the Trino pointers. I noticed that Trino stores the refresh start time as a snapshot summary property here <https://github.com/trinodb/trino/blob/6697fe24481a30d37eb91efd62666165acf379c2/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L3027>. I think this is exactly what I am asking for with "refresh-start-timestamp-ms".
Walaa, no, the suggestion is to not have a grace period as this is engine specific on how it wants to handle staleness. "refresh-start-timestamp-ms" refers to 1 and not 2. We should already have 2 in the snapshot summary timestamp-ms property. When I say "fresh as of", I don't mean the AS OF construct. It's just making a guarantee to the consumer of the MV that the materialization contains data that is "fresh as of" a certain timestamp. So like, if you built a materialization on top of 100 tables (possibly a mix of Iceberg and non-Iceberg) and you know that the refresh job ran on say 6/20/2024 12:02:10 UTC, then whatever data is in the materialization has to be "fresh as of" 6/20/2024 12:02:10 UTC. Thanks Benny On Thu, Jun 20, 2024 at 11:19 AM Walaa Eldin Moustafa <wa.moust...@gmail.com> wrote: > Benny, is the suggestion to couple the "refresh-start-timestamp-ms" > property with a grace period as well? Also, could you clarify which > timestamp "refresh-start-timestamp-ms" refers to: > (1) Timestamp when refresh is triggered > (2) Timestamp when refresh is concluded and the snapshot is written. > > Also when you say "fresh as of" this timestamp, do you mean "AS OF" > construct when used to query the materialized view? Or something else? If > "AS OF" is what you meant, then this might answer my question about the > grace period, where it won't be needed. > > Thanks, > Walaa. > > > > On Thu, Jun 20, 2024 at 5:22 AM Piotr Findeisen <piotr.findei...@gmail.com> > wrote: > >> Hi Benny, >> >> on the staleness topic I'd recommend to check how Trino implements >> materialized views in Iceberg and how it defines staleness. >> In particular >> >> - a view can have defined grace period which defines how stale the data >> can be for the materialization to be considered useful (defaults to >> unlimited) >> - staleness clock starts with the first table change after refresh >> - for unmanaged (non-iceberg) tables where we don't know when the table >> changed, the staleness clock starts right after refresh >> >> Best >> Piotr >> >> >> >> >> >> On Wed, 19 Jun 2024 at 19:58, Benny Chow <btc...@gmail.com> wrote: >> >>> Hey Guys, >>> >>> Great progress on the MV spec and thanks a ton to Jan and Walaa for >>> driving this. One of our latest achievements was that we finalized the >>> view lineage and materialization table refresh JSON so that we can >>> definitively and concisely describe what data is in the materialization >>> table. >>> >>> Regarding the actual refresh process, I have two more suggestions: >>> >>> *When should a MV be refreshed? *There could be many different refresh >>> policies such as "on table data or view change", periodic, scheduled and/or >>> manual with the goal of reducing staleness while minimizing cost to >>> refresh. I don't think we should try to capture this configuration as part >>> of the first iteration of the MV spec. So, I suggest we just remove the " >>> *materialization.data.max-staleness*" view property for now. There's a >>> lot of comments on this in the spec and many contributors did suggest to >>> not include it as well. >>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA >>> >>> *When refresh is done, what information is stored to help engines >>> evaluate materialization freshness?* We agreed on storing the view >>> lineage and materialization refresh-tables so that engines can query for >>> the current table snapshot versions and view versions and compare to what >>> is stored in the refresh-tables. However, there could be 100s of >>> tables/views here and it could be prohibitively expensive to do this >>> check. Instead, the engine may just use the materialization's snapshot >>> summary timestamp-ms to determine the last refresh time and assume the data >>> is fresh as of this timestamp. However, this assumption might be naive if >>> the refresh job took 1 hour to run and source tables were queried at >>> different times throughout the execution of the job. So, I propose we add >>> a "*refresh-start-timestamp-ms*" to the materialization snapshot >>> summary which tells users that the data in the materialization is at least >>> as fresh as of this date (It might be fresher but not more stale). >>> >>> Thoughts? >>> >>> Thanks >>> Benny >>> >>> >>> >>> >>> >>