Piotr, thanks for the Trino pointers.  I noticed that Trino stores the
refresh start time as a snapshot summary property here
<https://github.com/trinodb/trino/blob/6697fe24481a30d37eb91efd62666165acf379c2/plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java#L3027>.
I think this is exactly what I am asking for with
"refresh-start-timestamp-ms".

Walaa, no.  The suggestion is to not have a grace period, since how to
handle staleness is engine specific.
"refresh-start-timestamp-ms" refers to (1) and not (2).  We should already
have (2) in the snapshot summary timestamp-ms property.
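
To make the distinction concrete, here is a minimal sketch of how an engine might apply its own staleness policy on top of the proposed property.  It is only an illustration: "refresh-start-timestamp-ms" is the property proposed in this thread, the summary is modeled as a plain dict of strings, and the function names and the grace period argument are hypothetical, not part of the spec or of any engine.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional

# Property name proposed in this thread; not part of any released spec.
REFRESH_START_KEY = "refresh-start-timestamp-ms"


def fresh_as_of_ms(summary: Mapping[str, str]) -> Optional[int]:
    """Return the refresh start time (1) in epoch millis, or None if absent."""
    value = summary.get(REFRESH_START_KEY)
    return int(value) if value is not None else None


def is_fresh_enough(summary: Mapping[str, str], now_ms: int,
                    grace_period_ms: Optional[int]) -> bool:
    """Hypothetical engine-side policy: the grace period belongs to the
    engine, not to the view. None means any staleness is acceptable."""
    start_ms = fresh_as_of_ms(summary)
    if start_ms is None:
        return False  # no guarantee recorded; the engine must decide another way
    if grace_period_ms is None:
        return True
    return now_ms - start_ms <= grace_period_ms


def to_ms(dt: datetime) -> int:
    return int(dt.timestamp() * 1000)


# A refresh that started at 12:02:10 UTC and committed one hour later.
summary = {
    REFRESH_START_KEY: str(to_ms(datetime(2024, 6, 20, 12, 2, 10, tzinfo=timezone.utc))),
    "timestamp-ms": str(to_ms(datetime(2024, 6, 20, 13, 2, 10, tzinfo=timezone.utc))),
}
now_ms = to_ms(datetime(2024, 6, 20, 13, 30, 0, tzinfo=timezone.utc))
one_hour_ms = 60 * 60 * 1000

print(is_fresh_enough(summary, now_ms, one_hour_ms))
```

With a one-hour grace period this rejects the materialization, because the guarantee dates from 12:02:10.  Measuring from the commit time (2) instead would have accepted it, which is the naive assumption the property is meant to avoid.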

When I say "fresh as of", I don't mean the AS OF construct.  It's just
making a guarantee to the consumer of the MV that the materialization
contains data that is "fresh as of" a certain timestamp.  For example, if you
built a materialization on top of 100 tables (possibly a mix of Iceberg and
non-Iceberg) and you know that the refresh job started at, say, 6/20/2024
12:02:10 UTC, then whatever data is in the materialization has to be "fresh
as of" 6/20/2024 12:02:10 UTC.
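
A rough sketch of the refresh side, assuming the job records its start time before it reads any source table.  The function and parameter names (run_refresh, source_readers, write_rows, clock) are made up for illustration and do not correspond to any Iceberg or engine API.

```python
import time
from typing import Callable, Dict, Iterable, List


def wall_clock_ms() -> int:
    return int(time.time() * 1000)


def run_refresh(source_readers: Iterable[Callable[[], List]],
                write_rows: Callable[[List], None],
                clock: Callable[[], int] = wall_clock_ms) -> Dict[str, str]:
    """Hypothetical refresh job returning the summary properties to store."""
    # Captured before any source is read, so every read below happens at or
    # after this instant, whether the source is Iceberg or not.
    refresh_start_ms = clock()

    rows: List = []
    for read in source_readers:
        rows.extend(read())  # reads may be spread over a long-running job

    write_rows(rows)
    commit_ms = clock()  # corresponds to (2), the time the snapshot is written

    return {
        "refresh-start-timestamp-ms": str(refresh_start_ms),  # proposed property
        "timestamp-ms": str(commit_ms),
    }
```

Because the start time is taken first, the stored value is a lower bound on freshness no matter how long the job runs or in what order the sources are read.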

Thanks
Benny




On Thu, Jun 20, 2024 at 11:19 AM Walaa Eldin Moustafa <wa.moust...@gmail.com>
wrote:

> Benny, is the suggestion to couple the "refresh-start-timestamp-ms"
> property with a grace period as well? Also, could you clarify which
> timestamp "refresh-start-timestamp-ms" refers to:
> (1) Timestamp when refresh is triggered
> (2) Timestamp when refresh is concluded and the snapshot is written.
>
> Also when you say "fresh as of" this timestamp, do you mean "AS OF"
> construct when used to query the materialized view? Or something else? If
> "AS OF" is what you meant, then this might answer my question about the
> grace period, where it won't be needed.
>
> Thanks,
> Walaa.
>
>
>
> On Thu, Jun 20, 2024 at 5:22 AM Piotr Findeisen <piotr.findei...@gmail.com>
> wrote:
>
>> Hi Benny,
>>
>> on the staleness topic I'd recommend checking how Trino implements
>> materialized views in Iceberg and how it defines staleness.
>> In particular
>>
>> - a view can have a defined grace period which defines how stale the data
>> can be for the materialization to be considered useful (defaults to
>> unlimited)
>> - the staleness clock starts with the first table change after refresh
>> - for unmanaged (non-Iceberg) tables where we don't know when the table
>> changed, the staleness clock starts right after refresh
>>
>> Best
>> Piotr
>>
>>
>>
>>
>>
>> On Wed, 19 Jun 2024 at 19:58, Benny Chow <btc...@gmail.com> wrote:
>>
>>> Hey Guys,
>>>
>>> Great progress on the MV spec and thanks a ton to Jan and Walaa for
>>> driving this.  One of our latest achievements was that we finalized the
>>> view lineage and materialization table refresh JSON so that we can
>>> definitively and concisely describe what data is in the materialization
>>> table.
>>>
>>> Regarding the actual refresh process, I have two more suggestions:
>>>
>>> *When should an MV be refreshed?*  There could be many different refresh
>>> policies such as "on table data or view change", periodic, scheduled and/or
>>> manual, with the goal of reducing staleness while minimizing the cost of
>>> refresh.  I don't think we should try to capture this configuration as part
>>> of the first iteration of the MV spec.  So, I suggest we just remove the
>>> "*materialization.data.max-staleness*" view property for now.  There are a
>>> lot of comments on this in the spec, and many contributors also suggested
>>> not including it.
>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?disco=AAABFwRPGoA
>>>
>>> *When refresh is done, what information is stored to help engines
>>> evaluate materialization freshness?*  We agreed on storing the view
>>> lineage and materialization refresh-tables so that engines can query for
>>> the current table snapshot versions and view versions and compare to what
>>> is stored in the refresh-tables.  However, there could be 100s of
>>> tables/views here and it could be prohibitively expensive to do this
>>> check.  Instead, the engine may just use the materialization's snapshot
>>> summary timestamp-ms to determine the last refresh time and assume the data
>>> is fresh as of this timestamp.  However, this assumption might be naive if
>>> the refresh job took 1 hour to run and source tables were queried at
>>> different times throughout the execution of the job.  So, I propose we add
>>> a "*refresh-start-timestamp-ms*" to the materialization snapshot
>>> summary which tells users that the data in the materialization is fresh
>>> as of at least this timestamp (it might be fresher, but not more stale).
>>>
>>> Thoughts?
>>>
>>> Thanks
>>> Benny
>>>
>>>
>>>
>>>
>>>
>>
