I tried to summarize this thread in:
https://github.com/apache/iceberg/pull/8982
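
To make the two time-travel behaviors discussed below concrete, here is a minimal Python sketch. It is not Iceberg's implementation: the `Snapshot` class and both function names are hypothetical, and it assumes that the snapshot log is a chronological list of (timestamp, snapshot id) pairs recording when each snapshot became current, and that each snapshot carries its parent id and commit timestamp.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical, simplified stand-in for a row of snapshot metadata.
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int  # commit time of the snapshot


def as_of_snapshot_log(
    snapshot_log: List[Tuple[int, int]], timestamp_ms: int
) -> Optional[int]:
    """First behavior: the snapshot a reader would have seen at that time.

    snapshot_log holds (timestamp_ms, snapshot_id) pairs in chronological
    order (assumed to record when each snapshot became current).
    """
    result = None
    for logged_ms, snapshot_id in snapshot_log:
        if logged_ms <= timestamp_ms:
            result = snapshot_id
    return result


def as_of_ancestor(
    snapshots: Iterable[Snapshot], head_id: int, timestamp_ms: int
) -> Optional[int]:
    """Second behavior: the newest ancestor of head_id (inclusive) that was
    committed at or before timestamp_ms, found by walking parent pointers."""
    by_id = {s.snapshot_id: s for s in snapshots}
    current = by_id.get(head_id)
    while current is not None:
        if current.timestamp_ms <= timestamp_ms:
            return current.snapshot_id
        current = by_id.get(current.parent_id)
    return None


# Snapshot 2 is committed at t=200 as a staged snapshot on top of snapshot 1,
# but main is only fast-forwarded to it at t=300.
snapshots = [Snapshot(1, None, 100), Snapshot(2, 1, 200)]
snapshot_log = [(100, 1), (300, 2)]

print(as_of_snapshot_log(snapshot_log, 250))  # 1: what a job at t=250 read
print(as_of_ancestor(snapshots, 2, 250))      # 2: ancestor committed by t=250
```

Passing a branch's head snapshot id as `head_id` gives the ancestor-based lookup on a branch, which corresponds to the lookup Jack describes doing through the refs and snapshots tables.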

On Wed, Apr 26, 2023 at 1:16 PM Jack Ye <yezhao...@gmail.com> wrote:

> We probably want to document these two different behaviors, and what we
> think is the correct expected behavior on the website.
>
> The question about time travel in a branch comes up quite often now that the
> related feature has been publicly released. If some users really want the
> ancestor-based behavior, it is technically possible to get it by joining
> the refs and snapshots metadata tables to figure out the specific
> version to travel to. We could provide some SQL examples for that.
>
> -Jack
>
> On Wed, Apr 26, 2023 at 10:34 AM Ryan Blue <b...@tabular.io> wrote:
>
>> > Just to make this explicit: "history" here refers to the "snapshot-log"
>> in the spec?
>>
>> Yes, that's correct. The snapshot log is the history of which snapshots
>> were current. We don't keep history for other branches right now. I'm not
>> sure that we would want to.
>>
>> > Given how time-travel is currently defined, this still seems doable but
>> less efficient by using the "metadata-log" and opening historic files but
>> probably not worth the effort
>>
>> Yeah, it would be doable, but I don't think it is a good idea to go
>> through old metadata files. There is no guarantee that they still exist,
>> and we want the current metadata file to be the source of truth for all
>> Iceberg operations.
>>
>> I think it's probably a good idea to note that this is the expected time
>> travel behavior.
>>
>> On Wed, Apr 26, 2023 at 8:42 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Thanks for the replies Ryan and Amogh,
>>>
>>> Time travel relies on history which captures all the changes on the main
>>>> table state.
>>>
>>> Just to make this explicit: "history" here refers to the "snapshot-log"
>>> in the spec?
>>>
>>> We decided the first option is easier to understand and is what people
>>>> expect. That way if you're debugging an old job, you get the same version
>>>> it would have read, even if there are later changes like fast-forwarding
>>>> the current state to a staged snapshot after validating it, or rolling 
>>>> back.
>>>
>>> Makes sense, thanks for the context.
>>>
>>> For “I assume in this case users need to query the underlying Iceberg
>>>> metadata to determine a snapshot of interest)?” just curious how were you
>>>> planning on doing this (bearing in mind time travel relies on history)?
>>>
>>> Originally, I was thinking of time-travel as the second option that Ryan
>>> mentioned, in which case it seemed like a metadata only operation.  Given
>>> how time-travel is currently defined, this still seems doable but less
>>> efficient by using the "metadata-log" and opening historic files but
>>> probably not worth the effort.
>>>
>>> Do you think it pays to add a note for implementers in the specification
>>> that the "snapshot-log" (assuming I got the correct field) is what is used
>>> in reference implementations for time-travel (apologies if this is already
>>> covered and I missed it)?
>>>
>>> Thanks,
>>> Micah
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Apr 25, 2023 at 4:33 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Everything Amogh said is correct, but I can give a bit more context.
>>>>
>>>> There are two options for the behavior of time travel by timestamp.
>>>> First, you can read the state of the table that you _would have read_ if
>>>> you ran the query at that time. Second, you could read the ancestor of the
>>>> current state that was "current" at that time.
>>>>
>>>> We decided the first option is easier to understand and is what people
>>>> expect. That way if you're debugging an old job, you get the same version
>>>> it would have read, even if there are later changes like fast-forwarding
>>>> the current state to a staged snapshot after validating it, or rolling 
>>>> back.
>>>>
>>>> Ryan
>>>>
>>>> On Tue, Apr 25, 2023 at 3:35 PM Jahagirdar, Amogh
>>>> <jaham...@amazon.com.invalid> wrote:
>>>>
>>>>> Hi Micah,
>>>>>
>>>>>
>>>>>
>>>>> Your understanding is right: as of today, there is no mechanism for
>>>>> performing time travel on a branch. Time travel relies on history, which
>>>>> captures all the changes to the main table state. At present there is no
>>>>> history metadata for branches (we can’t use snapshot lineages). For more
>>>>> details, check out this PR comment.
>>>>> <https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420>
>>>>>
>>>>> For “I assume in this case users need to query the underlying Iceberg
>>>>> metadata to determine a snapshot of interest)?” just curious how were you
>>>>> planning on doing this (bearing in mind time travel relies on history)?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>>
>>>>>
>>>>> *From: *Micah Kornfield <emkornfi...@gmail.com>
>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>>> *Date: *Tuesday, April 25, 2023 at 3:09 PM
>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>> *Subject: *[EXTERNAL] SQL Syntax for Time Travel on a Branch?
>>>>>
>>>>>
>>>>>
>>>>> Looking through the documents for Spark SQL syntax [1], it appears
>>>>> that Iceberg supports reading a branch at the latest version or
>>>>> time-travel on the main table, but I didn't see any queries that
>>>>> compose the two.
>>>>>
>>>>>
>>>>>
>>>>> Is my understanding correct that there isn't existing SQL for time
>>>>> travel on a specific branch (I assume in this case users need to query the
>>>>> underlying Iceberg metadata to determine a snapshot of interest)?
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Micah
>>>>>
>>>>>
>>>>>
>>>>> [1] https://iceberg.apache.org/docs/latest/spark-queries/
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>
