Re: SQL Syntax for Time Travel on a Branch?

Jack Ye Wed, 26 Apr 2023 13:15:53 -0700

We probably want to document these two different behaviors, and what we
think is the correct expected behavior on the website.


The question about time travel in a branch has come up quite often since the
related feature was publicly released. If some users really want the
ancestor-based behavior, it is technically possible to get it through some
joins of the refs and snapshots metadata tables to figure out the specific
version to travel to. We could provide some SQL examples for that.
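
To make the two behaviors concrete, here is a minimal Python sketch rather
than the SQL itself. It is not Iceberg code: the Snapshot and LogEntry types,
their field names, and the sample IDs and timestamps are all hypothetical
stand-ins for rows of the snapshots table, the snapshot log, and the refs
table. It only illustrates how a snapshot-log lookup and an ancestor walk can
pick different versions for the same timestamp.

```python
from typing import Dict, List, NamedTuple, Optional


class Snapshot(NamedTuple):
    # Hypothetical stand-in for one row of the snapshots table.
    snapshot_id: int
    parent_id: Optional[int]
    committed_at: int  # commit timestamp (ms)


class LogEntry(NamedTuple):
    # Hypothetical stand-in for one snapshot-log entry on the main state.
    made_current_at: int  # when this snapshot became current (ms)
    snapshot_id: int


def as_of_snapshot_log(log: List[LogEntry], ts: int) -> Optional[int]:
    """Snapshot that was current at `ts` according to the log.

    This is what a query run at that time would have read. The log is
    assumed to be ordered by timestamp.
    """
    current = None
    for entry in log:
        if entry.made_current_at > ts:
            break
        current = entry.snapshot_id
    return current


def as_of_ancestor(
    snapshots: Dict[int, Snapshot], head_id: int, ts: int
) -> Optional[int]:
    """Newest ancestor of `head_id` committed at or before `ts`.

    Walks parent pointers from the head of a ref (main or a branch).
    """
    cursor: Optional[int] = head_id
    while cursor is not None:
        snap = snapshots[cursor]
        if snap.committed_at <= ts:
            return snap.snapshot_id
        cursor = snap.parent_id
    return None


# Sample data (made up): main commits 1 then 2, rolls back to 1 at t=300,
# then commits 3 on top of 1. A branch "audit" holds 4 and 5 on top of 1.
SNAPSHOTS = {
    1: Snapshot(1, None, 100),
    2: Snapshot(2, 1, 200),
    3: Snapshot(3, 1, 400),
    4: Snapshot(4, 1, 220),
    5: Snapshot(5, 4, 350),
}
MAIN_LOG = [LogEntry(100, 1), LogEntry(200, 2), LogEntry(300, 1), LogEntry(400, 3)]
REFS = {"main": 3, "audit": 5}  # ref name -> head snapshot id

if __name__ == "__main__":
    print(as_of_snapshot_log(MAIN_LOG, 250))               # log-based, main
    print(as_of_ancestor(SNAPSHOTS, REFS["main"], 250))    # ancestor-based, main
    print(as_of_ancestor(SNAPSHOTS, REFS["audit"], 250))   # ancestor-based, branch
```

With this sample data, the log-based lookup at t=250 returns snapshot 2 (the
rolled-back version an old job would have read), while the ancestor walk from
the current main head returns snapshot 1. The ancestor walk also works for the
branch head, since it only needs the ref and the parent pointers.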

-Jack

On Wed, Apr 26, 2023 at 10:34 AM Ryan Blue <b...@tabular.io> wrote:

> > Just to make this explicit, "history" here refers to the "snapshot-log" in
> > the spec?
>
> Yes, that's correct. The snapshot log is the history of which snapshots were
> current. We don't keep history for other branches right now. I'm not sure
> that we would want to.
>
> > Given how time-travel is currently defined, this still seems doable but
> > less efficient by using the "metadata-log" and opening historic files but
> > probably not worth the effort.
>
> Yeah, it would be doable, but I don't think it is a good idea to go
> through old metadata files. There is no guarantee that they still exist,
> and we want the current metadata file to be the source of truth for all
> Iceberg operations.
>
> I think it's probably a good idea to note that this is the expected time
> travel behavior.
>
> On Wed, Apr 26, 2023 at 8:42 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Thanks for the replies Ryan and Amogh,
>>
>>> Time travel relies on history which captures all the changes on the main
>>> table state.
>>
>> Just to make this explicit, "history" here refers to the "snapshot-log" in
>> the spec?
>>
>>> We decided the first option is easier to understand and is what people
>>> expect. That way if you're debugging an old job, you get the same version
>>> it would have read, even if there are later changes like fast-forwarding
>>> the current state to a staged snapshot after validating it, or rolling back.
>>
>> Makes sense, thanks for the context.
>>
>>> For “I assume in this case users need to query the underlying Iceberg
>>> metadata to determine a snapshot of interest)?” just curious how were you
>>> planning on doing this (bearing in mind time travel relies on history)?
>>
>> Originally, I was thinking of time-travel as the second option that Ryan
>> mentioned, in which case it seemed like a metadata only operation.  Given
>> how time-travel is currently defined, this still seems doable but less
>> efficient by using the "metadata-log" and opening historic files but
>> probably not worth the effort.
>>
>> Do you think it pays to add a note for implementers in the specification
>> that the "snapshot-log" (assuming I got the correct field) is what is used
>> in reference implementations for time-travel (apologies if this is already
>> covered and I missed it)?
>>
>> Thanks,
>> Micah
>>
>>
>>
>>
>>
>> On Tue, Apr 25, 2023 at 4:33 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Everything Amogh said is correct, but I can give a bit more context.
>>>
>>> There are two options for the behavior of time travel by timestamp.
>>> First, you can read the state of the table that you _would have read_ if
>>> you ran the query at that time. Second, you could read the ancestor of the
>>> current state that was "current" at that time.
>>>
>>> We decided the first option is easier to understand and is what people
>>> expect. That way if you're debugging an old job, you get the same version
>>> it would have read, even if there are later changes like fast-forwarding
>>> the current state to a staged snapshot after validating it, or rolling back.
>>>
>>> Ryan
>>>
>>> On Tue, Apr 25, 2023 at 3:35 PM Jahagirdar, Amogh
>>> <jaham...@amazon.com.invalid> wrote:
>>>
>>>> Hi Micah,
>>>>
>>>>
>>>>
>>>> Your understanding is right: as of today there is no mechanism for
>>>> performing time travel on a branch. Time travel relies on history, which
>>>> captures all the changes on the main table state. At present there is no
>>>> history metadata for branches (we can’t use snapshot lineages); for more
>>>> details, check out this PR comment:
>>>> <https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420>
>>>>
>>>> For “I assume in this case users need to query the underlying Iceberg
>>>> metadata to determine a snapshot of interest)?” just curious how were you
>>>> planning on doing this (bearing in mind time travel relies on history)?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>>
>>>> Amogh Jahagirdar
>>>>
>>>>
>>>>
>>>> *From: *Micah Kornfield <emkornfi...@gmail.com>
>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>> *Date: *Tuesday, April 25, 2023 at 3:09 PM
>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>> *Subject: *[EXTERNAL] SQL Syntax for Time Travel on a Branch?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Looking through the documents for Spark SQL syntax [1], it appears that
>>>> Iceberg supports reading a branch at the latest version or time-travel on
>>>> the main table, but I didn't see any queries that compose the two.
>>>>
>>>>
>>>>
>>>> Is my understanding correct that there isn't existing SQL for time
>>>> travel on a specific branch (I assume in this case users need to query the
>>>> underlying Iceberg metadata to determine a snapshot of interest)?
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Micah
>>>>
>>>>
>>>>
>>>> [1] https://iceberg.apache.org/docs/latest/spark-queries/
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Ryan Blue
> Tabular
>
