I tried to summarize this thread in this PR: https://github.com/apache/iceberg/pull/8982

On Wed, Apr 26, 2023 at 1:16 PM Jack Ye <yezhao...@gmail.com> wrote:

> We probably want to document these two different behaviors, and what we
> think is the correct expected behavior on the website.
>
> The question about time travel in a branch comes up quite often since the
> related feature is publicly released. If some users really want the
> ancestor-based behavior, it is technically possible to do it through some
> joins of the refs and snapshots reference table to figure out the specific
> version to travel to. We could provide some SQL examples for that.
>
> -Jack
>
> On Wed, Apr 26, 2023 at 10:34 AM Ryan Blue <b...@tabular.io> wrote:
>
>> > Just to make this explicit "history" here refers to the "snapshot-log"
>> > in the spec?
>>
>> Yes, that's correct. The snapshot log is the history of what snapshots
>> were current. We don't keep history for other branches right now. I'm not
>> sure that we would want to.
>>
>> > Given how time-travel is currently defined, this still seems doable but
>> > less efficient by using the "metadata-log" and opening historic files but
>> > probably not worth the effort
>>
>> Yeah, it would be doable, but I don't think it is a good idea to go
>> through old metadata files. There is no guarantee that they still exist,
>> and we want the current metadata file to be the source of truth for all
>> Iceberg operations.
>>
>> I think it's probably a good idea to note that this is the expected time
>> travel behavior.
>>
>> On Wed, Apr 26, 2023 at 8:42 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> Thanks for the replies Ryan and Amogh,
>>>
>>>> Time travel relies on history which captures all the changes on the main
>>>> table state.
>>>
>>> Just to make this explicit "history" here refers to the "snapshot-log" in
>>> the spec?
>>>
>>>> We decided the first option is easier to understand and is what people
>>>> expect. That way if you're debugging an old job, you get the same version
>>>> it would have read, even if there are later changes like fast-forwarding
>>>> the current state to a staged snapshot after validating it, or rolling
>>>> back.
>>>
>>> Makes sense, thanks for the context.
>>>
>>>> For “I assume in this case users need to query the underlying Iceberg
>>>> metadata to determine a snapshot of interest)?” just curious how were you
>>>> planning on doing this (bearing in mind time travel relies on history)?
>>>
>>> Originally, I was thinking of time-travel as the second option that Ryan
>>> mentioned, in which case it seemed like a metadata only operation. Given
>>> how time-travel is currently defined, this still seems doable but less
>>> efficient by using the "metadata-log" and opening historic files but
>>> probably not worth the effort.
>>>
>>> Do you think it pays to add a note for implementers in the specification
>>> that the "snapshot-log" (assuming I got the correct field) is what is used
>>> in reference implementations for time-travel (apologies if this is already
>>> covered and I missed it)?
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Tue, Apr 25, 2023 at 4:33 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Everything Amogh said is correct, but I can give a bit more context.
>>>>
>>>> There are two options for the behavior of time travel by timestamp.
>>>> First, you can read the state of the table that you _would have read_ if
>>>> you ran the query at that time. Second, you could read the ancestor of the
>>>> current state that was "current" at that time.
>>>>
>>>> We decided the first option is easier to understand and is what people
>>>> expect. That way if you're debugging an old job, you get the same version
>>>> it would have read, even if there are later changes like fast-forwarding
>>>> the current state to a staged snapshot after validating it, or rolling
>>>> back.
>>>>
>>>> Ryan
>>>>
>>>> On Tue, Apr 25, 2023 at 3:35 PM Jahagirdar, Amogh
>>>> <jaham...@amazon.com.invalid> wrote:
>>>>
>>>>> Hi Micah,
>>>>>
>>>>> Your understanding is right, as of today there is no mechanism for
>>>>> performing time travel on a branch. Time travel relies on history which
>>>>> captures all the changes on the main table state. At present there is no
>>>>> history metadata for branches (we can’t use snapshot lineages), for more
>>>>> details check out this PR comment.
>>>>> <https://github.com/apache/iceberg/pull/5364#issuecomment-1227902420>
>>>>>
>>>>> For “I assume in this case users need to query the underlying Iceberg
>>>>> metadata to determine a snapshot of interest)?” just curious how were you
>>>>> planning on doing this (bearing in mind time travel relies on history)?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Amogh Jahagirdar
>>>>>
>>>>> *From: *Micah Kornfield <emkornfi...@gmail.com>
>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>>> *Date: *Tuesday, April 25, 2023 at 3:09 PM
>>>>> *To: *Iceberg Dev List <dev@iceberg.apache.org>
>>>>> *Subject: *[EXTERNAL] SQL Syntax for Time Travel on a Branch?
>>>>>
>>>>> Looking through the documents for Spark SQL syntax [1], it appears
>>>>> that Iceberg supports reading a branch at the latest version or
>>>>> time-travel on the main table, but I didn't see any queries that
>>>>> compose the two.
>>>>>
>>>>> Is my understanding correct that there isn't existing SQL for time
>>>>> travel on a specific branch (I assume in this case users need to query the
>>>>> underlying Iceberg metadata to determine a snapshot of interest)?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Micah
>>>>>
>>>>> [1] https://iceberg.apache.org/docs/latest/spark-queries/
>>>>
>>>> --
>>>> Ryan Blue
>>>> Tabular
>>
>> --
>> Ryan Blue
>> Tabular