Re: [DISCUSS] row timestamp proposal

Steven Wu Mon, 11 May 2026 18:30:25 -0700

Circling back on this topic, since we have consensus on the direction. It
essentially has two parts:


   1. monotonic snapshot timestamp for v4 tables
   2. row timestamp inherited from snapshot timestamp for v4 tables


#1 is an isolated and small change. So I created the following PRs:
* spec: https://github.com/apache/iceberg/pull/16294
* impl: https://github.com/apache/iceberg/pull/16293
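
For readers skimming the thread, the Lamport-style rule discussed below for
#1 can be sketched roughly as follows. This is only an illustration, assuming
the writer takes the larger of its wall clock and the parent snapshot
timestamp plus one. The function names and the millisecond unit are
assumptions made for this sketch; the linked spec PR is the authority on the
actual rule.

```python
import time
from typing import List, Optional


def next_snapshot_timestamp_ms(parent_timestamp_ms: Optional[int],
                               now_ms: Optional[int] = None) -> int:
    """Hypothetical helper: return a timestamp strictly greater than the parent's."""
    if now_ms is None:
        # Wall clock in milliseconds (the unit is an assumption of this sketch).
        now_ms = time.time_ns() // 1_000_000
    if parent_timestamp_ms is None:
        # First snapshot of the table: nothing to stay ahead of.
        return now_ms
    # Lamport-style step: never go backwards, even if the local clock did.
    return max(now_ms, parent_timestamp_ms + 1)


def assign_timestamps(clock_readings_ms: List[int]) -> List[int]:
    """Apply the rule to a chain of commits, each seeing its own clock reading."""
    assigned: List[int] = []
    parent: Optional[int] = None
    for reading in clock_readings_ms:
        parent = next_snapshot_timestamp_ms(parent, reading)
        assigned.append(parent)
    return assigned
```

With clock readings of 3, 1, 2 for three successive commits (the out-of-order
example that appears later in this thread), the sketch assigns 3, 4, 5, so the
snapshot timestamps stay strictly increasing.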

#2 is more involved and should probably be done after the v4 metadata tree (
spec <https://github.com/apache/iceberg/pull/16025> and impl
<https://github.com/orgs/apache/projects/605/views/1>) is mostly complete,
as we want to plumb inheritance through only for the v4 tables.
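
The inheritance in #2 might look roughly like the following on the read path.
This is a sketch under assumptions taken from the thread below: a row's
timestamp is null in the data file when first written and is inherited from
the snapshot that added the file, rewrites (like compaction) persist the
value, and rows written before the plumbing existed stay null.
`resolve_row_timestamp` and its parameters are hypothetical names, not part of
the spec or the implementation.

```python
from typing import Optional


def resolve_row_timestamp(stored_timestamp_ms: Optional[int],
                          inherited_timestamp_ms: Optional[int]) -> Optional[int]:
    """Hypothetical read-path rule for a row timestamp metadata column.

    stored_timestamp_ms: value materialized in the data file, if any
        (for example after a rewrite such as compaction).
    inherited_timestamp_ms: timestamp of the snapshot that added the file,
        or None when no such value is available (for example rows that
        predate the inheritance plumbing on an upgraded table).
    """
    if stored_timestamp_ms is not None:
        # A persisted value always wins over the inherited one.
        return stored_timestamp_ms
    # Otherwise inherit from the snapshot; this may still be None.
    return inherited_timestamp_ms
```

Under these assumptions a freshly committed row resolves to its snapshot
timestamp, a compacted row keeps the value written into the file, and a row
from before the upgrade resolves to null.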



On Mon, Jan 26, 2026 at 10:05 AM Russell Spitzer <[email protected]>
wrote:

> Sounds good to me
>
> On Mon, Jan 26, 2026 at 11:59 AM Anton Okolnychyi <[email protected]>
> wrote:
>
>> Cool, sounds like a plan then? Thanks for answering all the questions,
>> Steven!
>>
>> On Thu, Jan 22, 2026 at 6:29 PM Steven Wu <[email protected]> wrote:
>>
>>> For row timestamp inheritance to work, I would need to implement the
>>> plumbing. So I would imagine existing rows would have null values because
>>> the inheritance plumbing was not there yet. This would be consistent with
>>> upgrade behavior for the V3 row lineage:
>>> https://iceberg.apache.org/spec/#row-lineage-for-upgraded-tables.
>>>
>>> On Thu, Jan 22, 2026 at 4:09 PM Anton Okolnychyi <[email protected]>
>>> wrote:
>>>
>>>> Also, do we have a concrete plan for how to handle tables that would be
>>>> upgraded to V4? What timestamp will we assign to existing rows?
>>>>
>>>> On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> If we ignore temporal queries that need strict snapshot boundaries and
>>>>> can't be solved completely using row timestamps in case of mutations, you
>>>>> mentioned other use cases where row timestamps may be helpful, like TTL
>>>>> and auditing. We can debate whether using CURRENT_TIMESTAMP() is enough
>>>>> for them, but I don't really see a point given that we already have row
>>>>> lineage in V3 and the storage overhead for one more field isn't likely to
>>>>> be noticeable. One of the problems with CURRENT_TIMESTAMP() is the
>>>>> required action by the user. Having a reliable row timestamp populated
>>>>> automatically is likely to be better, so +1.
>>>>>
>>>>> On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> Joining with snapshot history also has significant complexity. It
>>>>>> requires retaining the entire snapshot history with probably trimmed
>>>>>> snapshot metadata. There are concerns about the size of the snapshot
>>>>>> history for tables with frequent commits (like streaming ingestion). Do
>>>>>> we maintain the unbounded trimmed snapshot history in the same table
>>>>>> metadata, which could affect table metadata.json size? Or store it
>>>>>> separately somewhere (like in the catalog), which would require the
>>>>>> complexity of multi-entity transactions in the catalog?
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I've gone back and forth on the inherited columns. I think the thing
>>>>>>> which keeps coming back to me is that I don't
>>>>>>> like that the only way to determine the timestamp associated with a
>>>>>>> row update/creation is to do a join back
>>>>>>> against table metadata. While that's doable, it feels user
>>>>>>> unfriendly.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Anton, you are right that the row-level deletes will be a problem
>>>>>>>> for some of the mentioned use cases (like incremental processing). I
>>>>>>>> have clarified the applicability of some use cases to "tables with
>>>>>>>> inserts and updates only".
>>>>>>>>
>>>>>>>> Right now, we are only tracking modification/commit time (not
>>>>>>>> insertion time) in case of updates.
>>>>>>>>
>>>>>>>> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think there is clear consensus that making snapshot timestamps
>>>>>>>>> strictly increasing is a positive thing. I am also +1.
>>>>>>>>>
>>>>>>>>> - How will row timestamps allow us to reliably implement
>>>>>>>>> incremental consumption independent of the snapshot retention given 
>>>>>>>>> that
>>>>>>>>> rows can be added AND removed in a particular time frame? How can we
>>>>>>>>> capture all changes by just looking at the latest snapshot?
>>>>>>>>> - Some use cases in the doc need the insertion time and some need
>>>>>>>>> the last modification time. Do we plan to support both?
>>>>>>>>> - What do we expect the behavior to be in UPDATE and MERGE
>>>>>>>>> operations?
>>>>>>>>>
>>>>>>>>> To be clear: I am not opposed to this change, just want to make
>>>>>>>>> sure I understand all use cases that we aim to address and what would 
>>>>>>>>> be
>>>>>>>>> required in engines.
>>>>>>>>>
>>>>>>>>> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> +1 for improving how the commit timestamps are
>>>>>>>>>> assigned monotonically since this requirement has emerged over 
>>>>>>>>>> multiple
>>>>>>>>>> discussions like notifications, multi-table transactions, time travel
>>>>>>>>>> accuracy and row timestamps. It would be good to have a single 
>>>>>>>>>> consistent
>>>>>>>>>> way to represent and assign timestamps that could be leveraged across
>>>>>>>>>> multiple features.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yeah, to add my perspective on that discussion, I think my
>>>>>>>>>>> primary concern is that people expect timestamps to be monotonic 
>>>>>>>>>>> and if
>>>>>>>>>>> they aren't then a `_last_update_timestamp` field just makes the 
>>>>>>>>>>> problem
>>>>>>>>>>> worse. But it is _nice_ to have row-level timestamps. So I would be 
>>>>>>>>>>> okay if
>>>>>>>>>>> we revisit how we assign commit timestamps and improve it so that 
>>>>>>>>>>> you get
>>>>>>>>>>> monotonic behavior.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We had an offline discussion with Ryan. I revised the proposal
>>>>>>>>>>>> as follows.
>>>>>>>>>>>>
>>>>>>>>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>>>>>>>>> timestamps. The proposal doc has a section that describes a
>>>>>>>>>>>> recommended implementation using Lamport timestamps.
>>>>>>>>>>>> 2. Expose a *last_update_timestamp* metadata column that
>>>>>>>>>>>> inherits from the snapshot timestamp.
>>>>>>>>>>>>
>>>>>>>>>>>> This is a relatively low-friction change that can fix the time
>>>>>>>>>>>> travel problem and enable use cases like latency tracking, 
>>>>>>>>>>>> temporal query,
>>>>>>>>>>>> TTL, auditing.
>>>>>>>>>>>>
>>>>>>>>>>>> There is no accuracy requirement on the timestamp values. In
>>>>>>>>>>>> practice, modern servers with NTP have pretty reliable wall
>>>>>>>>>>>> clocks. For example, the Java library implements this validation
>>>>>>>>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>>>>>>>>>> that protects against backward clock drift up to one minute for
>>>>>>>>>>>> snapshot timestamps. I don't think we have heard many complaints of
>>>>>>>>>>>> commit failures due to that clock drift validation.
>>>>>>>>>>>>
>>>>>>>>>>>> Would appreciate feedback on the revised proposal.
>>>>>>>>>>>>
>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Steven
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP()
>>>>>>>>>>>>> is usually evaluated quite early in engines so we could 
>>>>>>>>>>>>> theoretically have
>>>>>>>>>>>>> another expression closer to the commit time. You are right, 
>>>>>>>>>>>>> though, it
>>>>>>>>>>>>> won't be the actual commit time given that we have to write it 
>>>>>>>>>>>>> into the
>>>>>>>>>>>>> files. Also, I don't think generating a timestamp for a row as it 
>>>>>>>>>>>>> is being
>>>>>>>>>>>>> written is going to be beneficial. To sum up, expression-based 
>>>>>>>>>>>>> defaults
>>>>>>>>>>>>> would allow us to capture the time the transaction or write 
>>>>>>>>>>>>> starts, but not
>>>>>>>>>>>>> the actual commit time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Russell, if the goal is to know what happened to the table in
>>>>>>>>>>>>> a given time frame, isn't the changelog scan the way to go? It 
>>>>>>>>>>>>> would assign
>>>>>>>>>>>>> commit ordinals based on lineage and include row-level diffs. How 
>>>>>>>>>>>>> would you
>>>>>>>>>>>>> be able to determine changes with row timestamps by just looking 
>>>>>>>>>>>>> at the
>>>>>>>>>>>>> latest snapshot?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>>>>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> > Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>>>> important as the query between time bounds making sense.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm not sure I get it then. If we want monotonically
>>>>>>>>>>>>>> increasing times, but they don't have to be real times then how 
>>>>>>>>>>>>>> do you know
>>>>>>>>>>>>>> what notion of "time" you care about for these filters? Or to 
>>>>>>>>>>>>>> put it
>>>>>>>>>>>>>> another way, how do you know that your "before" and "after" 
>>>>>>>>>>>>>> times are
>>>>>>>>>>>>>> reasonable? If the boundaries of these time queries can move 
>>>>>>>>>>>>>> around a bit,
>>>>>>>>>>>>>> by how much?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems to me that row IDs can play an important role here
>>>>>>>>>>>>>> because you have the order guarantee that we seem to want for 
>>>>>>>>>>>>>> this use
>>>>>>>>>>>>>> case: if snapshot A was committed before snapshot B, then the 
>>>>>>>>>>>>>> rows from A
>>>>>>>>>>>>>> have row IDs that are always less than the rows IDs of B. The 
>>>>>>>>>>>>>> problem is
>>>>>>>>>>>>>> that we don't know where those row IDs start and end once A and 
>>>>>>>>>>>>>> B are no
>>>>>>>>>>>>>> longer tracked. Using a "timestamp" seems to work, but I still 
>>>>>>>>>>>>>> worry that
>>>>>>>>>>>>>> without reliable timestamps that correspond with some guarantee 
>>>>>>>>>>>>>> to real
>>>>>>>>>>>>>> timestamps, we are creating a feature that seems reliable but 
>>>>>>>>>>>>>> isn't.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm somewhat open to the idea of introducing a snapshot
>>>>>>>>>>>>>> timestamp that the catalog guarantees is monotonically 
>>>>>>>>>>>>>> increasing. But if
>>>>>>>>>>>>>> we did that, wouldn't we still need to know the association 
>>>>>>>>>>>>>> between these
>>>>>>>>>>>>>> timestamps and snapshots after the snapshot metadata expires? My 
>>>>>>>>>>>>>> mental
>>>>>>>>>>>>>> model is that this would be used to look for data that arrived, 
>>>>>>>>>>>>>> say, 3
>>>>>>>>>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer 
>>>>>>>>>>>>>> around we
>>>>>>>>>>>>>> could use the row timestamp to find those rows. But how do we 
>>>>>>>>>>>>>> know that the
>>>>>>>>>>>>>> snapshot timestamps correspond to the actual timestamp range of 
>>>>>>>>>>>>>> Dec 24th?
>>>>>>>>>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>>>>>>>>>> timestamps?
>>>>>>>>>>>>>> This is what I mean by needing to keep track of the association 
>>>>>>>>>>>>>> between
>>>>>>>>>>>>>> timestamps and snapshots after the metadata expires. Seems like 
>>>>>>>>>>>>>> you either
>>>>>>>>>>>>>> need to keep track of what the catalog's clock was for events 
>>>>>>>>>>>>>> you care
>>>>>>>>>>>>>> about, or you don't really care about exact timestamps.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The key goal here is the ability to answer the question
>>>>>>>>>>>>>>> "what happened to the table in some time window
>>>>>>>>>>>>>>> (before < t < after)?"
>>>>>>>>>>>>>>> Whether or not "t" is an atomic clock time is not as
>>>>>>>>>>>>>>> important as the query between time bounds making sense.
>>>>>>>>>>>>>>> Downstream applications (from what I know) are mostly
>>>>>>>>>>>>>>> sensitive to getting discrete and well defined answers to
>>>>>>>>>>>>>>> this question like:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>>>>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>>>>>>>>>> 3 < t < 4
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And the union of these should be the same as the query
>>>>>>>>>>>>>>> asking for 1 < t < 4
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently this is not possible because we have no
>>>>>>>>>>>>>>> guarantee of ordering in our timestamps
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Snapshots
>>>>>>>>>>>>>>> A -> B -> C
>>>>>>>>>>>>>>> Sequence numbers
>>>>>>>>>>>>>>> 50 -> 51 ->  52
>>>>>>>>>>>>>>> Timestamp
>>>>>>>>>>>>>>> 3 -> 1 -> 2
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This makes time travel always a little wrong to start with.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The Java implementation only allows one minute of negative
>>>>>>>>>>>>>>> time on commit so we actually kind of do have this as a
>>>>>>>>>>>>>>> "light monotonicity" requirement but as noted above there is
>>>>>>>>>>>>>>> no spec requirement for this.  While we do have sequence
>>>>>>>>>>>>>>> number and row id, we still don't have a stable way of
>>>>>>>>>>>>>>> associating these with a consistent time in an engine 
>>>>>>>>>>>>>>> independent way.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Ideally we just want to have one consistent way of answering
>>>>>>>>>>>>>>> the question "what did the table look like at time t"
>>>>>>>>>>>>>>> which I think we get by adding in a new field that is a
>>>>>>>>>>>>>>> timestamp, set by the Catalog close to commit time,
>>>>>>>>>>>>>>> that always goes up.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm not sure we can really do this with an engine expression
>>>>>>>>>>>>>>> since they won't know when the data is actually committed
>>>>>>>>>>>>>>> when writing files?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This seems like a lot of new complexity in the format. I
>>>>>>>>>>>>>>>> would like us to explore whether we can build the considered 
>>>>>>>>>>>>>>>> use cases on
>>>>>>>>>>>>>>>> top of expression-based defaults instead.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>>>>>>>>>> functions that are part of the SQL standard definition for 
>>>>>>>>>>>>>>>> default values.
>>>>>>>>>>>>>>>> This would provide us a way to know the relative row order. 
>>>>>>>>>>>>>>>> True, this
>>>>>>>>>>>>>>>> usually will represent the start of the operation. We may 
>>>>>>>>>>>>>>>> define
>>>>>>>>>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual 
>>>>>>>>>>>>>>>> commit time, if
>>>>>>>>>>>>>>>> there are use cases that need that. Plus, we may explore an 
>>>>>>>>>>>>>>>> approach
>>>>>>>>>>>>>>>> similar to MySQL that allows users to reset the default value 
>>>>>>>>>>>>>>>> on update.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Anton
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think this is the right step forward. Our current
>>>>>>>>>>>>>>>>> "timestamp" definition is too ambiguous to be useful so 
>>>>>>>>>>>>>>>>> establishing
>>>>>>>>>>>>>>>>> a well defined and monotonic timestamp could be really
>>>>>>>>>>>>>>>>> great. I also like the ability for rows to know this value
>>>>>>>>>>>>>>>>> without
>>>>>>>>>>>>>>>>> having to rely on snapshot information which can be
>>>>>>>>>>>>>>>>> expired.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have revised the row timestamp proposal with the
>>>>>>>>>>>>>>>>>> following changes.
>>>>>>>>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata that
>>>>>>>>>>>>>>>>>> has nanosecond precision.
>>>>>>>>>>>>>>>>>> * this optional field is only set by the REST catalog
>>>>>>>>>>>>>>>>>> server
>>>>>>>>>>>>>>>>>> * it needs to be monotonic (e.g. implemented using
>>>>>>>>>>>>>>>>>> Lamport timestamp)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> For long-running streaming jobs that commit
>>>>>>>>>>>>>>>>>>> periodically, it is difficult to establish the constant 
>>>>>>>>>>>>>>>>>>> value of
>>>>>>>>>>>>>>>>>>> current_timestamp across all writer tasks for each commit 
>>>>>>>>>>>>>>>>>>> cycle. I guess
>>>>>>>>>>>>>>>>>>> streaming writers may just need to write the wall clock 
>>>>>>>>>>>>>>>>>>> time when appending
>>>>>>>>>>>>>>>>>>> a row to a data file for the default value of 
>>>>>>>>>>>>>>>>>>> current_timestamp.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I don't think that every row would have a different
>>>>>>>>>>>>>>>>>>>> value. That would be up to the engine, but I would expect 
>>>>>>>>>>>>>>>>>>>> engines to insert
>>>>>>>>>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with 
>>>>>>>>>>>>>>>>>>>> a constant,
>>>>>>>>>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> You're right that this would not necessarily be the
>>>>>>>>>>>>>>>>>>>> commit time. But neither is the commit timestamp from 
>>>>>>>>>>>>>>>>>>>> Iceberg's snapshot.
>>>>>>>>>>>>>>>>>>>> I'm not sure how we are going to define "good enough" for 
>>>>>>>>>>>>>>>>>>>> this purpose. I
>>>>>>>>>>>>>>>>>>>> think at least `CURRENT_TIMESTAMP` has reliable and known 
>>>>>>>>>>>>>>>>>>>> behavior when you
>>>>>>>>>>>>>>>>>>>> look at how it is handled in engines. And if you want the 
>>>>>>>>>>>>>>>>>>>> Iceberg
>>>>>>>>>>>>>>>>>>>> timestamp, then use a periodic query of the snapshots
>>>>>>>>>>>>>>>>>>>> table to keep track
>>>>>>>>>>>>>>>>>>>> of them in a table you can join to. I don't think this 
>>>>>>>>>>>>>>>>>>>> rises to the need
>>>>>>>>>>>>>>>>>>>> for a table feature unless we can guarantee that it is 
>>>>>>>>>>>>>>>>>>>> correct.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>>>>>>>> transaction start time [1, 2]. Should we extend the same 
>>>>>>>>>>>>>>>>>>>>> semantic to
>>>>>>>>>>>>>>>>>>>>> Iceberg: all rows added in the same snapshot should have 
>>>>>>>>>>>>>>>>>>>>> the same timestamp
>>>>>>>>>>>>>>>>>>>>> value?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT
>>>>>>>>>>>>>>>>>>>>> CURRENT_TIMESTAMP)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Since Postgres current_timestamp captures the
>>>>>>>>>>>>>>>>>>>>> transaction start time, all rows added in the same insert 
>>>>>>>>>>>>>>>>>>>>> transaction would
>>>>>>>>>>>>>>>>>>>>> have the same value as the transaction timestamp with the 
>>>>>>>>>>>>>>>>>>>>> column
>>>>>>>>>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows
>>>>>>>>>>>>>>>>>>>>> added in the same Iceberg transaction/snapshot should 
>>>>>>>>>>>>>>>>>>>>> have the same
>>>>>>>>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Ryan, as I understand your comment about using the
>>>>>>>>>>>>>>>>>>>>> current_timestamp expression as the column default value,
>>>>>>>>>>>>>>>>>>>>> you were thinking that the engine would set the column
>>>>>>>>>>>>>>>>>>>>> value to the wall clock time when appending a row to a
>>>>>>>>>>>>>>>>>>>>> data file, right? Almost every row would have a
>>>>>>>>>>>>>>>>>>>>> different timestamp value.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` expression may not always carry
>>>>>>>>>>>>>>>>>>>>>> the right semantic for the use cases. E.g., latency 
>>>>>>>>>>>>>>>>>>>>>> tracking is interested
>>>>>>>>>>>>>>>>>>>>>> in when records are added / committed to the table, not 
>>>>>>>>>>>>>>>>>>>>>> when the record was
>>>>>>>>>>>>>>>>>>>>>> appended to an uncommitted data file in the processing 
>>>>>>>>>>>>>>>>>>>>>> engine.
>>>>>>>>>>>>>>>>>>>>>> Record creation and Iceberg commit can be minutes or 
>>>>>>>>>>>>>>>>>>>>>> even hours apart.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Row timestamp inherited from snapshot timestamp has
>>>>>>>>>>>>>>>>>>>>>> no overhead with the initial commit and has very minimal 
>>>>>>>>>>>>>>>>>>>>>> storage overhead
>>>>>>>>>>>>>>>>>>>>>> during file rewrite. Per-row current_timestamp would 
>>>>>>>>>>>>>>>>>>>>>> have distinct values
>>>>>>>>>>>>>>>>>>>>>> for every row and has more storage overhead.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> OLTP databases deal with small row-level
>>>>>>>>>>>>>>>>>>>>>> transactions. Postgres `current_timestamp` captures the 
>>>>>>>>>>>>>>>>>>>>>> transaction start
>>>>>>>>>>>>>>>>>>>>>> time [1, 2]. Should we extend the same semantic to 
>>>>>>>>>>>>>>>>>>>>>> Iceberg: all rows added
>>>>>>>>>>>>>>>>>>>>>> in the same snapshot should have the same timestamp 
>>>>>>>>>>>>>>>>>>>>>> value?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this
>>>>>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need
>>>>>>>>>>>>>>>>>>>>>>>> to implement the following row lineage behavior
>>>>>>>>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp
>>>>>>>>>>>>>>>>>>>>>>>> * during rewrite (like compaction), it should be
>>>>>>>>>>>>>>>>>>>>>>>> persisted into data files.
>>>>>>>>>>>>>>>>>>>>>>>> * during update, it needs to be carried over from
>>>>>>>>>>>>>>>>>>>>>>>> the previous row. This is similar to the row_id carry 
>>>>>>>>>>>>>>>>>>>>>>>> over for row updates.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Sorry for the short hand.  These are not the same:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 1.  Insertion time - time the row was inserted.
>>>>>>>>>>>>>>>>>>>>>>> 2.  Created by - The system that created the record.
>>>>>>>>>>>>>>>>>>>>>>> 3.  Updated by - The system that last updated the
>>>>>>>>>>>>>>>>>>>>>>> record.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Depending on the exact use-case these might or might
>>>>>>>>>>>>>>>>>>>>>>> not have utility.  I'm just wondering if there will be
>>>>>>>>>>>>>>>>>>>>>>> more examples like this in the future.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> created_by column would likely incur significantly
>>>>>>>>>>>>>>>>>>>>>>>> higher storage overhead compared to the updated_by
>>>>>>>>>>>>>>>>>>>>>>>> column. As rows are updated over time, the cardinality
>>>>>>>>>>>>>>>>>>>>>>>> for this column in data files can be high. Hence, the
>>>>>>>>>>>>>>>>>>>>>>>> created_by column may not compress well. This is a
>>>>>>>>>>>>>>>>>>>>>>>> similar problem for the row_id column. One side effect
>>>>>>>>>>>>>>>>>>>>>>>> of enabling row lineage by default for V3 tables is the
>>>>>>>>>>>>>>>>>>>>>>>> storage overhead of row_id column after compaction
>>>>>>>>>>>>>>>>>>>>>>>> especially for narrow tables with few columns.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have 
>>>>>>>>>>>>>>>>>>>>>>> all these columns,
>>>>>>>>>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like 
>>>>>>>>>>>>>>>>>>>>>>> it would solve the
>>>>>>>>>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require 
>>>>>>>>>>>>>>>>>>>>>>>> an explicit column in
>>>>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto 
>>>>>>>>>>>>>>>>>>>>>>>> set the column value.
>>>>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to 
>>>>>>>>>>>>>>>>>>>>>>>> the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Since the use cases don't require an exact
>>>>>>>>>>>>>>>>>>>>>>>> timestamp, this seems like the best solution to get 
>>>>>>>>>>>>>>>>>>>>>>>> what people want (an
>>>>>>>>>>>>>>>>>>>>>>>> insertion timestamp) that has clear and well-defined 
>>>>>>>>>>>>>>>>>>>>>>>> behavior. Since
>>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it 
>>>>>>>>>>>>>>>>>>>>>>>> makes sense to me that
>>>>>>>>>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe
>>>>>>>>>>>>>>>>>>>>>>>> he'll jump in on this thread. I think that we may need 
>>>>>>>>>>>>>>>>>>>>>>>> to extend default
>>>>>>>>>>>>>>>>>>>>>>>> values to include default value expressions, like 
>>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp` that
>>>>>>>>>>>>>>>>>>>>>>>> is allowed by the SQL spec. That would solve the 
>>>>>>>>>>>>>>>>>>>>>>>> problem as well as some
>>>>>>>>>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and 
>>>>>>>>>>>>>>>>>>>>>>>> would not create a
>>>>>>>>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp 
>>>>>>>>>>>>>>>>>>>>>>>> feature in the format.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock
>>>>>>>>>>>>>>>>>>>>>>>> service, like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder
>>>>>>>>>>>>>>>>>>>>>>>> commits and make timestamps out of order. I don't 
>>>>>>>>>>>>>>>>>>>>>>>> think that we should be
>>>>>>>>>>>>>>>>>>>>>>>> making guarantees or even exposing metadata that 
>>>>>>>>>>>>>>>>>>>>>>>> people might mistake as
>>>>>>>>>>>>>>>>>>>>>>>> having those guarantees.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regarding the concern for reliable timestamps, we
>>>>>>>>>>>>>>>>>>>>>>>>> are not proposing using timestamps for ordering. With 
>>>>>>>>>>>>>>>>>>>>>>>>> NTP in modern
>>>>>>>>>>>>>>>>>>>>>>>>> computers, they are generally reliable enough for the 
>>>>>>>>>>>>>>>>>>>>>>>>> intended use cases.
>>>>>>>>>>>>>>>>>>>>>>>>> Also some environments may have stronger clock 
>>>>>>>>>>>>>>>>>>>>>>>>> service, like Spanner
>>>>>>>>>>>>>>>>>>>>>>>>> TrueTime service
>>>>>>>>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>
>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots
>>>>>>>>>>>>>>>>>>>>>>>>> metadata table.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot
>>>>>>>>>>>>>>>>>>>>>>>>> history, which is often retained for a few days due 
>>>>>>>>>>>>>>>>>>>>>>>>> to performance reasons.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an 
>>>>>>>>>>>>>>>>>>>>>>>>> implicit one from table
>>>>>>>>>>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require 
>>>>>>>>>>>>>>>>>>>>>>>>> an explicit column in
>>>>>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto 
>>>>>>>>>>>>>>>>>>>>>>>>> set the column value.
>>>>>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to 
>>>>>>>>>>>>>>>>>>>>>>>>> the trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Also, the timestamp set during computation (like
>>>>>>>>>>>>>>>>>>>>>>>>> streaming ingestion or relatively long batch
>>>>>>>>>>>>>>>>>>>>>>>>> computation) doesn't capture the time the rows/files
>>>>>>>>>>>>>>>>>>>>>>>>> are added to the Iceberg table in a batch fashion.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>>>>>>>>>> longer history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>>>>>> catalog's event log
>>>>>>>>>>>>>>>>>>>>>>>>> for long-term access to timestamp info
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> This is not really consumable by joining the
>>>>>>>>>>>>>>>>>>>>>>>>> regular table query with catalog event log. I would 
>>>>>>>>>>>>>>>>>>>>>>>>> also imagine catalog
>>>>>>>>>>>>>>>>>>>>>>>>> event log is capped at shorter retention (maybe a few 
>>>>>>>>>>>>>>>>>>>>>>>>> months) compared to
>>>>>>>>>>>>>>>>>>>>>>>>> data retention (could be a few years).
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose
>>>>>>>>>>>>>>>>>>>>>>>>>> timestamps at the row level. Timestamps in metadata 
>>>>>>>>>>>>>>>>>>>>>>>>>> that would be carried
>>>>>>>>>>>>>>>>>>>>>>>>>> down to the row level already confuse people that 
>>>>>>>>>>>>>>>>>>>>>>>>>> expect them to be useful
>>>>>>>>>>>>>>>>>>>>>>>>>> or reliable, rather than for debugging. I think 
>>>>>>>>>>>>>>>>>>>>>>>>>> extending this to the row
>>>>>>>>>>>>>>>>>>>>>>>>>> level would only make the problem worse.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> You can already get this information by
>>>>>>>>>>>>>>>>>>>>>>>>>> projecting the last updated sequence number, which 
>>>>>>>>>>>>>>>>>>>>>>>>>> is reliable, and joining
>>>>>>>>>>>>>>>>>>>>>>>>>> to timestamps from the snapshots metadata table. Of 
>>>>>>>>>>>>>>>>>>>>>>>>>> course, the drawback
>>>>>>>>>>>>>>>>>>>>>>>>>> there is losing the timestamp information when 
>>>>>>>>>>>>>>>>>>>>>>>>>> snapshots expire, but since
>>>>>>>>>>>>>>>>>>>>>>>>>> it isn't reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and
>>>>>>>>>>>>>>>>>>>>>>>>>> compliance, are probably better served by embedding 
>>>>>>>>>>>>>>>>>>>>>>>>>> a timestamp in DML
>>>>>>>>>>>>>>>>>>>>>>>>>> (like `current_timestamp`) rather than relying on an 
>>>>>>>>>>>>>>>>>>>>>>>>>> implicit one from
>>>>>>>>>>>>>>>>>>>>>>>>>> table metadata. And for those use cases, you could 
>>>>>>>>>>>>>>>>>>>>>>>>>> also keep a longer
>>>>>>>>>>>>>>>>>>>>>>>>>> history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>>>>>>> catalog's event log for
>>>>>>>>>>>>>>>>>>>>>>>>>> long-term access to timestamp info. I think that 
>>>>>>>>>>>>>>>>>>>>>>>>>> would be better than
>>>>>>>>>>>>>>>>>>>>>>>>>> storing it at the row level.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> For V4 spec, I have a small proposal [1] to
>>>>>>>>>>>>>>>>>>>>>>>>>>> expose the row timestamp concept that can help with 
>>>>>>>>>>>>>>>>>>>>>>>>>>> many use cases like
>>>>>>>>>>>>>>>>>>>>>>>>>>> temporal queries, latency tracking, TTL, auditing 
>>>>>>>>>>>>>>>>>>>>>>>>>>> and compliance.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata
>>>>>>>>>>>>>>>>>>>>>>>>>>> column behaves very similarly to the
>>>>>>>>>>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* for row lineage.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its
>>>>>>>>>>>>>>>>>>>>>>>>>>>    values are persisted in the data files.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
