Thanks for the clarification, Ryan.

For long-running streaming jobs that commit periodically, it is difficult
to establish a constant value of current_timestamp across all writer
tasks for each commit cycle. I guess streaming writers would just have to
write the wall-clock time at the moment a row is appended to a data file
as the value of a current_timestamp default.



On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]> wrote:

> I don't think that every row would have a different value. That would be
> up to the engine, but I would expect engines to insert `CURRENT_TIMESTAMP`
> into the plan and then replace it with a constant, resulting in a
> consistent value for all rows.
>
> You're right that this would not necessarily be the commit time. But
> neither is the commit timestamp from Iceberg's snapshot. I'm not sure how
> we are going to define "good enough" for this purpose. I think at least
> `CURRENT_TIMESTAMP` has reliable and known behavior when you look at how it
> is handled in engines. And if you want the Iceberg timestamp, then use a
> periodic query of the snapshots table to keep track of them in a table you
> can join to. I don't think this rises to the need for a table feature
> unless we can guarantee that it is correct.
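>
> As a rough sketch of that periodic query (the tracking table here is made
> up, and the snapshots metadata table columns vary by engine; Spark exposes
> snapshot_id and committed_at):
>
>   -- snapshot_ts_log(snapshot_id BIGINT, committed_at TIMESTAMP) exists;
>   -- run this periodically, before old snapshots expire
>   INSERT INTO snapshot_ts_log
>   SELECT s.snapshot_id, s.committed_at
>   FROM db.tbl.snapshots s
>   WHERE s.committed_at >
>     (SELECT coalesce(max(committed_at), TIMESTAMP '1970-01-01')
>      FROM snapshot_ts_log);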
>
> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <[email protected]> wrote:
>
>> > Postgres `current_timestamp` captures the transaction start time [1,
>> 2]. Should we extend the same semantics to Iceberg: all rows added in the
>> same snapshot should have the same timestamp value?
>>
>> Let me clarify my last comment.
>>
>> created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP)
>>
>> Since Postgres current_timestamp captures the transaction start time, all
>> rows added in the same insert transaction would have the same value as the
>> transaction timestamp with the column definition above.
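>>
>> For example (a minimal Postgres sketch; the table is hypothetical):
>>
>>   CREATE TABLE events (
>>     id BIGINT,
>>     payload TEXT,
>>     created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP);
>>
>>   -- both rows get the same created_at: the transaction start time
>>   INSERT INTO events (id, payload) VALUES (1, 'a'), (2, 'b');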
>>
>> If we extend similar semantics to Iceberg, should all rows added in the
>> same Iceberg transaction/snapshot have the same timestamp?
>>
>> Ryan, regarding your comment about using a current_timestamp expression
>> as the column default value: were you thinking that the engine would set
>> the column value to the wall-clock time when appending a row to a data
>> file? In that case, almost every row would have a different timestamp
>> value.
>>
>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <[email protected]> wrote:
>>
>>>
>>> The `current_timestamp` expression may not always carry the right
>>> semantics for these use cases. E.g., latency tracking cares about when
>>> records are added/committed to the table, not when a record was appended
>>> to an uncommitted data file in the processing engine. Record creation and
>>> the Iceberg commit can be minutes or even hours apart.
>>>
>>> A row timestamp inherited from the snapshot timestamp has no overhead at
>>> the initial commit and only minimal storage overhead during file rewrites.
>>> A per-row current_timestamp would have distinct values for every row and
>>> hence higher storage overhead.
>>>
>>> OLTP databases deal with small row-level transactions. Postgres
>>> `current_timestamp` captures the transaction start time [1, 2]. Should we
>>> extend the same semantics to Iceberg: all rows added in the same snapshot
>>> should have the same timestamp value?
>>>
>>> [1] https://www.postgresql.org/docs/current/functions-datetime.html
>>> [2]
>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>
>>>
>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <[email protected]>
>>> wrote:
>>>
>>>>
>>>>> Micah, are 1 and 2 the same? 3 is covered by this proposal.
>>>>> To support the created_by timestamp, we would need to implement the
>>>>> following row lineage behavior:
>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>> * During rewrite (like compaction), it should be persisted into data
>>>>> files.
>>>>> * During update, it needs to be carried over from the previous row.
>>>>> This is similar to the row_id carry-over for row updates.
>>>>
>>>>
>>>> Sorry for the shorthand.  These are not the same:
>>>>
>>>> 1.  Insertion time - time the row was inserted.
>>>> 2.  Created by - The system that created the record.
>>>> 3.  Updated by - The system that last updated the record.
>>>>
>>>> Depending on the exact use case, these might or might not have utility.
>>>> I'm just wondering if there will be more examples like this in the future.
>>>>
>>>>
>>>>> The created_by column would likely incur significantly higher storage
>>>>> overhead compared to the updated_by column. As rows are updated over
>>>>> time, the cardinality of this column in data files can be high. Hence,
>>>>> the created_by column may not compress well. This is a similar problem
>>>>> for the row_id column. One side effect of enabling row lineage by
>>>>> default for V3 tables is the storage overhead of the row_id column after
>>>>> compaction, especially for narrow tables with few columns.
>>>>
>>>>
>>>> I agree.  I think this analysis also shows that some consumers of
>>>> Iceberg might not necessarily want to have all these columns, so we might
>>>> want to make them configurable, rather than mandating them for all tables.
>>>> Ryan's thought on default values seems like it would solve the issues I was
>>>> raising.
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <[email protected]> wrote:
>>>>
>>>>> > An explicit timestamp column adds more burden to application
>>>>> developers. While some databases require an explicit column in the schema,
>>>>> those databases provide triggers to auto-set the column value. For
>>>>> Iceberg, the snapshot timestamp is the closest to the trigger timestamp.
>>>>>
>>>>> Since the use cases don't require an exact timestamp, this seems like
>>>>> the best solution to get what people want (an insertion timestamp) that
>>>>> has clear and well-defined behavior. Since `current_timestamp` is defined
>>>>> by the SQL spec, it makes sense to me that we could use it and have
>>>>> reasonable behavior.
>>>>>
>>>>> I've talked with Anton about this before and maybe he'll jump in on
>>>>> this thread. I think that we may need to extend default values to include
>>>>> default value expressions, like `current_timestamp`, as allowed by the
>>>>> SQL spec. That would solve the problem as well as some others (like
>>>>> `current_date` or `current_user`) and would not create a potentially
>>>>> misleading (and heavyweight) timestamp feature in the format.
>>>>>
>>>>> > Also some environments may have stronger clock service, like Spanner
>>>>> TrueTime service.
>>>>>
>>>>> Even in cases like this, commit retries can reorder commits and make
>>>>> timestamps out of order. I don't think that we should be making guarantees
>>>>> or even exposing metadata that people might mistake as having those
>>>>> guarantees.
>>>>>
>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <[email protected]> wrote:
>>>>>
>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>
>>>>>> Regarding the concern for reliable timestamps, we are not proposing
>>>>>> using timestamps for ordering. With NTP in modern computers, they are
>>>>>> generally reliable enough for the intended use cases. Also, some
>>>>>> environments may have a stronger clock service, like Spanner's TrueTime
>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>
>>>>>> >  joining to timestamps from the snapshots metadata table.
>>>>>>
>>>>>> As you also mentioned, it depends on the snapshot history, which is
>>>>>> often retained for only a few days for performance reasons.
>>>>>>
>>>>>> > embedding a timestamp in DML (like `current_timestamp`) rather than
>>>>>> relying on an implicit one from table metadata.
>>>>>>
>>>>>> An explicit timestamp column adds more burden to application
>>>>>> developers. While some databases require an explicit column in the
>>>>>> schema, those databases provide triggers to auto-set the column value.
>>>>>> For Iceberg, the snapshot timestamp is the closest to the trigger
>>>>>> timestamp.
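>>>>>>
>>>>>> For example, Postgres can auto-set an updated_at column with a trigger
>>>>>> (a minimal sketch; the table and function names are made up):
>>>>>>
>>>>>>   CREATE FUNCTION set_updated_at() RETURNS trigger AS $$
>>>>>>   BEGIN
>>>>>>     NEW.updated_at := current_timestamp;
>>>>>>     RETURN NEW;
>>>>>>   END;
>>>>>>   $$ LANGUAGE plpgsql;
>>>>>>
>>>>>>   CREATE TRIGGER events_set_updated_at
>>>>>>     BEFORE UPDATE ON events
>>>>>>     FOR EACH ROW EXECUTE FUNCTION set_updated_at();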
>>>>>>
>>>>>> Also, a timestamp set during computation (like streaming ingestion or a
>>>>>> relatively long batch computation) doesn't capture the time the
>>>>>> rows/files are actually added to the Iceberg table in a batch fashion.
>>>>>>
>>>>>> > And for those use cases, you could also keep a longer history of
>>>>>> snapshot timestamps, like storing a catalog's event log for long-term
>>>>>> access to timestamp info
>>>>>>
>>>>>> This is not really consumable by joining a regular table query with the
>>>>>> catalog event log. I would also imagine a catalog event log is capped at
>>>>>> a shorter retention (maybe a few months) compared to data retention
>>>>>> (which could be a few years).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> I don't think it is a good idea to expose timestamps at the row
>>>>>>> level. Timestamps in metadata that would be carried down to the row
>>>>>>> level already confuse people who expect them to be useful or reliable
>>>>>>> rather than just for debugging. I think extending this to the row
>>>>>>> level would only make the problem worse.
>>>>>>>
>>>>>>> You can already get this information by projecting the last updated
>>>>>>> sequence number, which is reliable, and joining to timestamps from the
>>>>>>> snapshots metadata table. Of course, the drawback there is losing the
>>>>>>> timestamp information when snapshots expire, but since it isn't reliable
>>>>>>> anyway, I'd be fine with that.
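>>>>>>>
>>>>>>> Roughly like this (a sketch only; it assumes the engine exposes the
>>>>>>> metadata column and a sequence_number column on the snapshots metadata
>>>>>>> table, which varies by engine and version):
>>>>>>>
>>>>>>>   SELECT t.*, s.committed_at AS approx_last_updated
>>>>>>>   FROM db.tbl t
>>>>>>>   JOIN db.tbl.snapshots s
>>>>>>>     ON t._last_updated_sequence_number = s.sequence_number;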
>>>>>>>
>>>>>>> Some of the use cases, like auditing and compliance, are probably
>>>>>>> better served by embedding a timestamp in DML (like `current_timestamp`)
>>>>>>> rather than relying on an implicit one from table metadata. And for
>>>>>>> those use cases, you could also keep a longer history of snapshot
>>>>>>> timestamps, like storing a catalog's event log for long-term access to
>>>>>>> timestamp info. I think that would be better than storing it at the row
>>>>>>> level.
>>>>>>>
>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> For the V4 spec, I have a small proposal [1] to expose a row
>>>>>>>> timestamp concept that can help with many use cases like temporal
>>>>>>>> queries, latency tracking, TTL, auditing, and compliance.
>>>>>>>>
>>>>>>>> This *_last_updated_timestamp_ms* metadata column behaves very
>>>>>>>> similarly to the *_last_updated_sequence_number* for row lineage.
>>>>>>>>
>>>>>>>>    - Initially, it inherits from the snapshot timestamp.
>>>>>>>>    - During rewrite (like compaction), its values are persisted in
>>>>>>>>    the data files.
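>>>>>>>>
>>>>>>>> For example, a TTL job could then be a plain query against the
>>>>>>>> proposed column (a Spark SQL sketch; the table name and 90-day cutoff
>>>>>>>> are made up):
>>>>>>>>
>>>>>>>>   DELETE FROM db.tbl
>>>>>>>>   WHERE _last_updated_timestamp_ms <
>>>>>>>>     unix_millis(current_timestamp() - INTERVAL 90 DAYS);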
>>>>>>>>
>>>>>>>> Would love to hear what you think.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>
>>>>>>>>
>>>>>>>>
