Hi Vladimir,

I understand the problem now. In your example:

v1: CREATE TABLE t (a)            {}
v2: INSERT INTO t VALUES (1)      {a=1}
v3: ALTER TABLE t ADD COLUMN b    {a=1, b=null}
v4: UPDATE t SET b = 2            {a=1, b=2}

there are only 2 snapshots: at v2 and at v4. Let's call the snapshots s1 (at v2) and s2 (at v4). Note that there is no snapshot at v1 either.

At the state v3, the non-time-travel query, SELECT * FROM t, applies the current schema to the current snapshot, s1. However, time-travel queries, whether by snapshot-id or by timestamp, look for the snapshot satisfying the criterion (either the snapshot with that snapshot-id, or the snapshot as of that timestamp) and then use the *schema as of that snapshot*. There is no way to specify the schema as of a timestamp. I think your problem can be solved if we allow for that, *without requiring that schema changes produce new snapshots*. A minimal sketch of today's resolution is below.
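To make the current behavior concrete, here is a rough sketch of how an engine can resolve a timestamp-based read with today's semantics, using Iceberg's Java API (SnapshotUtil, Table, Snapshot). The class and method names below are mine and error handling is omitted; treat it as an illustration, not engine code.

import org.apache.iceberg.Schema;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.util.SnapshotUtil;

class CurrentTimeTravel {
  // Today's FOR TIMESTAMP resolution: pick the snapshot as of the timestamp,
  // then take the schema recorded on that snapshot.
  static Schema schemaFor(Table table, long timestampMillis) {
    long snapshotId = SnapshotUtil.snapshotIdAsOfTime(table, timestampMillis);
    Snapshot snapshot = table.snapshot(snapshotId);
    Integer schemaId = snapshot.schemaId();   // may be null for old metadata
    return schemaId != null ? table.schemas().get(schemaId) : table.schema();
  }
}

For time(v3) this pairs s1 with schema 0, which is exactly why column b is not visible.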
Thus an alternative proposal is: associate a timestamp with each schema, in addition to the schema-id.

- For time travel by snapshot-id (or tag), there is no change in behavior: we find the snapshot and use the schema for that snapshot.
- For time travel by timestamp, we find the schema as well as the snapshot as of the timestamp, and then use that schema with the snapshot. A hypothetical sketch of this path is below.
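Since schema timestamps do not exist in the spec today, the sketch takes them as a plain schema-id -> timestamp map; that map, and the helper names, are assumptions made only to show the lookup. The snapshot lookup itself uses the existing API and is unchanged.

import java.util.Map;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.util.SnapshotUtil;

class ProposedTimeTravel {
  // schemaTimestamps maps schema-id -> the time at which that schema became
  // current. This is the proposed (hypothetical) spec addition; it is not
  // available from Iceberg today.
  static Schema schemaAsOf(Table table, Map<Integer, Long> schemaTimestamps, long timestampMillis) {
    int chosenId = -1;
    long chosenTs = Long.MIN_VALUE;
    for (Map.Entry<Integer, Long> entry : schemaTimestamps.entrySet()) {
      long ts = entry.getValue();
      // keep the latest schema that became current at or before the timestamp
      if (ts <= timestampMillis && ts > chosenTs) {
        chosenId = entry.getKey();
        chosenTs = ts;
      }
    }
    return chosenId >= 0 ? table.schemas().get(chosenId) : table.schema();
  }

  // The snapshot is still resolved exactly as today; only the schema changes.
  static long snapshotAsOf(Table table, long timestampMillis) {
    return SnapshotUtil.snapshotIdAsOfTime(table, timestampMillis);
  }
}

With this, time(v3) would pair snapshot s1 (found exactly as today) with schema 1, giving {a=1, b=null}.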
Without a spec change to associate a timestamp with each schema, we can still find a schema as of the timestamp by going through the metadata-log to find the metadata.json as of that timestamp and reading that metadata.json to find the current-schema-id. But this would be slow, as it involves reading another metadata.json. A sketch of that fallback follows.
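This is only meant to show the extra I/O involved. It reaches for core classes (HasTableOperations, TableMetadataParser) that an engine may prefer not to use directly, the class name is mine, and metadata-log retention limits are ignored.

import org.apache.iceberg.HasTableOperations;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableMetadataParser;
import org.apache.iceberg.TableOperations;

class MetadataLogSchemaLookup {
  // Replay the metadata-log to find the metadata.json current as of the
  // timestamp, then read it to learn the then-current schema.
  static Schema schemaAsOf(Table table, long timestampMillis) {
    TableOperations ops = ((HasTableOperations) table).operations();
    TableMetadata current = ops.current();
    if (current.lastUpdatedMillis() <= timestampMillis) {
      return current.schema();   // the current metadata already applies
    }
    String asOfFile = null;
    // metadata-log entries are ordered oldest to newest; keep the last entry
    // written at or before the timestamp
    for (TableMetadata.MetadataLogEntry entry : current.previousFiles()) {
      if (entry.timestampMillis() <= timestampMillis) {
        asOfFile = entry.file();
      }
    }
    if (asOfFile == null) {
      throw new IllegalArgumentException("no metadata as of " + timestampMillis);
    }
    // the slow part: a second metadata.json read just to learn the schema
    return TableMetadataParser.read(ops.io(), asOfFile).schema();
  }
}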
- Wing Yew

On Tue, May 13, 2025 at 6:59 AM Vladimir Ozerov <voze...@querifylabs.com> wrote:

> In my example, the expected results for "SELECT * FROM t" are:
>
> v1: CREATE TABLE t (a)
> {}
>
> v2: INSERT INTO t VALUES (1)
> {a=1}
>
> v3: ALTER TABLE t ADD COLUMN b
> {a=1, b=null}
>
> v4: UPDATE t SET b = 2
> {a=1, b=2}
>
> The problem is that the state v3 is reachable only if it is the last operation on the table (i.e., no v4 yet). But after v4 is applied, v3 is not reachable any more. This is counter-intuitive because users will face scenarios like "I created this report with that SQL yesterday and would like to re-execute it again on that same data ... Column not found error".
>
> Time-travel in virtually all systems comes with a long list of limitations, so it is expected that certain scenarios will not work. However, the fewer counter-intuitive scenarios, the better. The problem arises because on the one hand snapshots track only physical data changes, but on the other hand these physical changes are also associated with logical schemas. And from the user perspective, DDL operations can change data even if they do not change physical files. ADD COLUMN is one example, DROP COLUMN is another: all physical files are the same, but the table state is different. A column data type change is another example.
>
> My question arises because the Iceberg design already properly ties the logical schema to the physical layout, which is why Snapshot contains schemaId. But for some reason some schema changes that change data from the user perspective do not produce new snapshots. And it seems that this could be fixed easily with a small change to the spec and to the UpdateSchema operation to force it to create a new snapshot (which is a lightweight op), because data is actually *changed* even though all physical files stay the same, and any reasonable engine already reflects this as long as it is the last operation.
>
> Surely, this could be handled at the engine level as well. But engine capabilities are rarely aligned.
>
> On Tue, May 13, 2025 at 2:30 AM Brian Hulette <bhule...@apache.org> wrote:
>
>> I've also been a bit confused by this edge case. Is {a: 1, b: NULL} the correct result for querying at "time" v3?
>>
>> If so, I agree it's a bit finicky for engines to produce that result. It can't be determined from the snapshot-log alone; IIUC they'll need to look at the metadata-log to find the active schema at that time.
>>
>> On Mon, May 12, 2025 at 3:42 PM Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> Snapshots are created when data changes, and there is no change to the data tree at "time" v3. If you want to create new snapshots when the schema changes it is alright to do it, but I don't think that we need to require it in the spec.
>>>
>>> Also, it isn't clear to me why the time travel query would resolve * as a and b when time traveling. That error shows that there is an inconsistent schema between time(v3) and the current schema (where b comes from). What happens when you run `SELECT * FROM t FOR TIMESTAMP time(v2)`?
>>>
>>> Ryan
>>>
>>> On Fri, May 9, 2025 at 4:03 AM Vladimir Ozerov <voze...@querifylabs.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> When working on time-travel in Trino I noticed behavior that seems counter-intuitive from the user perspective. Consider the following sequence of SQL queries executed from Trino (it could be any other query engine). Here we create a table, populate it with some data, and then add another column. The third operation is implemented in Trino via Table.updateSchema().addColumn(...).commit().
>>>>
>>>> CREATE TABLE t (a);            // v1
>>>> INSERT INTO t VALUES (1);      // v2
>>>> ALTER TABLE t ADD COLUMN b;    // v3
>>>> UPDATE t SET b = 2;            // v4
>>>>
>>>> The trick is that the third operation creates a schema, but *doesn't* create a snapshot. After that we end up with the following table metadata:
>>>>
>>>> {
>>>>   snapshots: {
>>>>     v1: { schemaId: 0 }   // Initial create
>>>>     v2: { schemaId: 0 }   // Insert using the old schema
>>>>     v4: { schemaId: 1 }   // Update using the new schema
>>>>   }
>>>>   schemaIds: { 0, 1 }
>>>>   currentSnapshot: v4
>>>>   currentSchema: 1
>>>> }
>>>>
>>>> What happens next in Trino:
>>>>
>>>> SELECT * FROM t;                   // Executed on { snapshot=v4, schemaId=1 }, returns { a=1, b=2 } as expected
>>>> SELECT * FROM t FOR TIMESTAMP v3;  // Executed on { snapshot=v2, schemaId=0 }, throws exception: column "b" is not found
>>>>
>>>> That is, we cannot query the table after the new column is added, but before it is updated. This happens because there is no dedicated snapshot for the UpdateSchema action. To make things even more confusing, if we do not have the operation v4, then we will be able to query the v3 state, because engines will basically "reconstruct" it as a virtual snapshot using TableMetadata.currentSnapshot and TableMetadata.currentSchema:
>>>>
>>>> SELECT * FROM t;                   // Executed on { snapshot=v2, schemaId=1 }, returns { a=1, b=null } as expected
>>>> SELECT * FROM t FOR TIMESTAMP v3;  // Executed on { snapshot=v2, schemaId=0 }, throws exception: column "b" is not found
>>>>
>>>> My question is: shouldn't we always implicitly add new snapshots for all schema changes? Because otherwise we cannot read all existing table states, which is confusing. Of course, this could be handled on the engine side, but this will lead to inconsistent behaviour between different engines depending on their internal implementation details.
>>>>
>>>> Regards,
>>>> --
>>>> *Vladimir Ozerov*
>>>
>
> --
> *Vladimir Ozerov*
> Founder
> querifylabs.com