In my example, the expected results for "SELECT * FROM t" are:

v1: CREATE TABLE t (a)           {}
v2: INSERT INTO t VALUES (1)     {a=1}
v3: ALTER TABLE t ADD COLUMN b   {a=1, b=null}
v4: UPDATE t SET b = 2           {a=1, b=2}

The problem is that the state v3 is reachable only if it is the last operation on the table (i.e., no v4 yet). But after v4 is applied, v3 is no longer reachable. This is counter-intuitive, because users will face scenarios like: "I created this report with that SQL yesterday and would like to re-execute it again on that same data ... Column not found error". Time travel in virtually all systems comes with a long list of limitations, so it is expected that certain scenarios will not work. However, the fewer counter-intuitive scenarios, the better.

The problem arises because, on the one hand, snapshots track only physical data changes, but on the other hand these physical changes are also associated with logical schemas. And from the user's perspective, DDL operations can change data even if they do not change physical files. ADD COLUMN is one example; DROP COLUMN is another: all physical files are the same, but the table state is different. A column data type change is yet another example.

My question arises because the Iceberg design already properly ties the logical schema to the physical layout; this is why Snapshot contains schemaId. But for some reason, certain schema changes that change data from the user's perspective do not produce new snapshots. And it seems that this could be fixed easily with a small change to the spec and the UpdateSchema operation, forcing it to create a new snapshot (a lightweight operation), because the data is actually *changed* even though all physical files stay the same, and any reasonable engine already reflects this for as long as it is the last operation on the table.

Surely, this could be handled at the engine level as well. But engine capabilities are rarely aligned.

On Tue, May 13, 2025 at 2:30 AM Brian Hulette <bhule...@apache.org> wrote:

> I've also been a bit confused by this edge case. Is {a: 1, b: NULL} the
> correct result for querying at "time" v3?
>
> If so, I agree it's a bit finicky for engines to produce that result. It
> can't be determined from the snapshot-log alone; IIUC they'll need to look
> at the metadata-log to find the active schema at that time.
>
> On Mon, May 12, 2025 at 3:42 PM Ryan Blue <rdb...@gmail.com> wrote:
>
>> Snapshots are created when data changes, and there is no change to the
>> data tree at “time” v3. If you want to create new snapshots when the
>> schema changes it is alright to do it, but I don’t think that we need to
>> require it in the spec.
>>
>> Also, it isn’t clear to me why the time travel query would resolve * as
>> a and b when time traveling. That error shows that there is an
>> inconsistent schema between time(v3) and the current schema (where b
>> comes from). What happens when you run `SELECT * FROM t FOR TIMESTAMP
>> time(v2)`?
>>
>> Ryan
>>
>> On Fri, May 9, 2025 at 4:03 AM Vladimir Ozerov <voze...@querifylabs.com>
>> wrote:
>>
>>> Hi,
>>>
>>> When working on time travel in Trino, I noticed behavior that seems
>>> counter-intuitive from the user's perspective. Consider the following
>>> sequence of SQL queries executed from Trino (it could be any other
>>> query engine). Here we create a table, populate it with some data, and
>>> then add another column. The third operation is implemented in Trino
>>> via Table.updateSchema().addColumn(...).commit().
>>>
>>> CREATE TABLE t (a);         // v1
>>> INSERT INTO t VALUES (1);   // v2
>>> ALTER TABLE t ADD COLUMN b; // v3
>>> UPDATE t SET b = 2;         // v4
>>>
>>> The trick is that the third operation creates a schema but *doesn't*
>>> create a snapshot.
>>> After that, we end up with the following table metadata:
>>>
>>> {
>>>   snapshots: {
>>>     v1: { schemaId: 0 }  // Initial create
>>>     v2: { schemaId: 0 }  // Insert using the old schema
>>>     v4: { schemaId: 1 }  // Update using the new schema
>>>   }
>>>   schemaIds: { 0, 1 }
>>>   currentSnapshot: v4
>>>   currentSchema: 1
>>> }
>>>
>>> What happens next in Trino:
>>>
>>> SELECT * FROM t;                  // Executed on { snapshot=v4, schemaId=1 },
>>>                                   // returns { a=1, b=2 } as expected
>>> SELECT * FROM t FOR TIMESTAMP v3; // Executed on { snapshot=v2, schemaId=0 },
>>>                                   // throws exception: column "b" is not found
>>>
>>> That is, we cannot query the table after the new column is added but
>>> before it is updated. This happens because there is no dedicated
>>> snapshot for the UpdateSchema action. To make things even more
>>> confusing, if we do not have the operation v4, then we will be able to
>>> query the v3 state, because engines will basically "reconstruct" it as
>>> a virtual snapshot using TableMetadata.currentSnapshot and
>>> TableMetadata.currentSchema:
>>>
>>> SELECT * FROM t;                  // Executed on { snapshot=v2, schemaId=1 },
>>>                                   // returns { a=1, b=null } as expected
>>> SELECT * FROM t FOR TIMESTAMP v3; // Executed on { snapshot=v2, schemaId=0 },
>>>                                   // throws exception: column "b" is not found
>>>
>>> My question is: shouldn't we always implicitly add new snapshots for
>>> all schema changes? Otherwise we cannot read all existing table states,
>>> which is confusing. Of course, this could be handled on the engine
>>> side, but that would lead to inconsistent behaviour between different
>>> engines, depending on their internal implementation details.
>>>
>>> Regards,
>>> --
>>> *Vladimir Ozerov*
>>>
>>

--
*Vladimir Ozerov*
Founder
querifylabs.com
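P.S. To make the proposed fix concrete: the resolution Brian describes (consulting the metadata-log in addition to the snapshot-log) can be sketched as follows. This is a minimal, hypothetical simulation of the example above, not the Iceberg API; the log shapes and the `resolve` helper are assumptions for illustration only.

```python
# Hypothetical sketch: resolve a time-travel query by pairing the latest
# snapshot at the requested time with the schema active at that time,
# taken from the metadata-log rather than from the snapshot itself.

SNAPSHOT_LOG = [  # (timestamp, snapshot_id, schema_id recorded in the snapshot)
    (1, "v1", 0),  # CREATE TABLE t (a)
    (2, "v2", 0),  # INSERT INTO t VALUES (1)
    (4, "v4", 1),  # UPDATE t SET b = 2
]

METADATA_LOG = [  # (timestamp, current schema_id after the metadata change)
    (1, 0),  # initial schema (a)
    (3, 1),  # ALTER TABLE t ADD COLUMN b -- no snapshot is created here
]

SCHEMAS = {0: ["a"], 1: ["a", "b"]}


def resolve(ts):
    """Return (snapshot_id, schema_id) for a time-travel query at ts."""
    # Latest snapshot committed at or before ts.
    snapshot = max((s for s in SNAPSHOT_LOG if s[0] <= ts), key=lambda s: s[0])
    # Schema active at ts, from the metadata-log, which may be newer than
    # the schema recorded in the snapshot.
    schema_id = max((m for m in METADATA_LOG if m[0] <= ts), key=lambda m: m[0])[1]
    return snapshot[1], schema_id


# At "time" v3 the data is still snapshot v2, but the active schema is 1,
# so SELECT * exposes columns a and b (with b reading as NULL).
print(resolve(3))  # -> ('v2', 1)
print(resolve(2))  # -> ('v2', 0)
print(resolve(4))  # -> ('v4', 1)
```

Requiring a snapshot per schema change would make this pairing trivial (every reachable state has its own snapshot); the sketch shows the extra metadata-log lookup each engine otherwise has to implement itself.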