Thanks Jingsong, Jark, Knauf, Seth for sharing your thoughts.

Although we discussed many details about the concept, I think it’s worth to 
clarify the semantic from long term goals. Temporal table concept was first 
imported in SQL:2011, I made some investigation of Temporal Table work 
mechanism in traditional DBMS which implements it like SQL Server[1], 
PostgreSQL[2]
 
In DBMS, Temporal Table is implemented as a pair of tables, a current table and 
a history table.The current table contains the current value of each row, the 
history table contains each previous value for each row. Each row contains a 
time range constructed by RowStartTime and RowEndTime to define the period 
validity of the row. The RowStartTime and RowEndTime is changed by DBMS when a 
DML operation happened, Given a simple temporal table in SQL Server to show how 
it works:
        CREATE TABLE dbo.currency (
                [currency]  VARCHAR(10) NOT NULL PRIMARY KEY,
                [rate] INT,
                [RowStart] DATETIME2 GENERATED ALWAYS AS ROW START,
                [RowEnd] DATETIME2 GENERATED ALWAYS AS ROW END,
                PERIOD FOR SYSTEM_TIME (RowStart, RowEnd)
        ) WITH
        (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.currency_History));

1> select * from currency;                                                      
                 // The initial  test data, the RowEndTime is the max value of 
timestamp type
    currency   rate        RowStart                               RowEnd
    ---------- ----------- -------------------------------------- 
--------------------------------------
    Euro               114            2020-06-29 15:06:24.7459246            
9999-12-31 23:59:59.9999999
    US Dollar       102            2020-06-29 15:06:24.7503288            
9999-12-31 23:59:59.9999999
1> UPDATE dbo.currency SET [rate] = 118 WHERE currency = 'Euro’;   // UPDATE 
Euro currency
2> select * from currency_History;                                              
                    // The history table increased a record that represents the 
validity period of record (Euro,114)
    currency   rate        RowStart                               RowEnd
    ---------- ----------- -------------------------------------- 
--------------------------------------
    Euro               114            2020-06-29 15:06:24.7459246            
2020-06-29 15:07:01.1245406
1> DELETE FROM dbo.currency WHERE currency = 'Euro’;           // DELETE Euro 
currency
1> select * from currency_History;
    currency   rate        RowStart                               RowEnd        
         // The history table also increased another record that represents the 
validity period of record (Euro, 118)
    ---------- ----------- -------------------------------------- 
--------------------------------------
    Euro               114            2020-06-29 15:06:24.7459246            
2020-06-29 15:07:01.1245406
    Euro               118            2020-06-29 15:07:01.1245406            
2020-06-29 15:07:11.2981995
1> select * from currency;
    currency   rate        RowStart                               RowEnd        
         // Current table only keep the latest value 
    ---------- ----------- -------------------------------------- 
--------------------------------------
    US Dollar       102            2020-06-29 15:06:24.7503288            
9999-12-31 23:59:59.9999999

The history table is very important for history version tracking, pleas note 
the DELETE operation also increase a record in history table and the record’s 
RowEndTime is the system time that the DELETE operation happened. In one word, 
temporal table use time range [RowStart, RowEnd) to mark period validity, store 
all versions’ records in history table for history tracking, use DBMS operation 
time to change the RowStart or RowEnd.

Back to our Flink World, temporal table with event time attribute works well in 
data source that contains INSERT, UPDATE messages except DELETE currently. Let 
us see what happened in DELETE message  scenario(i.e. changelog source), both 
DBMS Temporal Table and other general table can capture data change by CDC 
tools and have same format, I used debezuim to capture a SQL server table 
changes:

1> select * from currency;
    currency   rate        RowStart                               RowEnd
    ---------- ----------- -------------------------------------- 
--------------------------------------
    Euro               118            2020-06-29 15:07:01.1245406            
9999-12-31 23:59:59.9999999
    US Dollar          102            2020-06-29 15:06:24.7503288         
9999-12-31 23:59:59.9999999
1> DELETE FROM dbo.currency WHERE currency = 'Euro’;               // DELETE 
Euro currency
1> select * from currency_History;
    currency   rate        RowStart                               RowEnd
    ---------- ----------- -------------------------------------- 
--------------------------------------
    Euro               118            2020-06-29 15:07:01.1245406            
2020-06-29 15:07:11.2981995

{ // The DELETE record produced by  CDC tools(both debezuim and canal are same) 
    "before": {
      "currency": "Euro",
      "rate": 118,
      "RowStart": 1593443221124540600,  //2020-06-29 15:07:01.1245406
      "RowEnd": -4852116231933722724   //9999-12-31 23:59:59.9999999
    },
    "after": null,
    "op": "d”,  // DELETE operation
    "ts_ms": 15934432361354,                //2020-06-29 15:07:16.354, the 
’ts_ms’ value is bigger than the record delete operation time(2020-06-29 
15:07:11.2981995)
    "transaction": null
}

The main problem is that the DELETE record only contains current table message 
which  does not contain the expected RowEnd (2020-06-29 15:07:11.2981995) in 
history table. Without the exact RowEndTime, it’s impossible to obtain exact 
previous version of temporal table in Flink, `ts_ms` filed in CDC record is an 
approximate time of RowEndTime but  it depends on CDC tool status and can not 
equal the RowEndTime from semantics angle.

Current Temporal Table Function supports:
        (1) Define temporal table backed upsert data source with process time
        (2) Define temporal table backed upsert data source with event time
I think  the proposed Temporal table currently could support:
        (1) Define temporal table backed upsert(include delete) data source 
with process time
        (2) Define temporal table backed upsert  data source with event time
        (3) Do not support define temporal table backed  data source that 
contains DELETE message with event time. Because most CDC tools can not obtain 
the exact DELETE operation  time currently, the “ts_ms” field from meta is just 
an approximate time which will break event time semantics. And we can support 
it when CDC tools have the ability to
obtain/extract the DML operation time. 
 
And this has get consensus from me, Jingsong, Jark and Kurt after offline 
discuss, the opinions from Knauf and Seth looks like same with us. 

I’ll prepare a design doc for temporal table, thanks everyone involving and 
please let me know if you have any concern.

Best,
Leonard Xu
[1] 
https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15
 
<https://docs.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver15>
 
[2] https://pgxn.org/dist/temporal_tables/ 
<https://pgxn.org/dist/temporal_tables/> 

 

Reply via email to