Hey Dawid,

>    val sink = new TestingAppendSink
>    tEnv.sqlQuery(sql).toDataStream.addSink(sink)
>    env.execute()
>  }
> 
> and you get the misleading results for timestamp_ltz:
> 
> 
> a,2020-10-10T00:00,2020-10-10T00:00:05,2020-10-09T16:00:04.999Z,4,11.10,5.0,1.0,2,Hi|Comment#1

The reason we saw the misleading result is that TestingAppendSink stores 
timestamp_ltz (Instant type) data in string format, and we read it back from 
the storage without considering the session time zone. The recommended 
practice is to store timestamp_ltz data as epoch milliseconds (long format) 
and read it back as a long; if we want to see the timestamp literal, we 
should use the local (session) time zone to translate the epoch milliseconds. 
That is the semantics of the SQL type TIMESTAMP WITH LOCAL TIME ZONE. If you 
dive into the test testEventTimeTumbleWindow, you will find that this test's 
session time zone is Asia/Shanghai, and with this time zone the serialized 
string value 2020-10-09T16:00:04.999Z of the Instant should be translated to 
2020-10-10 00:00:04.999.
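To make that translation concrete, here is a minimal plain java.time sketch 
(not Flink code) that renders the Instant in the Asia/Shanghai session time 
zone:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

public class LtzTranslation {
    public static void main(String[] args) {
        // TIMESTAMP_LTZ stores a point in time; rendering it as a timestamp
        // literal requires the session time zone (Asia/Shanghai is UTC+8).
        Instant instant = Instant.parse("2020-10-09T16:00:04.999Z");
        LocalDateTime local =
                LocalDateTime.ofInstant(instant, ZoneId.of("Asia/Shanghai"));
        System.out.println(local); // 2020-10-10T00:00:04.999
    }
}
```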

Another reason the result is misleading: as a test connector (in-memory 
storage), TestingAppendSink supports a timezone argument, but unfortunately 
it does not handle the Instant type. It serializes an Instant directly as a 
timestamp literal string without considering the session time zone (see 
TestSinkUtil#fieldToString). This is incorrect: if we want to obtain a local 
timestamp literal from an Instant, we should use the local session time zone 
rather than a default zone like UTC+0, and vice versa. Now suppose the test 
storage TestingAppendSink did not support the timestamp_ltz type but only the 
timestamp type, which is a common case for various external storages 
(databases, lake formats, warehouses); its Flink connector should then 
convert timestamp_ltz data to timestamp. Under that assumption, I added a 
method TestSinkUtil#instantToTimestampString(value: Instant, tz: TimeZone), 
and the data stored in TestingAppendSink looks like the following:

a,2020-10-10T00:00,2020-10-10T00:00:05,2020-10-10 00:00:04.999,4,11.10,5.0,1.0,2,Hi|Comment#1
a,2020-10-10T00:00:05,2020-10-10T00:00:10,2020-10-10 00:00:09.999,1,3.33,null,3.0,1,Comment#2
b,2020-10-10T00:00:05,2020-10-10T00:00:10,2020-10-10 00:00:09.999,2,6.66,6.0,3.0,2,Hello|Hi
b,2020-10-10T00:00:15,2020-10-10T00:00:20,2020-10-10 00:00:19.999,1,4.44,4.0,4.0,1,Hi
b,2020-10-10T00:00:30,2020-10-10T00:00:35,2020-10-10 00:00:34.999,1,3.33,3.0,3.0,1,Comment#3
null,2020-10-10T00:00:30,2020-10-10T00:00:35,2020-10-10 00:00:34.999,1,7.77,7.0,7.0,0,null
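For reference, a minimal Java sketch of what such a helper could look like 
(the actual TestSinkUtil method is Scala and may differ in details; the class 
and formatter here are illustrative):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.TimeZone;

public class TestSinkUtilSketch {
    private static final DateTimeFormatter FORMATTER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Hypothetical sketch: render an Instant as a local timestamp literal
    // in the given session time zone, e.g. "2020-10-10 00:00:04.999".
    public static String instantToTimestampString(Instant value, TimeZone tz) {
        return LocalDateTime.ofInstant(value, tz.toZoneId()).format(FORMATTER);
    }

    public static void main(String[] args) {
        System.out.println(instantToTimestampString(
                Instant.parse("2020-10-09T16:00:04.999Z"),
                TimeZone.getTimeZone("Asia/Shanghai")));
        // 2020-10-10 00:00:04.999
    }
}
```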


Having clarified the misleading result, let me explain with an example why we 
chose the SQL TIMESTAMP type for window_start and window_end instead of 
TIMESTAMP_LTZ. Since you live in a DST time zone area, I believe you know the 
two special days: one day has only 23 hours and one day has 25 hours. Now 
imagine this user scenario: a user has a cumulate window with step 4 hours 
and size 24 hours (1 day). What are the expected window_start and window_end 
semantics in this case? Is it a timestamp literal spanning 24 hours, or an 
instant interval of 24 hours? The timestamp literal range from 2021-03-14 
00:00:00 to 2021-03-15 00:00:00 can represent one day containing only 23 
hours (daylight saving time in Los Angeles starts at 2021-03-14 02:00:00), 
but an instant interval of 24 hours runs from 1615708800000L (2021-03-14 
00:00:00) to 1615795200000L (2021-03-15 01:00:00) — the window_end is 
incorrect here, right? What users want is the timestamp-literal interval 
window, not the instant-interval window. That is the window boundary 
semantics as I understand it, and that is why we always choose the SQL 
TIMESTAMP type for window_start and window_end.
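The divergence is easy to reproduce with plain java.time (this is a sketch of 
the arithmetic, not Flink API): on 2021-03-14 the America/Los_Angeles day has 
only 23 hours, so adding an instant-based 24 hours and adding a calendar day 
give different window ends.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DstWindowBoundary {
    public static void main(String[] args) {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        ZonedDateTime windowStart =
                LocalDateTime.parse("2021-03-14T00:00").atZone(la);
        System.out.println(windowStart.toInstant().toEpochMilli());
        // 1615708800000

        // Instant interval of 24 hours: crosses the DST gap, so the
        // timestamp literal drifts to 01:00 the next day.
        System.out.println(windowStart.plus(Duration.ofHours(24)).toLocalDateTime());
        // 2021-03-15T01:00

        // Calendar day (timestamp-literal boundary): the midnight boundary
        // users expect from a daily window.
        System.out.println(windowStart.plusDays(1).toLocalDateTime());
        // 2021-03-15T00:00
    }
}
```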

Best,
Leonard
