Hi Timo, This problem happens 4-5 times a day on our online server, with ~15k events per second load, and it is using PROCESSING time. So I don’t think I can stably reproduce the issue on my local machine. The user ids are actually the same, I have doubled checked that.
Now, I am wondering could it possible that, after a window fires, some last events came but that still fall to the time range of the just fired window, hence new windows are assigned, and fired later. This can explain why the extra records always contain only a few events (cnt is small). To verify that, I just modified the SQL to also output the MIN timestamp of each windows, and I found the MIN timestamp of the extra records are always point to the LAST second of the window. Here is the output I just got, note 1531395119 is the last second of a 2-minute window start from 1531395000. -------------------------------------------------------------------------------------------------------------------------------- {"timestamp":1531394760000,"cnt":1536013,"userId":"user01","min_sec":1531394760} {"timestamp":1531394880000,"cnt":1459623,"userId":"user01","min_sec":1531394879} {"timestamp":1531395000000,"cnt":1446010,"userId":"user01","min_sec":1531395000} {"timestamp":1531395000000,"cnt":7,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":6,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119} {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119} The modified SQL: INSERT INTO sink SELECT TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`, count(vehicleId) AS cnt, userId, MIN(EXTRACT(EPOCH FROM rowtime)) AS min_sec FROM source GROUP BY TUMBLE(rowtime, INTERVAL '2' MINUTE), userId thanks Youjun 发件人: Timo Walther <twal...@apache.org> 发送时间: Thursday, July 12, 2018 5:02 PM 收件人: user@flink.apache.org 主题: Re: TumblingProcessingTimeWindow emits extra results for a same window Hi Yuan, this sounds indeed weird. The SQL API uses regular DataStream API windows underneath so this problem should have come up earlier if this is problem in the implementation. Does this behavior reproducible on your local machine? One thing that comes to my mind is that the "userId"s might not be 100% identical (same hashCode/equals method) because otherwise they would be properly grouped. Regards, Timo Am 12.07.18 um 09:35 schrieb Yuan,Youjun: Hi community, I have a job which counts event number every 2 minutes, with TumblingWindow in ProcessingTime. However, it occasionally produces extra DUPLICATED records. For instance, for timestamp 1531368480000 below, it emits a normal result (cnt=1641161), and then followed by a few more records with very small result (2, 3, etc). Can anyone shed some light on the possible reason, or how to fix it? Below are the sample output. ----------------------------------------------------------- {"timestamp":1531368240000,"cnt":1537821,"userId":"user01"} {"timestamp":1531368360000,"cnt":1521464,"userId":"user01"} {"timestamp":1531368480000,"cnt":1641161,"userId":"user01"} {"timestamp":1531368480000,"cnt":2,"userId":"user01"} {"timestamp":1531368480000,"cnt":3,"userId":"user01"} {"timestamp":1531368480000,"cnt":3,"userId":"user01"} And here is the job SQL: ----------------------------------------------------------- INSERT INTO sink SELECT TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`, count(vehicleId) AS cnt, userId FROM source GROUP BY TUMBLE(rowtime, INTERVAL '2' MINUTE), userId Thanks, Youjun Yuan