Hi Yuan,

I haven't heard of this problem before. Which Flink version are you using? The cause may be:
1. The userIds are not 100% identical, e.g. one of them contains invisible characters.
2. The machine clock jumped.
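For cause 1, one quick way to rule out invisible characters is to print the exact code points of each key. Just a rough sketch, assuming userId is a plain String:

    // Dump every code point of a key so that invisible characters
    // (zero-width spaces, BOMs, trailing whitespace, ...) become visible.
    public class KeyDump {
        static String dump(String userId) {
            StringBuilder sb = new StringBuilder();
            userId.codePoints()
                  .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
            return sb.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(dump("user01"));        // U+0075 U+0073 U+0065 U+0072 U+0030 U+0031
            System.out.println(dump("user01\u200B"));  // same, plus an extra U+200B at the end
        }
    }

If two "identical" userIds produce different dumps, that would explain the extra groups.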
Otherwise, there may be a bug we don't know about.

Best,
Hequn

On Thu, Jul 12, 2018 at 8:00 PM, Yuan,Youjun <yuanyou...@baidu.com> wrote:

> Hi Timo,
>
> This problem happens 4-5 times a day on our online server, under a load of
> ~15k events per second, and it is using PROCESSING time. So I don't think I
> can stably reproduce the issue on my local machine.
>
> The user ids are actually the same; I have double-checked that.
>
> Now I am wondering: could it be possible that, after a window fires, a few
> last events arrive that still fall into the time range of the just-fired
> window, so new windows are assigned and fired later? That would explain why
> the extra records always contain only a few events (cnt is small).
>
> To verify that, I modified the SQL to also output the MIN timestamp of each
> window, and I found that the MIN timestamp of the *extra records always
> points to the LAST second of the window*.
>
> Here is the output I just got; note that *1531395119* is the last second of
> the 2-minute window starting at *1531395000*.
> --------------------------------------------------------------------
> {"timestamp":1531394760000,"cnt":1536013,"userId":"user01","min_sec":1531394760}
> {"timestamp":1531394880000,"cnt":1459623,"userId":"user01","min_sec":1531394879}
> {"timestamp":1531395000000,"cnt":1446010,"userId":"user01","min_sec":1531395000}
> {"timestamp":1531395000000,"cnt":7,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":6,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119}
> {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119}
>
> The modified SQL:
>
> INSERT INTO sink
> SELECT
>     TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`,
>     count(vehicleId) AS cnt,
>     userId,
>     MIN(EXTRACT(EPOCH FROM rowtime)) AS min_sec
> FROM source
> GROUP BY
>     TUMBLE(rowtime, INTERVAL '2' MINUTE),
>     userId
>
> Thanks,
> Youjun
>
> From: Timo Walther <twal...@apache.org>
> Sent: Thursday, July 12, 2018 5:02 PM
> To: user@flink.apache.org
> Subject: Re: TumblingProcessingTimeWindow emits extra results for a same window
>
> Hi Yuan,
>
> this sounds indeed weird. The SQL API uses regular DataStream API windows
> underneath, so this problem should have come up earlier if it were a problem
> in the implementation. Is this behavior reproducible on your local machine?
>
> One thing that comes to my mind is that the "userId"s might not be 100%
> identical (same hashCode/equals method), because otherwise they would be
> properly grouped.
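> For example, if the grouping key were a POJO rather than a plain string,
> missing equals/hashCode would put logically equal keys into different
> groups. Just an illustrative sketch (not your actual code) of what a
> correct key class needs:
>
>     import java.util.Objects;
>
>     // A grouping key must implement equals() and hashCode() consistently;
>     // otherwise Object's identity-based versions are used, and logically
>     // equal keys end up in different groups.
>     public class UserKey {
>         public String userId;
>
>         @Override
>         public boolean equals(Object o) {
>             return o instanceof UserKey
>                     && Objects.equals(userId, ((UserKey) o).userId);
>         }
>
>         @Override
>         public int hashCode() {
>             return Objects.hashCode(userId);
>         }
>     }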
> Regards,
> Timo
>
> On 12.07.18 09:35, Yuan,Youjun wrote:
>
> Hi community,
>
> I have a job which counts the number of events every 2 minutes, with a
> TumblingWindow in ProcessingTime. However, it occasionally produces extra
> DUPLICATED records. For instance, for timestamp 1531368480000 below, it
> emits a normal result (cnt=1641161), followed by a few more records with
> very small counts (2, 3, etc.).
>
> Can anyone shed some light on the possible reason, or how to fix it?
>
> Below is the sample output.
> -----------------------------------------------------------
> {"timestamp":1531368240000,"cnt":1537821,"userId":"user01"}
> {"timestamp":1531368360000,"cnt":1521464,"userId":"user01"}
> {"timestamp":1531368480000,"cnt":1641161,"userId":"user01"}
> {"timestamp":1531368480000,"cnt":2,"userId":"user01"}
> {"timestamp":1531368480000,"cnt":3,"userId":"user01"}
> {"timestamp":1531368480000,"cnt":3,"userId":"user01"}
>
> And here is the job SQL:
> -----------------------------------------------------------
> INSERT INTO sink
> SELECT
>     TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`,
>     count(vehicleId) AS cnt,
>     userId
> FROM source
> GROUP BY TUMBLE(rowtime, INTERVAL '2' MINUTE),
>     userId
>
> Thanks,
> Youjun Yuan
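By the way, you can double-check which window an element belongs to with the tumbling-window start arithmetic. A simplified sketch (ignoring window offsets; not the exact Flink internal code):

    // Which 2-minute tumbling window does a timestamp fall into?
    public class WindowStart {
        public static void main(String[] args) {
            long windowSize  = 2 * 60 * 1000L;       // window size in ms
            long timestamp   = 1531395119000L;       // epoch second 1531395119
            long windowStart = timestamp - (timestamp % windowSize);
            System.out.println(windowStart);         // prints 1531395000000
        }
    }

So second 1531395119 does belong to the window starting at 1531395000, which matches the min_sec values you observed.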