Hi Yuan,

I haven't heard of this issue before. Which Flink version are you using? The
cause may be:
1. The userId values are not 100% identical, for example they contain
invisible characters.
2. The machine clock jumped (clock skew).

Otherwise, this may be a bug we are not yet aware of.
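As a quick check for cause 1, here is a small sketch (plain Java, hypothetical keys) that dumps the Unicode code points of two seemingly identical userId values, so invisible characters become visible:

```java
public class KeyCheck {
    // Render the Unicode code points of a key so that invisible characters
    // (zero-width spaces, BOMs, trailing whitespace) become visible.
    static String codePoints(String key) {
        StringBuilder sb = new StringBuilder();
        key.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String a = "user01";
        String b = "user01\u200B"; // renders the same, but ends in a zero-width space
        System.out.println(codePoints(a)); // U+0075 U+0073 U+0065 U+0072 U+0030 U+0031
        System.out.println(codePoints(b)); // ... U+0030 U+0031 U+200B
        System.out.println("equals: " + a.equals(b)); // false
    }
}
```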

Best, Hequn

On Thu, Jul 12, 2018 at 8:00 PM, Yuan,Youjun <yuanyou...@baidu.com> wrote:

> Hi Timo,
>
>
>
> This problem happens 4-5 times a day on our online server, under a load of
> ~15k events per second, and the job uses PROCESSING time. So I don’t think
> I can reliably reproduce the issue on my local machine.
>
> The user ids are actually the same; I have double-checked that.
>
>
>
> Now, I am wondering: could it be that, after a window fires, a few last
> events arrive that still fall into the time range of the just-fired
> window, so new windows are assigned for them and fired later? That would
> explain why the extra records always contain only a few events (cnt is
> small).
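For reference, Flink assigns a tumbling window by aligning the timestamp down to a multiple of the window size (see TimeWindow.getWindowStartWithOffset); a minimal sketch of that arithmetic, assuming a zero offset:

```java
public class WindowMath {
    // Tumbling-window assignment aligns the timestamp down to a multiple
    // of the window size (window offset assumed to be zero here).
    static long windowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    public static void main(String[] args) {
        long size = 2 * 60 * 1000; // 2-minute window
        // An event processed in the last second of the window is still
        // assigned to the window that starts at 1531395000000, so a
        // re-assignment after that window already fired would open a fresh
        // window instance with the same start time.
        System.out.println(windowStart(1_531_395_119_000L, size)); // 1531395000000
    }
}
```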
>
>
>
> To verify that, I modified the SQL to also output the MIN timestamp of
> each window, and I found that the MIN timestamp of the *extra records
> always points to the LAST second of the window*.
>
> Here is the output I just got; note that *1531395119* is the last second
> of the 2-minute window starting at *1531395000*.
>
> --------------------------------------------------------------------
>
> {"timestamp":1531394760000,"cnt":1536013,"userId":"user01"
> ,"min_sec":1531394760}
>
> {"timestamp":1531394880000,"cnt":1459623,"userId":"user01"
> ,"min_sec":1531394879}
>
> {"timestamp":*1531395000000*,"cnt":*1446010*,"userId":"user01","min_sec":
> *1531395000*}
>
> {"timestamp":*1531395000000*,"cnt":*7*,"userId":"user01","min_sec":
> *1531395119*}
>
> {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119}
>
> {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119}
>
> {"timestamp":1531395000000,"cnt":6,"userId":"user01","min_sec":1531395119}
>
> {"timestamp":1531395000000,"cnt":3,"userId":"user01","min_sec":1531395119}
>
> {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119}
>
> {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119}
>
> {"timestamp":1531395000000,"cnt":2,"userId":"user01","min_sec":1531395119}
>
>
>
> The modified SQL:
>
> INSERT INTO sink
> SELECT
>     TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`,
>     COUNT(vehicleId) AS cnt,
>     userId,
>     *MIN(EXTRACT(EPOCH FROM rowtime)) AS min_sec*
> FROM source
> GROUP BY
>     TUMBLE(rowtime, INTERVAL '2' MINUTE),
>     userId
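As a sanity check on the numbers above: 1531395119 is indeed the final whole second of the 2-minute window starting at 1531395000, which plain arithmetic confirms:

```java
public class LastSecondCheck {
    public static void main(String[] args) {
        long windowStartSec = 1_531_395_000L;
        long windowSizeSec = 120; // 2-minute tumbling window
        // The window covers [start, start + size); its last whole second
        // is therefore start + size - 1.
        long lastSecond = windowStartSec + windowSizeSec - 1;
        System.out.println(lastSecond); // 1531395119
    }
}
```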
>
>
>
> Thanks,
>
> Youjun
>
>
>
> *From:* Timo Walther <twal...@apache.org>
> *Sent:* Thursday, July 12, 2018 5:02 PM
> *To:* user@flink.apache.org
> *Subject:* Re: TumblingProcessingTimeWindow emits extra results for a same
> window
>
>
>
> Hi Yuan,
>
> this sounds indeed weird. The SQL API uses regular DataStream API windows
> underneath, so this problem would likely have surfaced earlier if it were
> a bug in the implementation. Is this behavior reproducible on your local
> machine?
>
> One thing that comes to my mind is that the "userId"s might not be 100%
> identical (same hashCode/equals method); otherwise they would be grouped
> properly.
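To illustrate the point: two keys that look identical on screen but differ in code points end up in separate groups of any hash-based grouping. A tiny sketch with hypothetical keys:

```java
import java.util.HashMap;
import java.util.Map;

public class GroupingCheck {
    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        // The third key renders the same as "user01" but carries a trailing
        // zero-width space, so hashCode/equals differ and it forms a group
        // of its own.
        String[] keys = {"user01", "user01", "user01\u200B"};
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        System.out.println(counts.size()); // 2 groups instead of 1
    }
}
```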
>
> Regards,
> Timo
>
> On 12.07.18 09:35, Yuan,Youjun wrote:
>
> Hi community,
>
>
>
> I have a job which counts the number of events every 2 minutes, using a
> TumblingWindow in ProcessingTime. However, it occasionally produces extra
> DUPLICATED records. For instance, for timestamp 1531368480000 below, it
> emits a normal result (cnt=1641161), followed by a few more records with
> very small counts (2, 3, etc).
>
>
>
> Can anyone shed some light on the possible reason, or how to fix it?
>
>
>
> Below is the sample output.
>
> -----------------------------------------------------------
>
> {"timestamp":1531368240000,"cnt":1537821,"userId":"user01"}
>
> {"timestamp":1531368360000,"cnt":1521464,"userId":"user01"}
>
> {"timestamp":*1531368480000*,"cnt":1641161,"userId":"user01"}
>
> {"timestamp":*1531368480000*,"cnt":2,"userId":"user01"}
>
> {"timestamp":*1531368480000*,"cnt":3,"userId":"user01"}
>
> {"timestamp":*1531368480000*,"cnt":3,"userId":"user01"}
>
>
>
> And here is the job SQL:
>
> -----------------------------------------------------------
>
> INSERT INTO sink
> SELECT
>     TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`,
>     COUNT(vehicleId) AS cnt,
>     userId
> FROM source
> GROUP BY
>     TUMBLE(rowtime, INTERVAL '2' MINUTE),
>     userId
>
>
>
> Thanks,
>
> Youjun Yuan
>
>
>
