Hi Yuan,
this indeed sounds weird. The SQL API uses regular DataStream API
windows underneath, so if this were a problem in the implementation it
should have come up much earlier. Is this behavior reproducible on
your local machine?
One thing that comes to my mind is that the "userId" values might not
be 100% identical (i.e., not equal under hashCode/equals); otherwise
they would be grouped properly.
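To illustrate the hypothesis: keys that look identical when printed can still be distinct under equals()/hashCode() and would therefore land in separate groups, each producing its own window result. A minimal sketch (the whitespace and zero-width-space variants below are hypothetical examples, not taken from Yuan's data):

```java
public class KeyEqualityDemo {
    public static void main(String[] args) {
        String a = "user01";
        String b = "user01 ";       // trailing whitespace (invisible in logs)
        String c = "user01\u200B";  // zero-width space (invisible in logs)

        // All three look like "user01" to the eye, but none are equal,
        // so a keyed window operator would treat them as three groups.
        System.out.println(a.equals(b));                   // false
        System.out.println(a.equals(c));                   // false
        System.out.println(a.hashCode() == b.hashCode());  // false
    }
}
```

If this is the cause, printing the key's length or its bytes (e.g. `userId.length()`) for the small-count records should reveal the difference.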
Regards,
Timo
Am 12.07.18 um 09:35 schrieb Yuan,Youjun:
Hi community,
I have a job which counts the number of events every 2 minutes, using a
TumblingWindow in ProcessingTime. However, it occasionally produces
extra DUPLICATED records. For instance, for timestamp 1531368480000
below, it emits a normal result (cnt=1641161), followed by a few more
records with very small counts (2, 3, etc).
Can anyone shed some light on the possible reason, or on how to fix it?
Below is the sample output.
-----------------------------------------------------------
{"timestamp":1531368240000,"cnt":1537821,"userId":"user01"}
{"timestamp":1531368360000,"cnt":1521464,"userId":"user01"}
{"timestamp":*1531368480000*,"cnt":1641161,"userId":"user01"}
{"timestamp":*1531368480000*,"cnt":2,"userId":"user01"}
{"timestamp":*1531368480000*,"cnt":3,"userId":"user01"}
{"timestamp":*1531368480000*,"cnt":3,"userId":"user01"}
And here is the job SQL:
-----------------------------------------------------------
INSERT INTO sink
SELECT
    TUMBLE_START(rowtime, INTERVAL '2' MINUTE) AS `timestamp`,
    COUNT(vehicleId) AS cnt,
    userId
FROM source
GROUP BY
    TUMBLE(rowtime, INTERVAL '2' MINUTE),
    userId
Thanks,
Youjun Yuan