Removing Duplicates from Sliding Window

Mohil Khare Tue, 18 Aug 2020 21:02:54 -0700

Hello All,

Firstly I am using beam java sdk 2.23.0.


I have a use case where I continuously read streaming data from Kafka and
dump output to Elasticsearch after doing a bunch of PTransforms.

One such transform depends on the number of requests we have seen so far in
the last one hour (Last one hour since current time when data is being
processed) for a particular key.

I have defined transform like following

ReadTransform
--->Apply_Sliding_window_1Hr_every_1Min--->generate_KVTranform-->StatefulProcessingTransform(where
state maintains number of requests seen so
far)---->Apply_fixed_window_5secs--->GroupByKey.create()---->Remove_DupFromSlidingWindows--->WriteLogToES


1. I am applying a sliding window of 1 hour (every 1min) so that I can get
a snapshot of the last 1 hr.
2. Stateful Processing Transform adds last hour req count based on state
maintained in a window per key.
3. NOW--since multiple overlapping windows can process and produce same Log
msg, I do following additional steps to emit only one unique log:
       a) Firstly in stateful processing.. I add current window start time
to each log
       b) Apply fixed window 5 secs
       c). GroupByCreate() --> which should be producing max 60 outputs
       d) In Remove_DupFromSlidingWindows I sort Iterable based on each
log's window start time (added in (a)) and emit log from window with
earliest timestamp which will give most accurate snapshot of last 1 hour
req count.

I am not sure if performance wise I am solving my use case correctly. I
want to avoid these additional steps, especially GroupBy.Create() and then
deDup.

Is there any better way of solving this use case ?
Any advice will be greatly appreciated.

Thanks and Regards
Mohil

Removing Duplicates from Sliding Window

Reply via email to