Hello All, Firstly I am using beam java sdk 2.23.0.
I have a use case where I continuously read streaming data from Kafka and dump output to Elasticsearch after doing a bunch of PTransforms. One such transform depends on the number of requests we have seen so far in the last one hour (Last one hour since current time when data is being processed) for a particular key. I have defined transform like following ReadTransform --->Apply_Sliding_window_1Hr_every_1Min--->generate_KVTranform-->StatefulProcessingTransform(where state maintains number of requests seen so far)---->Apply_fixed_window_5secs--->GroupByKey.create()---->Remove_DupFromSlidingWindows--->WriteLogToES 1. I am applying a sliding window of 1 hour (every 1min) so that I can get a snapshot of the last 1 hr. 2. Stateful Processing Transform adds last hour req count based on state maintained in a window per key. 3. NOW--since multiple overlapping windows can process and produce same Log msg, I do following additional steps to emit only one unique log: a) Firstly in stateful processing.. I add current window start time to each log b) Apply fixed window 5 secs c). GroupByCreate() --> which should be producing max 60 outputs d) In Remove_DupFromSlidingWindows I sort Iterable based on each log's window start time (added in (a)) and emit log from window with earliest timestamp which will give most accurate snapshot of last 1 hour req count. I am not sure if performance wise I am solving my use case correctly. I want to avoid these additional steps, especially GroupBy.Create() and then deDup. Is there any better way of solving this use case ? Any advice will be greatly appreciated. Thanks and Regards Mohil