> If you have a window larger than hours then you need to rethink your
> architecture - this is not streaming anymore. Only because you receive events
> in a streamed fashion you don’t need to do all the processing in a streamed
> fashion.
Thanks for the thoughts, I’ll keep that in mind. However, in the test it was
not storing more than two days’ worth of data yet. I’m very much interested in
understanding the root cause of the low performance before moving on to any
major restructuring.

> Can you store the events in a file or a database and then do after 30 days
> batch processing on them?

The 30-day window is only used for deduplication, but it fires for every
incoming event and sends the result downstream, so that we can still get
real-time analytics on the events (a rough sketch of that step is in the
P.S. below).

> Another aspect could be also to investigate why your source sends duplicated
> entries.

The events are not 100% duplicates syntactically; they are only duplicates in
a logical sense. For example, the same person performing the same action
multiple times at different times of the day.

Ning
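P.S. To make the dedup step a bit more concrete, here is roughly what it
amounts to. This is only a simplified sketch written as a Flink-style
KeyedProcessFunction with keyed state and a cleanup timer, not the actual
windowed job; the `Event` class, the field names, and the way duplicates are
flagged downstream are all illustrative assumptions.

```java
import java.time.Duration;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class DedupSketch {

    /** Illustrative event: its logical identity is (person, action). */
    public static class Event {
        public String person;
        public String action;
        public boolean duplicate;   // set by the dedup step
        public Event() {}
    }

    /**
     * Per-event dedup over a 30-day horizon, keyed by person + action.
     * Every event is forwarded immediately (so downstream analytics stay
     * real-time); repeats of the same key within 30 days are only flagged
     * as duplicates rather than dropped.
     */
    public static class Dedup extends KeyedProcessFunction<String, Event, Event> {

        private transient ValueState<Long> firstSeen;

        @Override
        public void open(Configuration parameters) {
            firstSeen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("firstSeen", Long.class));
        }

        @Override
        public void processElement(Event event, Context ctx, Collector<Event> out)
                throws Exception {
            Long seenAt = firstSeen.value();
            if (seenAt == null) {
                // First occurrence of this key: remember when we saw it and
                // schedule state cleanup 30 days later (assumes event-time
                // timestamps are assigned upstream).
                firstSeen.update(ctx.timestamp());
                ctx.timerService().registerEventTimeTimer(
                        ctx.timestamp() + Duration.ofDays(30).toMillis());
            } else {
                event.duplicate = true;
            }
            out.collect(event);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) {
            // 30 days after the first occurrence, drop the key's state so it
            // does not grow without bound.
            firstSeen.clear();
        }
    }
}
```

The stream would be keyed by the logical identity before this step, e.g.
`events.keyBy(e -> e.person + "|" + e.action).process(new DedupSketch.Dedup())`.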