> If you have a window larger than hours, then you need to rethink your 
> architecture; this is not streaming anymore. Just because you receive events 
> in a streamed fashion doesn't mean you need to do all the processing in a 
> streamed fashion.

Thanks for the thoughts; I'll keep that in mind. However, in the test it was 
not storing more than two days' worth of data yet. I'm very much interested in 
understanding the root cause of the low performance before moving on to any 
major restructuring.

> Can you store the events in a file or a database and then do batch 
> processing on them after 30 days?

The 30-day window is used only for deduplication, but it fires for every 
event and sends the result downstream, so that we can still get real-time 
analytics on the events.
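
For concreteness, per-event deduplication over a 30-day horizon can be 
sketched roughly like this (a Flink-style example using keyed state with a 
TTL instead of buffering events; class and state names are illustrative, not 
our actual code):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Flags every incoming event as duplicate/non-duplicate within 30 days. */
public class DedupFunction<T> extends KeyedProcessFunction<String, T, Tuple2<T, Boolean>> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        // Keyed state expires 30 days after it is written, which bounds the
        // "window" without buffering the events themselves.
        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(30))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen", Boolean.class);
        desc.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(T event, Context ctx,
                               Collector<Tuple2<T, Boolean>> out) throws Exception {
        boolean duplicate = seen.value() != null;
        if (!duplicate) {
            seen.update(true);   // first occurrence starts the 30-day TTL clock
        }
        // Every event is forwarded immediately with a duplicate flag, so
        // downstream analytics stay real-time.
        out.collect(Tuple2.of(event, duplicate));
    }
}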

> Another aspect could also be to investigate why your source sends duplicate 
> entries.

They are not 100% duplicates syntactically; the events are duplicates only in 
a logical sense. For example, the same person performing the same action 
multiple times at different times of day.
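
In other words, the dedup key covers only the logical identity and ignores 
the timestamp. A hypothetical sketch (the event shape and field names are 
made up):

// Hypothetical event shape; only userId and action form the logical identity.
record Event(String userId, String action, long timestamp) {}

// Two events with the same user and action map to the same dedup key,
// even when their timestamps differ, so they count as logical duplicates.
static String dedupKey(Event e) {
    return e.userId() + "|" + e.action();
}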

Ning
