Hey everyone, I've been thinking about reprocessing <http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html> when my job has windowed state <http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation> and I have a few questions.
Context: I have a series of physical sensors that stream partial scans of their surroundings over the course of ~5-10 minutes (at the end of those 5-10 minutes, a sensor has provided a full scan of its surroundings and starts over from the beginning). Each packet of data has a timestamp that we're just going to have to trust is 'close enough.' When processing in real time, it's very natural to window the data every 5 minutes (wall clock) and merge it into a complete view of the sensors' collective surroundings. For our purposes, data arriving more than 5 minutes late is no longer useful for many applications.

Now, I'd love to be able to reprocess data, especially by increasing parallelism and processing quickly, but this seems incompatible with wall-clock-based windowed state. I could base my windowing/binning on the timestamps provided by my input data, but then I have to be careful to handle two cases: some of my data may be arbitrarily delayed, and one partition may get significantly ahead of the others (less interesting surroundings and faster computations), which makes it difficult to wait for a majority of partitions to reach a certain point in time.

Does anyone have experience with windowing and reprocessing? Any literature recommendations?

Thanks!
Geoffry
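
P.S. To make the timestamp-based option concrete, here is a rough sketch of the kind of binning I have in mind. It is plain Python rather than Samza code, and every name in it (EventTimeWindower, process, watermark, ALLOWED_LATENESS_MS) is made up for illustration. It assumes that a bin may be closed once the slowest partition has moved past the end of the bin plus a lateness allowance, which I have set to 5 minutes to match the cutoff above.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000            # 5-minute bins, as in the real-time job
ALLOWED_LATENESS_MS = 5 * 60 * 1000  # assumption: reuse the 5-minute usefulness cutoff


class EventTimeWindower:
    """Bins packets by their own timestamps instead of the wall clock (hypothetical)."""

    def __init__(self, partitions):
        # Highest event timestamp seen so far on each input partition.
        self.max_ts = {p: None for p in partitions}
        # Bin start time -> payloads collected for that bin.
        self.open_windows = defaultdict(list)
        self.dropped = 0

    def watermark(self):
        # Low watermark: progress of the slowest partition.
        # Undefined (None) until every partition has delivered something.
        if any(ts is None for ts in self.max_ts.values()):
            return None
        return min(self.max_ts.values())

    def _is_closed(self, window_start, wm):
        return wm is not None and window_start + WINDOW_MS + ALLOWED_LATENESS_MS <= wm

    def process(self, partition, timestamp_ms, payload):
        """Add one packet; return the bins that are now complete, oldest first."""
        window_start = timestamp_ms - timestamp_ms % WINDOW_MS
        if self._is_closed(window_start, self.watermark()):
            self.dropped += 1  # bin was already emitted, so the packet is too late
        else:
            self.open_windows[window_start].append(payload)

        prev = self.max_ts[partition]
        self.max_ts[partition] = timestamp_ms if prev is None else max(prev, timestamp_ms)

        wm = self.watermark()
        ready = sorted(s for s in self.open_windows if self._is_closed(s, wm))
        return [(s, self.open_windows.pop(s)) for s in ready]
```

Because nothing here reads the wall clock, replaying old data quickly should produce the same bins as the original run. The catch is exactly the one I described: the watermark is the minimum over partitions, so a fast partition keeps accumulating open bins until the slowest one catches up, and a partition that goes silent stalls everything.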