Hey everyone,

I've been thinking about reprocessing
<http://samza.apache.org/learn/documentation/0.7.0/jobs/reprocessing.html> when
my job has windowed state
<http://samza.apache.org/learn/documentation/0.7.0/container/state-management.html#windowed-aggregation>
and
I have a few questions.

Context: I have a series of physical sensors that stream partial scans of
their surroundings over the course of ~5-10 minutes (at the end of 5-10
minutes its provided a full scan of its surroundings and starts over from
the start). Each packet of data has a timestamp that we're just going to
have to trust is 'close enough.' When processing in real-time it's very
natural to window the data every 5 minutes (wall clock) and merge into a
complete view of their collective surroundings. For our purposes, old data
arriving > 5 minutes late is no longer useful for many applications.

Now, I'd love to be able to reprocess data, especially by increasing
parallelism and processing quickly, but this seems incompatible with using
wall-clock-based windowed state. I could base my windowing/binning on the
timestamps provided by my input data, but then I have to be careful to
handle cases where some of my data may be arbitrarily delayed and the
possibility that one partition will get significantly ahead of other ones
(less interesting surroundings and faster computations) which makes waiting
for a majority of partitions to get to a certain point in time difficult.

Does anyone have experience with windowing and reprocessing? Any literature
recommendations?

Thanks!
Geoffry

Reply via email to