Hi Benjamin,
  I'm trying to employ samza for a similar use case and this is what I did
to mitigate this:
1> I have a notion of timestamp in the messages itself that I listen to.
This way, as I get messages, I can maintain state by time period of
aggregation by attaching the time period to the key of my state store. Ex.,
if I want hourly aggregations,
I store keys by getHourFromTimestamp(message.timestamp):message.key ->
state .
2> I configure a window every 10/20 mins and using samza's KVStore's
range() iterator, combined with Kafka's guaranteed in order delivery of
messages, I can ensure to flush out only aggregates of previous timestamps
once I have seen atleast one message from the next timeperiod.

I saw various issues with having control messages to dictate end of time
periods like when you have multiple producers producing to your given input
topic, how would you ensure your control message appears after all the
producers have produced for that time period? Or if the control message
happens to be lost etc., how would you ensure reliability? I have found
control messages more useful only for adhoc aggregations that may depend on
factors other than time or size of batch. but of course, it all depends on
use case.

I'm also interested to know if anyone else has solved this problem with any
other method?

Thanks,
Karthik


Thanks,
Karthik

On Sun, Feb 15, 2015 at 10:51 AM, Benjamin Edwards <edwards.b...@gmail.com>
wrote:

> Hi
>
> Based on what I can see in the run loop class, there are a few things that
> seem a little problematic for windowed processing with respect to time:
>
> 1) No ability to schedule *when* on an interval you might start. For
> instance, if you wanted to process a window on the hour, every hour, there
> is no way to do this.
>
> 2) You don't get passed the time. I guess this is simply due to the fact
> that the window isn't really trying to keep up, or pin itself to a given
> phase. If you get behind, well tough. You just added some phase to your
> series.
>
> What do people normally do to mitigate this? I was thinking that rather
> than using the Windowed task I would simply have the producer use a timer
> and once a period send a control message with the time stamp. This would
> indicate to my task that period was up and state should be flushed to db,
> aggregated to another stream etc..
>
> Note that I am not trying to do real time processing with hard constraints,
> or anything like that, I just need things that mostly happened within a
> given frame to get grouped and most importantly for things to happen "on
> the minute" or "on the hour" etc.
>
> Ben
>

Reply via email to