Hi Uwe,

Your use case looks to me more like a state-management problem. What comes
to mind is:
1) Every time a song is played, you update the count for that song. You do
not keep the map in memory since, as you said, it could grow quite large.
Instead, you use Samza's built-in key-value storage. (You do all of this in
the process method.)

2) You scan the whole key-value DB at a fixed interval, say every hour. (You
do all of this in the window method.)

* This could also provide better fault tolerance. (For example, if your
machine goes down during the hour, you will not lose any counts, because
the key-value DB can be restored.)
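
A rough sketch of the two steps above, in plain Java so it runs without any
Samza dependency. Everything named here is hypothetical: ChartsSketch is an
illustrative class, the HashMap only stands in for Samza's key-value store,
and process()/window() merely mirror the roles of the task methods rather
than their real signatures. Clearing the counts after each scan is also my
assumption; it gives you a fresh count per interval (a tumbling window), not
a true sliding window.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: models the process/window split without Samza itself.
class ChartsSketch {
    // Stand-in for the key-value store (assumption: a real task would use
    // Samza's store here instead of an in-memory map).
    private final Map<String, Long> store = new HashMap<>();
    private final int topN;

    ChartsSketch(int topN) {
        this.topN = topN;
    }

    // Role of the process method: called once per "song played" event.
    void process(String trackId) {
        store.merge(trackId, 1L, Long::sum);
    }

    // Role of the window method: called once per interval (say, every hour).
    // Scans all counts, returns the top N, then resets the store.
    List<Map.Entry<String, Long>> window() {
        List<Map.Entry<String, Long>> entries = new ArrayList<>();
        for (Map.Entry<String, Long> e : store.entrySet()) {
            entries.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
        }
        // Highest count first; ties broken by track ID for a stable order.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<Map.Entry<String, Long>> top =
                new ArrayList<>(entries.subList(0, Math.min(topN, entries.size())));
        // Assumption: prune everything so the next interval starts from zero.
        store.clear();
        return top;
    }
}
```

You would call process() for each incoming play event and window() once per
interval. Only tracks actually played in the interval ever get an entry, so
the map never holds the full ID space.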

Some relevant links:
* http://samza.apache.org/learn/documentation/0.8/container/state-management.html#windowed-aggregation
* http://samza.apache.org/learn/documentation/0.8/container/state-management.html#approaches-to-managing-task-state
* http://samza.apache.org/learn/documentation/0.8/container/state-management.html#key-value-storage

Hope this helps.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108

On Tue, Feb 17, 2015 at 11:35 AM, Uwe Dauernheim <u...@dauernheim.net> wrote:

> I try to model a music charts system to get familiar with Samza.
> Charts are defined by the top N entries with highest count of a map
> from unique track ID, basically a song, to counter, basically the
> amount of plays of this entity, during a sliding time-window.
>
> The problem I see is that of an evergrowing size of this map as the ID
> space of tracks can be quite large (let's pick 2E6). Not all of these
> IDs will be played (thus should be counted) within a given time-window
> (let's pick 1 hour) but it's not obvious to me when to prune the map
> during this sliding time-window.
>
> I assume dealing with sliding time-windows is a common case for stream
> processing, so Samza probably provides some useful API for it. Does an
> example or tutorial for this kind of sliding time-window counting exist?
>
