Hi Uwe,

Your use case looks to me more like a state-management problem. What comes to mind is:

1) Every time a song is played, you update the count for that song. You do not keep the map in memory since, as you said, it could grow quite large. Instead, you use Samza's built-in key-value storage. (You do all this in the process method.)
2) You scan the whole key-value DB every, say, one hour. (You do all this in the window method.)

* This could provide better fault tolerance: if, for example, your machine goes down during that hour, you will not lose any counts, because the key-value DB can be restored.

Some relevant links:

* http://samza.apache.org/learn/documentation/0.8/container/state-management.html#windowed-aggregation
* http://samza.apache.org/learn/documentation/0.8/container/state-management.html#approaches-to-managing-task-state
* http://samza.apache.org/learn/documentation/0.8/container/state-management.html#key-value-storage

Hope this helps.

Cheers,

Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108

On Tue, Feb 17, 2015 at 11:35 AM, Uwe Dauernheim <u...@dauernheim.net> wrote:

> I am trying to model a music charts system to get familiar with Samza.
> Charts are defined by the top N entries with the highest count in a map
> from unique track ID, basically a song, to counter, basically the
> number of plays of this entity, during a sliding time-window.
>
> The problem I see is the ever-growing size of this map, as the ID
> space of tracks can be quite large (let's pick 2E6). Not all of these
> IDs will be played (and thus should be counted) within a given time-window
> (let's pick 1 hour), but it's not obvious to me when to prune the map
> during this sliding time-window.
>
> I assume dealing with sliding time-windows is a common case for stream
> processing, and thus that Samza provides some useful API for it. Does an
> example or tutorial for this kind of sliding time-window counting exist?
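
As a rough illustration of the two steps suggested in the reply (count on every play in process, scan the store periodically in window), here is a minimal sketch in plain Python. It does not use the Samza API: the class name ChartTask, the top_n parameter, and the dict standing in for the key-value store are all hypothetical, and clearing the store after each scan is an assumption that the reply does not state.

```python
import heapq
from typing import Dict, List, Tuple


class ChartTask:
    """Plain-Python stand-in for a task with process() and window() hooks."""

    def __init__(self, top_n: int) -> None:
        self.top_n = top_n
        # Assumption: a dict stands in for the persistent key-value store.
        self.store: Dict[str, int] = {}

    def process(self, track_id: str) -> None:
        # Step 1: a play event arrives, so bump that track's counter.
        self.store[track_id] = self.store.get(track_id, 0) + 1

    def window(self) -> List[Tuple[str, int]]:
        # Step 2: scan every entry and keep the N highest counts.
        top = heapq.nlargest(self.top_n, self.store.items(), key=lambda kv: kv[1])
        # Assumption: reset the store so it cannot grow without bound.
        self.store.clear()
        return top


task = ChartTask(top_n=2)
for played in ["a", "b", "a", "c", "a", "b"]:
    task.process(played)
chart = task.window()  # top 2 tracks of the period that just ended
```

Note that resetting the store on every scan gives fixed (tumbling) one-hour periods rather than a true sliding window; a sliding window would need per-interval counts that are expired individually.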