Re: Sessionization using updateStateByKey

2015-07-15 Thread algermissen1971
On 15 Jul 2015, at 17:38, Cody Koeninger wrote:
> An in-memory hash key data structure of some kind so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys that never ge

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
> to the stats any longer for length of sessions unfortunately, but I seem to remember they were around 10-30 minutes long. Even with peaks in volume, Spark managed to keep up very well.
>
> Thanks,
> Silvio
>
> From: Cody Koeninger
> Date: Wednesday, July 15, 2015 at

Re: Sessionization using updateStateByKey

2015-07-15 Thread Sean McNamara
nately, but I seem to remember they were around 10-30 minutes long. Even with peaks in volume, Spark managed to keep up very well.

Thanks,
Silvio

From: Cody Koeninger
Date: Wednesday, July 15, 2015 at 5:38 PM
To: algermissen1971
Cc: Tathagata Das, swetha, user
Subject: Re: Sessionization

Re: Sessionization using updateStateByKey

2015-07-15 Thread Silvio Fiorito
Subject: Re: Sessionization using updateStateByKey

An in-memory hash key data structure of some kind so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys that never get hi

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
An in-memory hash key data structure of some kind so that you're close to linear on the number of items in a batch, not the number of outstanding keys. That's more complex, because you have to deal with expiration for keys that never get hit, and for unusually long sessions you have to either drop
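
A rough sketch of the kind of structure described above, in plain Python rather than Spark. Everything here is hypothetical and for illustration only: the `SessionStore` class, its `update` / `expire` methods, and the `idle_timeout` parameter are not from the thread or from any Spark API. It assumes event timestamps arrive in non-decreasing order, so that insertion order in the dict matches last-seen order. Updating touches only the keys in the batch, and expiry stops at the first session that is still fresh, so neither step scans all outstanding keys.

```python
from collections import OrderedDict


class SessionStore:
    """Hypothetical in-memory session store; not part of Spark."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        # key -> (first_seen, last_seen, event_count), kept ordered by last_seen
        self._sessions = OrderedDict()

    def update(self, batch):
        """Apply a batch of (key, timestamp) events; work is proportional to len(batch)."""
        for key, ts in batch:
            first, _, count = self._sessions.get(key, (ts, ts, 0))
            self._sessions[key] = (first, ts, count + 1)
            self._sessions.move_to_end(key)  # most recently hit key goes last

    def expire(self, now):
        """Remove and return sessions idle for longer than idle_timeout.

        Assumes non-decreasing timestamps, so the oldest session is always first
        and the scan can stop at the first session that is still fresh.
        """
        closed = []
        while self._sessions:
            key, (first, last, count) = next(iter(self._sessions.items()))
            if now - last <= self.idle_timeout:
                break
            self._sessions.popitem(last=False)
            closed.append((key, first, last, count))
        return closed

    def __len__(self):
        return len(self._sessions)
```

Calling `expire` once per batch handles the keys that never get hit again. What to do with unusually long sessions is a separate decision that this sketch does not address.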

Re: Sessionization using updateStateByKey

2015-07-15 Thread algermissen1971
Hi Cody,

oh ... I thought that was one of *the* use cases for it. Do you have a suggestion / best practice for how to achieve the same thing with better scaling characteristics?

Jan

On 15 Jul 2015, at 15:33, Cody Koeninger wrote:
> I personally would try to avoid updateStateByKey for sessionizati

Re: Sessionization using updateStateByKey

2015-07-15 Thread Cody Koeninger
I personally would try to avoid updateStateByKey for sessionization when you have long sessions / a lot of keys, because it's linear on the number of keys.

On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das wrote:
> [Apologies for repost, for those who have seen this response already in the dev ma
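
To make the "linear on the number of keys" point concrete, here is a toy model in plain Python. It is not the Spark API: `update_state_by_key` and `count_events` are hypothetical names that only imitate the semantics, where the update function runs for every key that has state in each batch interval, including keys with no new values, and returning `None` drops the key.

```python
def update_state_by_key(state, batch, update_func):
    """Toy model of one batch interval (illustration only, not Spark code).

    state: dict of key -> current state
    batch: list of (key, value) pairs that arrived in this interval
    Returns the new state and how many times update_func was called.
    """
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state = {}
    calls = 0
    # Every key with existing state is visited, even if the batch has nothing for it.
    for key in state.keys() | grouped.keys():
        calls += 1
        result = update_func(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None removes the key from the state
            new_state[key] = result
    return new_state, calls


def count_events(new_values, running_count):
    """Example update function: running count of events per key."""
    return (running_count or 0) + len(new_values)
```

With 1000 outstanding keys and a single new event, the update function in this model is still called 1000 times, which is the cost being warned about for long sessions and large key counts.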

Re: Sessionization using updateStateByKey

2015-07-14 Thread Tathagata Das
[Apologies for repost, for those who have seen this response already in the dev mailing list]

1. When you set ssc.checkpoint(checkpointDir), Spark Streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app wi