Hi Cody, oh ... I though that was one of *the* use cases for it. Do you have a suggestion / best practice how to achieve the same thing with better scaling characteristics?
Jan On 15 Jul 2015, at 15:33, Cody Koeninger <[email protected]> wrote: > I personally would try to avoid updateStateByKey for sessionization when you > have long sessions / a lot of keys, because it's linear on the number of keys. > > On Tue, Jul 14, 2015 at 6:25 PM, Tathagata Das <[email protected]> wrote: > [Apologies for repost, for those who have seen this response already in the > dev mailing list] > > 1. When you set ssc.checkpoint(checkpointDir), the spark streaming > periodically saves the state RDD (which is a snapshot of all the state data) > to HDFS using RDD checkpointing. In fact, a streaming app with > updateStateByKey will not start until you set checkpoint directory. > > 2. The updateStateByKey performance is sort of independent of the what is the > source that is being use - receiver based or direct Kafka. The absolutely > performance obvious depends on a LOT of variables, size of the cluster, > parallelization, etc. The key things is that you must ensure sufficient > parallelization at every stage - receiving, shuffles (updateStateByKey > included), and output. > > Some more discussion in my talk - https://www.youtube.com/watch?v=d5UJonrruHk > > > > On Tue, Jul 14, 2015 at 4:13 PM, swetha <[email protected]> wrote: > > Hi, > > I have a question regarding sessionization using updateStateByKey. If near > real time state needs to be maintained in a Streaming application, what > happens when the number of RDDs to maintain the state becomes very large? > Does it automatically get saved to HDFS and reload when needed or do I have > to use any code like ssc.checkpoint(checkpointDir)? Also, how is the > performance if I use both DStream Checkpointing for maintaining the state > and use Kafka Direct approach for exactly once semantics? > > > Thanks, > Swetha > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Sessionization-using-updateStateByKey-tp23838.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
