You are correct: the default checkpointing interval is 10 seconds or your batch
interval, whichever is larger. You can change it by calling .checkpoint(interval)
on your resulting DStream.
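To make the default concrete, here is a minimal sketch in plain Java (no Spark dependency) of that "larger of batch interval and 10 seconds" rule; defaultCheckpointIntervalMs is a hypothetical helper for illustration, not a Spark API, and in real code you would simply override the interval with dstream.checkpoint(Durations.seconds(n)):

```java
public class CheckpointIntervalSketch {
    // Hypothetical helper mirroring the default choice described above:
    // checkpoint every max(batch interval, 10 seconds).
    static long defaultCheckpointIntervalMs(long batchIntervalMs) {
        return Math.max(batchIntervalMs, 10_000L);
    }

    public static void main(String[] args) {
        // With 2-second batches (as in your snippet), checkpoints default to 10 s.
        System.out.println(defaultCheckpointIntervalMs(2_000));
        // With 30-second batches, the batch interval wins.
        System.out.println(defaultCheckpointIntervalMs(30_000));
    }
}
```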
As for the rest, you are probably keeping an "all-time" word count that grows
without bound if you never remove words from the map. Keep in mind that
updateStateByKey is called for every key in the state RDD, regardless of
whether it has new occurrences or not.
You should consider at least one of these strategies:
* Run your word count on a windowed DStream (e.g. unique counts over the
  last 15 minutes)
  * Your best bet here is reduceByKeyAndWindow with an inverse function
* Make your state object richer and prune out words with very few
  occurrences or that haven't been updated in a long time
  * You can do this by returning None from your updateStateByKey update function
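On the first strategy: the inverse-function variant of reduceByKeyAndWindow avoids re-reducing the whole window on every slide by adding the newest batch and subtracting the batch that just fell out. Here is a toy sketch in plain Java (no Spark) of that bookkeeping for a single key; the class and method names are illustrative, not Spark APIs:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of reduceByKeyAndWindow-with-inverse for one key: keep a running
// total, add the new batch's count, and subtract the batch leaving the window.
public class WindowedCountSketch {
    private final int windowBatches;              // window length, in batches
    private final Deque<Long> batches = new ArrayDeque<>();
    private long windowTotal = 0;                 // running count over the window

    WindowedCountSketch(int windowBatches) {
        this.windowBatches = windowBatches;
    }

    // Called once per batch; returns the count over the last `windowBatches` batches.
    long slide(long newBatchCount) {
        windowTotal += newBatchCount;             // reduce: add the new batch
        batches.addLast(newBatchCount);
        if (batches.size() > windowBatches) {
            windowTotal -= batches.removeFirst(); // inverse reduce: subtract old batch
        }
        return windowTotal;
    }
}
```

The point is that per-slide work is O(1) per key instead of O(window length), which is exactly what the inverse function buys you in Spark.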
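On the second strategy: the update function you pass to updateStateByKey can drop a key entirely by returning an absent Optional (the Java equivalent of emitting None in Scala). Below is a plain-Java sketch of such a prune-friendly update function, with java.util.Optional standing in for whichever Optional type your Spark version uses; the state layout ([count, batchesSinceLastSeen]) and the MAX_IDLE_BATCHES threshold are assumptions for illustration:

```java
import java.util.List;
import java.util.Optional;

public class PruneStateSketch {
    // Assumption for this sketch: forget a word after 100 batches with no occurrences.
    static final long MAX_IDLE_BATCHES = 100;

    // State per word is a pair: [0] = total count, [1] = batches since last seen.
    static Optional<long[]> updateWordState(List<Long> newCounts, Optional<long[]> state) {
        long count = state.map(s -> s[0]).orElse(0L);
        long idle  = state.map(s -> s[1]).orElse(0L);
        if (newCounts.isEmpty()) {
            idle++;
            if (idle > MAX_IDLE_BATCHES) {
                return Optional.empty();   // "emit None": key is removed from the state RDD
            }
        } else {
            idle = 0;                      // word seen again, reset the idle counter
            for (long c : newCounts) {
                count += c;
            }
        }
        return Optional.of(new long[] { count, idle });
    }
}
```

Since updateStateByKey visits every key each batch anyway, this check adds no extra passes; it just ensures stale keys eventually leave the state instead of accumulating forever.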
Hope this helps,
-adrian
From: Thúy Hằng Lê
Date: Monday, November 2, 2015 at 7:20 AM
To: "[email protected]"
Subject: Spark Streaming data checkpoint performance
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));