Hi Spark guru
I am evaluating Spark Streaming,
In my application I need to maintain cumulative statistics (e.g the total
running word count), so I need to call the updateStateByKey function on
very micro-batch.
After setting those things, I got following behaviors:
* The Processing Time is very high every 10 seconds - usually 5x
higher (which I guess it's data checking point job)
* The Processing Time becomes higher and higher over time, after 10
minutes it's much higher than the batch interval and lead to huge
Scheduling Delay and a lots Active Batches in queue.
My questions is:
* Is this expected behavior? Is there any way to improve the
performance of data checking point?
* How data checking point in Spark Streaming works? Does it need
to load all previous checking point data in order to build new one?
My job is very simple:
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
Durations.seconds(2));
JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream(...);
JavaPairDStream<String, List<Double>> stats = messages.mapToPair(parseJson)
.reduceByKey(REDUCE_STATS) .updateStateByKey(RUNNING_STATS);
stats.print()