Spark Streaming data checkpoint performance

Thúy Hằng Lê Sun, 01 Nov 2015 21:21:07 -0800

Hi Spark gurus,

I am evaluating Spark Streaming.


In my application I need to maintain cumulative statistics (e.g. the total
running word count), so I need to call the updateStateByKey function on
every micro-batch.
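
To make the pattern concrete, here is a minimal sketch of a running word
count in plain Java, without Spark. The class and method names
(RunningCountSketch, update, applyBatch) are hypothetical and only stand in
for the kind of function one passes to updateStateByKey. The sketch shows
the merge of new values into the previous state, not how Spark schedules or
checkpoints that state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RunningCountSketch {
    // Hypothetical stand-in for an update function: merges the values seen
    // for one key in the current micro-batch into that key's previous state.
    // previous is null when the key has not been seen before.
    public static int update(List<Integer> newValues, Integer previous) {
        int total = (previous == null) ? 0 : previous;
        for (int v : newValues) {
            total += v;
        }
        return total;
    }

    // Applies one micro-batch of words to the running state, in place.
    public static void applyBatch(Map<String, Integer> state, List<String> words) {
        // Group this batch's values by key (each word contributes a 1).
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        // Merge each key's new values with its previous state.
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            String key = e.getKey();
            state.put(key, update(e.getValue(), state.get(key)));
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new TreeMap<>(); // sorted for stable printing
        applyBatch(state, Arrays.asList("a", "b", "a"));
        applyBatch(state, Arrays.asList("a", "c"));
        System.out.println(state); // prints {a=3, b=1, c=1}
    }
}
```

The state map keeps growing as new keys arrive, which is why the state has
to be carried from one micro-batch to the next.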

After setting this up, I observed the following behavior:
        * The Processing Time is very high every 10 seconds, usually 5x
higher (which I guess is the data checkpointing job).
        * The Processing Time grows over time. After 10 minutes it is much
higher than the batch interval, which leads to a huge Scheduling Delay and
a lot of Active Batches in the queue.
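
The second symptom follows from simple queueing arithmetic, sketched below
in plain Java. This is only an illustration under stated assumptions:
batches are processed strictly one at a time, each batch is submitted when
its interval closes, and the processing times in main are made-up numbers,
not measurements from my job. The names SchedulingDelaySketch and
schedulingDelays are hypothetical.

```java
import java.util.Arrays;

public class SchedulingDelaySketch {
    // Returns, for each batch, how long it waits in the queue before it
    // starts. Assumes batches run sequentially, one at a time.
    public static long[] schedulingDelays(long batchIntervalMs, long[] processingMs) {
        long[] delays = new long[processingMs.length];
        long freeAt = 0; // time at which the previous batch finishes
        for (int n = 0; n < processingMs.length; n++) {
            long readyAt = (n + 1) * batchIntervalMs; // batch n's interval closes
            long startAt = Math.max(readyAt, freeAt); // wait for the previous batch
            delays[n] = startAt - readyAt;
            freeAt = startAt + processingMs[n];
        }
        return delays;
    }

    public static void main(String[] args) {
        long interval = 2000; // 2-second batch interval, as in the job below

        // Hypothetical: one slow batch (5x the usual 1000 ms), then normal.
        long[] spike = {1000, 1000, 5000, 1000, 1000, 1000};
        System.out.println(Arrays.toString(schedulingDelays(interval, spike)));
        // prints [0, 0, 0, 3000, 2000, 1000]

        // Hypothetical: every batch takes longer than the interval.
        long[] overloaded = {3000, 3000, 3000};
        System.out.println(Arrays.toString(schedulingDelays(interval, overloaded)));
        // prints [0, 1000, 2000]
    }
}
```

Under these assumptions a single slow batch is absorbed after a few
intervals, but once the processing time stays above the batch interval the
delay grows with every batch and the queue never drains.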

My questions are:

         * Is this expected behavior? Is there any way to improve the
performance of data checkpointing?
         * How does data checkpointing work in Spark Streaming? Does it
need to load all previous checkpoint data in order to build a new one?

My job is very simple:

JavaStreamingContext jssc =
    new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaPairInputDStream<String, String> messages =
    KafkaUtils.createDirectStream(...);

JavaPairDStream<String, List<Double>> stats = messages
    .mapToPair(parseJson)
    .reduceByKey(REDUCE_STATS)
    .updateStateByKey(RUNNING_STATS);

stats.print();
