Hi all,

I am running some benchmarks on a simple Spark Streaming application which
consists of (a rough sketch is included after the list):
- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key.
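
For reference, here is roughly what the job looks like. The paths, batch
interval, JSON field names and the json4s parsing below are placeholders for
illustration, not the actual code:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object StateBenchmark {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StateBenchmark")
    val ssc = new StreamingContext(conf, Seconds(10))  // batch interval is a placeholder
    ssc.checkpoint("hdfs:///tmp/spark-checkpoints")    // required by updateStateByKey

    // 1) pick up new text files dropped into an HDFS directory
    val lines = ssc.textFileStream("hdfs:///data/incoming")

    // 2) parse each record into a (key, value) pair (json4s here, any parser would do)
    val pairs = lines.map { line =>
      implicit val formats: Formats = DefaultFormats
      val json = parse(line)
      ((json \ "key").extract[String], (json \ "value").extract[Long])
    }

    // 3) keep an in-memory state per key (a simple running sum in this sketch)
    val state = pairs.updateStateByKey[Long] { (newValues, current) =>
      Some(current.getOrElse(0L) + newValues.sum)
    }

    state.print()
    ssc.start()
    ssc.awaitTermination()
  }
}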

The processing time per batch gets slower as time passes and the number of
states increases, which is expected.
However, we also notice spikes occurring at rather regular intervals. What
could cause those spikes? We first suspected the GC, but the logs/metrics
don't seem to show any significant GC-related delays. Could this be related
to checkpointing? Disk access latencies?
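
If it is checkpointing, I assume the spacing of the spikes would follow the
checkpoint interval of the state DStream (called "state" in the sketch above),
which can be set explicitly. The 60s below is just an arbitrary value for the
experiment; by default it is a multiple of the batch interval:

// Hypothetical test: if the spikes come from checkpointing the state DStream,
// widening its checkpoint interval should widen the spacing between spikes too.
state.checkpoint(Seconds(60))  // arbitrary value, only to see if the spike spacing follows it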

I've attached a graph so you can visualize the problem (please ignore the
first spike, which corresponds to system initialization):

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n22375/Processing_Delay-page-001.jpg>

Thanks!



