Hi all,

I am running some benchmarks on a simple Spark Streaming application which consists of:
- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key
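For reference, the job is essentially the following (a rough sketch; the paths, batch interval, and parsing logic are placeholders, and the real map() parses JSON records rather than splitting on commas):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StateBenchmark {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StateBenchmark")
        val ssc = new StreamingContext(conf, Seconds(10))    // batch interval (placeholder)
        ssc.checkpoint("hdfs:///tmp/benchmark-checkpoints")  // required by updateStateByKey

        // New text files appearing under the HDFS directory become the stream.
        val lines = ssc.textFileStream("hdfs:///data/incoming")

        // Parse each record into a (key, value) pair (real job: parse JSON here).
        val pairs = lines.map { line =>
          val fields = line.split(",", 2)
          (fields(0), 1L)
        }

        // Maintain a running in-memory state per key.
        val state = pairs.updateStateByKey[Long] { (values: Seq[Long], prev: Option[Long]) =>
          Some(prev.getOrElse(0L) + values.sum)
        }

        state.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }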
The processing time per batch gets slower as time passes and the number of states increases, which is expected. However, we also notice spikes occurring at fairly regular intervals. What could cause these spikes? We first suspected the GC, but the logs/metrics don't show any significant GC-related delays. Could this be related to checkpointing? Disk access latencies?

I've attached a graph so you can visualize the problem (please ignore the first spike, which corresponds to system initialization): <http://apache-spark-user-list.1001560.n3.nabble.com/file/n22375/Processing_Delay-page-001.jpg>

Thanks!
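PS: if the spikes do line up with checkpointing of the state DStream, one experiment I was considering is setting the checkpoint interval explicitly and seeing whether the spike period follows it. A rough sketch, reusing the `state` DStream from the snippet above (the 100-second value is just a guess):

    // Checkpoint the state DStream less often than the default (a multiple of
    // the batch interval) and watch whether the spike spacing changes with it.
    state.checkpoint(Seconds(100))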