Hello,

I was looking at Spark streaming UI and noticed a big difference between
"Processing time" and "Job duration"

[image: Inline image 1]

The processing time / output op duration is shown as 50s, but the sum of all job
durations is only ~25s.
What is causing this difference? Based on the logs, I know the batch
actually took 50s.

[image: Inline image 2]

The job that takes most of the time is:
    joinRDD.toDS()
           .write.format("com.databricks.spark.csv")
           .mode(SaveMode.Append)
           .options(Map("mode" -> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "false"))
           .partitionBy("entityId", "regionId", "eventDate")
           .save(outputPath)

Removing SaveMode.Append really speeds things up, and the mismatch
between job duration and processing time also disappears.
I'm not able to explain what is causing this, though.
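For reference, the faster variant simply drops the .mode(SaveMode.Append) line. Note this is only a sketch: without an explicit mode, Spark's DataFrameWriter defaults to SaveMode.ErrorIfExists, so this assumes each batch writes to a fresh outputPath.

    joinRDD.toDS()
           .write.format("com.databricks.spark.csv")
           // no .mode(SaveMode.Append) here; the default is ErrorIfExists,
           // so outputPath must not already exist when the batch runs
           .options(Map("mode" -> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "false"))
           .partitionBy("entityId", "regionId", "eventDate")
           .save(outputPath)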

Srikanth
