I'd like to use the SparkListenerInterface to collect some metrics for monitoring/logging/metadata purposes. The first ones I'm interested in hooking into are recordsWritten and bytesWritten, as a measure of write throughput. I'm using PySpark to write Parquet files from DataFrames.
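For context, the listener side looks roughly like the sketch below (the class name and println are mine; the listener has to live on the JVM side even for a PySpark app, so I register it with spark.extraListeners and ship it via --jars):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Minimal sketch. Register with e.g.:
    //   spark-submit --conf spark.extraListeners=ThroughputListener --jars listener.jar ...
    class ThroughputListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {  // taskMetrics can be null for failed tasks
          val out = metrics.outputMetrics
          // These are the two fields that stay 0 when writing Parquet
          // through the DataFrame writer.
          println(s"stage=${taskEnd.stageId} " +
            s"bytesWritten=${out.bytesWritten} recordsWritten=${out.recordsWritten}")
        }
      }
    }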
I'm able to extract a rich set of metrics this way, but for some reason those two are always 0. This mirrors what I see in the Spark Application Master UI, where the "# records written" field is always missing. I've already filed a JIRA for this issue: https://issues.apache.org/jira/browse/SPARK-22605

I _think_ how this works is that inside the ResultTask.runTask method, the rdd.iterator call increments the bytes read and records read via RDD.getOrCompute. Where would the equivalent be for the records written metrics? These metrics are populated properly if I save the data as an RDD via df.rdd.saveAsTextFile (minimal repro sketched below), so the code path clearly exists somewhere. Any hints as to where I should be looking?
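For completeness, a minimal repro of the difference between the two write paths (paths and sizes are placeholders; I hit this from PySpark, but it's sketched in Scala here to match the listener above, since the DataFrame write goes through the same JVM code either way):

    // Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
    val df = spark.range(1000000).toDF("id")

    // DataFrame writer path: the listener sees bytesWritten = 0, recordsWritten = 0.
    df.write.mode("overwrite").parquet("/tmp/parquet_out")

    // RDD path: the same listener sees both metrics populated.
    df.rdd.map(_.toString).saveAsTextFile("/tmp/text_out")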