Have you considered using the Spark Web UI to view progress on your job? It shows the progress of the overall job and lets you drill into the individual tasks and server activity.
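
If you would rather print progress from the driver yourself, the accumulator idea from the question can be combined with a background timer thread. The following is only a rough sketch under some assumptions: it uses `sc.longAccumulator`, which exists in Spark 2.0 and later (on 1.x the `sc.accumulator(0L)` call from the question plays the same role); the names `ProgressDemo`, `progressMessage`, and the `local[*]` master are made up for illustration; and `handleLine` is an empty stand-in for the real per-line work.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.spark.{SparkConf, SparkContext}

object ProgressDemo {
  // Hypothetical stand-in for the per-line work from the question.
  def handleLine(line: String): Unit = ()

  // Hypothetical helper: formats the progress line printed by the timer.
  def progressMessage(count: Long): String = s"Lines processed so far: $count"

  def main(args: Array[String]): Unit = {
    // Assumption: local master for illustration; use your own configuration.
    val conf = new SparkConf().setAppName("progress-demo").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("c:/temp/input.txt")
    // Spark 2.0+ API; on Spark 1.x use sc.accumulator(0L) instead.
    val acCount = sc.longAccumulator("linesProcessed")

    // Driver-side timer that reads the accumulator once a minute.
    val reporter = Executors.newSingleThreadScheduledExecutor()
    reporter.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = println(progressMessage(acCount.value))
    }, 1, 1, TimeUnit.MINUTES)

    try {
      // foreach blocks the calling thread until the job finishes,
      // which is why the reporting runs on a separate thread.
      lines.foreach { line =>
        handleLine(line)
        acCount.add(1)
      }
      println(progressMessage(acCount.value)) // final count
    } finally {
      reporter.shutdownNow()
      sc.stop()
    }
  }
}
```

One caveat: as far as I know, the driver only sees accumulator updates as tasks complete, so the printed count advances in steps of one partition rather than line by line. If the input ends up in only a few large partitions, repartitioning it into more, smaller ones should give finer-grained progress.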

On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) <ningjun.w...@lexisnexis.com> wrote:

> How can I get progress information of a RDD operation? For example
>
> val lines = sc.textFile("c:/temp/input.txt") // a RDD of millions of lines
> lines.foreach(line => {
>   handleLine(line)
> })
>
> The input.txt contains millions of lines. The entire operation takes 6
> hours. I want to print out how many lines are processed every 1 minute so
> the user knows the progress. How can I do that?
>
> One way I am thinking of is to use an accumulator, e.g.
>
> val lines = sc.textFile("c:/temp/input.txt")
> val acCount = sc.accumulator(0L)
> lines.foreach(line => {
>   handleLine(line)
>   acCount += 1
> })
>
> However, how can I print out acCount every 1 minute?
>
> Ningjun