How can I get progress information for an RDD operation? For example:
val lines = sc.textFile("c:/temp/input.txt") // an RDD of millions of lines
lines.foreach(line => {
  handleLine(line)
})
The input.txt file contains millions of lines, and the entire operation takes
6 hours. I want to print out how many lines have been processed every minute so
that the user knows the progress. How can I do that?
One way I am thinking of is to use an accumulator, e.g.
val lines = sc.textFile("c:/temp/input.txt")
val acCount = sc.accumulator(0L)
lines.foreach(line => {
  handleLine(line)
  acCount += 1
})
However, how can I print out acCount every minute?
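One possibility might be a timer thread on the driver that reads the counter at a fixed interval. Below is a minimal sketch of just that periodic-printing part, not a tested Spark solution. ProgressReporter, readCount and report are hypothetical names made up for illustration, and the sketch assumes that the accumulator's value can be read on the driver from another thread while the job is still running.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Hypothetical helper (not part of Spark): calls `report` with the current
// value of a counter at a fixed interval, from a background thread.
object ProgressReporter {
  def start(
      readCount: () => Long,          // assumption: e.g. () => acCount.value on the driver
      period: Long,
      unit: TimeUnit,
      report: Long => Unit = (n: Long) => println(s"Lines processed so far: $n")
  ): ScheduledExecutorService = {
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    val task = new Runnable {
      // Runs on the scheduler thread once per period.
      override def run(): Unit = report(readCount())
    }
    // The first report fires after one full period, then repeats every period.
    scheduler.scheduleAtFixedRate(task, period, period, unit)
    scheduler
  }
}
```

On the driver this might be used by calling ProgressReporter.start(() => acCount.value, 1, TimeUnit.MINUTES) before lines.foreach(...), and calling shutdownNow() on the returned scheduler in a finally block afterwards, since the scheduler thread is not a daemon thread and would otherwise keep the JVM alive. As far as I understand, the printed count would only include tasks that have already finished, because accumulator updates reach the driver when a task completes.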
Ningjun