Ashutosh, the counter will certainly be a parallelization issue when multiple nodes are used, especially over massive datasets. A better approach would be to use something along these lines:

    val index = sc.parallelize(Range.Long(0, rdd.count, 1), rdd.partitions.size)
    val rddWithIndex = rdd.zip(index)

This zips the two RDDs in a parallelizable fashion. Note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition, so this only works when the index RDD happens to be split exactly like rdd.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Contributing-Algorithm-for-Outlier-Detection-tp8880p9399.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
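
For comparison, here is a minimal sketch of the same idea using RDD.zipWithIndex, which is part of Spark's RDD API and assigns indices without a shared counter. The object name ZipWithIndexSketch, the helper withIndex, the local[2] master and the sample data are all made up for illustration; this is not the code from the contribution under discussion.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipWithIndexSketch {
  // Hypothetical helper: pair every element with a zero-based Long index.
  // Indices are ordered by partition index first, then by position
  // within each partition.
  def withIndex[T](rdd: RDD[T]): RDD[(T, Long)] = rdd.zipWithIndex()

  def main(args: Array[String]): Unit = {
    // Local master with two threads, assumed here only for the demo.
    val conf = new SparkConf()
      .setAppName("zip-with-index-sketch")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Five sample elements spread over three partitions.
      val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e"), 3)
      withIndex(rdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

One thing to keep in mind: when the RDD has more than one partition, zipWithIndex has to run a Spark job up front to count the elements in each partition, so it is not free on very large inputs. If the indices only need to be unique rather than consecutive, zipWithUniqueId avoids that extra job.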