More efficient RDD.count() implementation

OBones Wed, 12 Jul 2017 08:34:29 -0700

Hello,

As I have written my own data source, I also wrote a custom RDD[Row] implementation to provide getPartitions and compute overrides. This works very well, but while doing some performance analysis I see that, for any given pipeline fit operation, a fair amount of time is spent in the RDD.count method. Its default implementation in RDD.scala goes through the entire iterator, which in my case is counterproductive because I already know the number of rows in the RDD or in any partition returned by getPartitions.

As an initial attempt, I declared the following in my custom RDD implementation:


  override def count(): Long = { reader.RowCount }

but this never gets called, which upon further inspection makes perfect sense. Indeed, the internal code creates RDDs for every partition it has to work on. And this is where I'm a bit stuck, because I have no idea how to override this creation.
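To make the problem concrete, here is a minimal sketch of why the override is bypassed. It is a simplified, Spark-free model and not Spark's actual code: MiniRDD, MappedRDD, KnownSizeRDD and CountDemo are hypothetical names, and the sketch assumes the count is requested on an RDD derived from mine by a transformation rather than on my RDD itself.

```scala
// Simplified, Spark-free model of the situation. MiniRDD, MappedRDD,
// KnownSizeRDD and CountDemo are hypothetical names, not Spark classes.
abstract class MiniRDD[T] {
  def partitions: Seq[Int]
  def compute(partition: Int): Iterator[T]

  // Default count: walks the whole iterator of every partition.
  def count(): Long = partitions.map(p => compute(p).size.toLong).sum

  // A transformation returns a new wrapper object around this one.
  def map[U](f: T => U): MiniRDD[U] = new MappedRDD(this, f)
}

class MappedRDD[T, U](parent: MiniRDD[T], f: T => U) extends MiniRDD[U] {
  def partitions: Seq[Int] = parent.partitions
  def compute(partition: Int): Iterator[U] = parent.compute(partition).map(f)
  // count() is inherited from MiniRDD, so the parent's override is never consulted.
}

class KnownSizeRDD(rowsPerPartition: Vector[Int]) extends MiniRDD[Int] {
  var computeCalls = 0 // how many partitions were actually iterated

  def partitions: Seq[Int] = rowsPerPartition.indices

  def compute(partition: Int): Iterator[Int] = {
    computeCalls += 1
    Iterator.range(0, rowsPerPartition(partition))
  }

  // Cheap count based on sizes known up front (stands in for reader.RowCount).
  override def count(): Long = rowsPerPartition.map(_.toLong).sum
}

object CountDemo {
  def main(args: Array[String]): Unit = {
    val source = new KnownSizeRDD(Vector(3, 4, 5))
    println(source.count())            // 12, answered by the override
    println(source.computeCalls)       // 0, nothing was iterated
    println(source.map(_ * 2).count()) // 12, but through the default count
    println(source.computeCalls)       // 3, every partition was iterated
  }
}
```

In this model the cheap count only helps when it is called directly on the source object; as soon as a wrapper sits in between, the default count runs and every partition is computed.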

Here is a call stack for a GBTRegressor run, but it's quite similar for RandomForestRegressor or DecisionTreeRegressor.


org.apache.spark.rdd.RDD.count(RDD.scala:1158)
org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:116)
org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:125)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.boost(GradientBoostedTrees.scala:291)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.run(GradientBoostedTrees.scala:49)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:154)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:58)
org.apache.spark.ml.Predictor.fit(Predictor.scala:96)

Any suggestion would be much appreciated.

Regards
