Hello,

As I have written my own data source, I also wrote a custom RDD[Row] implementation to provide getPartitions and compute overrides. This works very well but doing some performance analysis, I see that for any given pipeline fit operation, a fair amount of time is spent in the RDD.count method. Its default implementation in RDD.scala is to go through the entire iterator, which in my case is counter productive because I already know the number of rows there are in the RDD or any partition returned by getPartitions. As an initial attempt, I declared the following in my custom RDD implementation:

  override def count(): Long = { reader.RowCount }

but this never gets called which upon further inspection makes perfect sense. Indeed the internal code creates RDDs for every partition it has to work on. And this is where I'm a bit stuck because I have no idea as to how to override this creation.

Here is a call stack for a GBTRegressor run, but it's quite similar for RandomForestRegressor or DecisionTreeRegressor.

org.apache.spark.rdd.RDD.count(RDD.scala:1158)
org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:116)
org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:125)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.boost(GradientBoostedTrees.scala:291)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.run(GradientBoostedTrees.scala:49)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:154)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:58)
org.apache.spark.ml.Predictor.fit(Predictor.scala:96)

Any suggestion would be much appreciated.

Regards

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Reply via email to