Hi there,
We are using MLlib 1.1.1 and doing logistic regression with a dataset of
about 150M rows.
The training part usually goes pretty smoothly without any retries, but
during the prediction stage and the BinaryClassificationMetrics stage I am
seeing retries with "fetch failure" errors.
The prediction part is as follows:

val predictionAndLabel = testRDD.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}
...
val metrics = new BinaryClassificationMetrics(predictionAndLabel)
The fetch failure happens with the following stack trace:
org.apache.spark.rdd.PairRDDFunctions.combineByKey(PairRDDFunctions.scala:515)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3$lzycompute(BinaryClassificationMetrics.scala:101)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.x$3(BinaryClassificationMetrics.scala:96)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions$lzycompute(BinaryClassificationMetrics.scala:98)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.confusions(BinaryClassificationMetrics.scala:98)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.createCurve(BinaryClassificationMetrics.scala:142)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.roc(BinaryClassificationMetrics.scala:50)
org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:60)
com.manage.ml.evaluation.BinaryClassificationMetrics.areaUnderROC(BinaryClassificationMetrics.scala:14)
...
We are running in yarn-client mode, with 32 executors, 16G executor
memory, and 12 cores per executor in the spark-submit settings.
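One thing I was planning to try (just a guess on my end, not something I have
verified) is persisting the prediction RDD before handing it to
BinaryClassificationMetrics, so that a fetch-failure retry during the shuffle
does not recompute all the predictions from scratch:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Cache the predictions so shuffle retries re-read them instead of
// re-running model.predict over the whole test set.
val predictionAndLabel = testRDD.map { point =>
  (model.predict(point.features), point.label)
}.persist(StorageLevel.MEMORY_AND_DISK)

val metrics = new BinaryClassificationMetrics(predictionAndLabel)
val auROC = metrics.areaUnderROC()

// Release the cached RDD once the metrics are computed.
predictionAndLabel.unpersist()
```

I don't know yet whether the recomputation is actually part of the problem,
so this is only a shot in the dark.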
I wonder if anyone has suggestions on how to debug this.

Thanks in advance,
Thomas