Here is a follow-up to the previous evaluation.

"aggregate at GradientDescent.scala:178" never finishes at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
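For scale, here is a rough back-of-the-envelope sketch of how much data that
aggregate has to combine. It assumes (we did not verify this) that each
partition contributes one dense partial gradient of length numFeatures; the
feature and executor counts are the ones from the evaluation quoted below.

// Sketch only: estimate the size of the dense partial gradients that the
// aggregate has to merge on the driver.
val numFeatures   = 1354731   // news20.random.1000
val numPartitions = 32        // assumption: one partial gradient per executor/partition
val partialMB = numFeatures * 8L / 1e6            // ~8 bytes per Double
val totalMB   = partialMB * numPartitions
println(f"per-partition gradient: $partialMB%.1f MB, merged on driver: $totalMB%.1f MB")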
We confirmed, by -verbose:gc, that GC is not happening during the aggregate
and that the cumulative CPU time for the task is increasing little by little.

LBFGS also does not work for a large number of features (news20.random.1000),
though it works fine for a small number of features (news20.binary.1000).

"aggregate at LBFGS.scala:201" also never finishes at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201

-----------------------------------------------------------------------
[Evaluated code for LBFGS]

import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.optimization._

val data = MLUtils.loadLibSVMFile(sc,
  "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false)
val numFeatures = data.take(1)(0).features.size
val training = data.map(x => (x.label, MLUtils.appendBias(x.features))).cache()

// Run the training algorithm to build the model
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))

val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  initialWeightsWithIntercept)
-----------------------------------------------------------------------

Thanks,
Makoto

2014-06-17 21:32 GMT+09:00 Makoto Yui <yuin...@gmail.com>:
> Hello,
>
> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
> Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset even
> though the number of training examples used in the evaluation is just
> 1,000.
>
> It works fine for the dataset *news20.binary.1000*, which has 178,560
> features. However, it does not work for *news20.random.1000*, where the
> number of features is large (1,354,731), even though we used sparse
> vectors through MLUtils.loadLibSVMFile().
>
> The execution does not seem to progress, and no error is reported in the
> spark-shell or in the stdout/stderr of the executors.
>
> We used 32 executors, each allocated 7GB (of which 2GB is for RDD
> storage) of working memory.
>
> Any suggestions? Your help is really appreciated.
>
> ==============
> Executed code
> ==============
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>
> //val training = MLUtils.loadLibSVMFile(sc,
> //  "hdfs://host:8020/dataset/news20-binary/news20.binary.1000",
> //  multiclass = false)
> val training = MLUtils.loadLibSVMFile(sc,
>   "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>   multiclass = false)
>
> val numFeatures = training.take(1)(0).features.size
> // numFeatures: Int = 178560 for news20.binary.1000
> // numFeatures: Int = 1354731 for news20.random.1000
> val model = LogisticRegressionWithSGD.train(training, numIterations = 1)
>
> ==================================
> The dataset used in the evaluation
> ==================================
>
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>
> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
> $ sort -R news20.binary > news20.random
> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000
>
> You can find the dataset at
> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>
>
> Thanks,
> Makoto
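A small sanity check we could also run against the quoted code above (a sketch
only; it reuses the training RDD returned by MLUtils.loadLibSVMFile and assumes
nothing else) to confirm that the loaded examples really are sparse, which
would point at the 1,354,731-dimensional dense weight/gradient vectors rather
than at the data itself:

import org.apache.spark.mllib.linalg.SparseVector

// Sketch only: count the non-zero entries per example to confirm that
// MLUtils.loadLibSVMFile produced sparse vectors.
val nnz = training.map(p => p.features match {
  case sv: SparseVector => sv.indices.length   // sparse, as expected
  case v                => v.size              // would indicate a dense vector
})
println(s"examples: ${nnz.count()}, max non-zeros per example: ${nnz.reduce(_ max _)}")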