Here is a follow-up to the previous evaluation.

"aggregate at GradientDescent.scala:178" never finishes at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178

Using -verbose:gc, we confirmed that GC is not happening during the aggregate
and that the cumulative CPU time for the task increases only little by little.
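
(For reference, GC logging on the executors can be enabled by passing the usual
JVM flags through spark.executor.extraJavaOptions. The sketch below shows one
way to do this from a standalone driver; the app name is an assumption, and
when using spark-shell the same property can instead go into
conf/spark-defaults.conf.)

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: forward GC logging flags to every executor JVM so that
// -verbose:gc output appears in each executor's stderr log.
val conf = new SparkConf()
  .setAppName("news20-eval")   // assumed name, for illustration only
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)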

LBFGS also does not work for a large number of features (news20.random.1000),
though it works fine for a smaller number of features (news20.binary.1000).

"aggregate at LBFGS.scala:201" also never finishes at
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201

-----------------------------------------------------------------------
[Evaluated code for LBFGS]

import org.apache.spark.SparkContext
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.optimization._

// Load the sparse LibSVM-format dataset from HDFS
val data = MLUtils.loadLibSVMFile(sc,
  "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
  multiclass=false)
val numFeatures = data.take(1)(0).features.size

val training = data.map(x => (x.label, MLUtils.appendBias(x.features))).cache()

// Run training algorithm to build the model
val numCorrections = 10
val convergenceTol = 1e-4
val maxNumIterations = 20
val regParam = 0.1
// All-zero initial weights, with one extra slot for the appended bias term
val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1))

val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
  training,
  new LogisticGradient(),
  new SquaredL2Updater(),
  numCorrections,
  convergenceTol,
  maxNumIterations,
  regParam,
  initialWeightsWithIntercept)
-----------------------------------------------------------------------
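
For completeness, the BinaryClassificationMetrics and LogisticRegressionModel
imports above were meant for the evaluation step below. This is only a sketch
continuing from the snippet above (we never reach it because the aggregate
hangs), and it assumes the intercept is the last element of
weightsWithIntercept:

// Rebuild a LogisticRegressionModel from the trained weights; the last
// element of weightsWithIntercept is the appended bias/intercept term.
val w = weightsWithIntercept.toArray
val model = new LogisticRegressionModel(
  Vectors.dense(w.slice(0, w.length - 1)),
  w(w.length - 1))

// Return raw scores instead of 0/1 labels so we can compute AUC.
model.clearThreshold()

// Score the original data and compute the area under the ROC curve.
val scoreAndLabels = data.map { point =>
  (model.predict(point.features), point.label)
}
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC = " + metrics.areaUnderROC())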


Thanks,
Makoto

2014-06-17 21:32 GMT+09:00 Makoto Yui <yuin...@gmail.com>:
> Hello,
>
> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
> Hadoop 0.20.2-cdh3u6, but it does not work for a sparse dataset even though
> the number of training examples used in the evaluation is just 1,000.
>
> It works fine for the dataset *news20.binary.1000*, which has 178,560
> features. However, it does not work for *news20.random.1000*, where the
> number of features is much larger (1,354,731), even though we use sparse
> vectors through MLUtils.loadLibSVMFile().
>
> The execution does not seem to be progressing, and no error is reported in
> the spark-shell or in the stdout/stderr of the executors.
>
> We used 32 executors, each with 7 GB of working memory (2 GB of which is
> for RDD caching).
>
> Any suggestions? Your help is really appreciated.
>
> ==============
> Executed code
> ==============
> import org.apache.spark.mllib.util.MLUtils
> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>
> //val training = MLUtils.loadLibSVMFile(sc,
> //  "hdfs://host:8020/dataset/news20-binary/news20.binary.1000",
> //  multiclass=false)
> val training = MLUtils.loadLibSVMFile(sc,
>   "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>   multiclass=false)
>
> val numFeatures = training.take(1)(0).features.size
> //numFeatures: Int = 178560 for news20.binary.1000
> //numFeatures: Int = 1354731 for news20.random.1000
> val model = LogisticRegressionWithSGD.train(training, numIterations=1)
>
> ==================================
> The dataset used in the evaluation
> ==================================
>
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>
> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.binary.1000
> $ sort -R news20.binary > news20.random
> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' > news20.random.1000
>
> You can find the datasets at
> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>
>
> Thanks,
> Makoto
