Hi Xiangrui,
(2014/06/18 4:58), Xiangrui Meng wrote:
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
The training data "news20.random.1000" is small and thus only 2
partitions are used by the default.
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false)
We also tried 32 partitions as follows, but the aggregate never finishes.
val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false, numFeatures = 1354731, minPartitions = 32)
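Just to confirm I am applying the coalesce suggestion correctly, this is
what I understand it to mean (the target of 8 partitions below is an
arbitrary placeholder I chose, not a recommendation from you):

-------------------
// Check how many partitions the loaded RDD actually has.
println(training.partitions.length)

// Merge down to fewer partitions before calling the ML algorithm;
// coalesce avoids a full shuffle, unlike repartition.
val coalesced = training.coalesce(8)  // 8 is an arbitrary placeholder
-------------------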
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree
I used tree aggregate in this branch. It should help with scalability.
Is treeAggregate itself available in Spark 1.0?
I wonder: could I test your modification just by running the following
code in the REPL?
-------------------
val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](weights.size), 0.0))(
    // seqOp: accumulate one example's gradient (in place) and loss
    seqOp = (c, v) => (c, v) match {
      case ((grad, loss), (label, features)) =>
        val l = gradient.compute(features, label, weights,
          Vectors.fromBreeze(grad))
        (grad, loss + l)
    },
    // combOp: merge two partial (gradient, loss) accumulators
    combOp = (c1, c2) => (c1, c2) match {
      case ((grad1, loss1), (grad2, loss2)) =>
        (grad1 += grad2, loss1 + loss2)
    },
    2)  // depth of the aggregation tree
-------------------
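For reference, this is roughly how I would set up the surrounding names
in the REPL before pasting that snippet. The concrete choices here
(LogisticGradient, zero initial weights, miniBatchFraction = 1.0, i = 1)
are placeholders I picked for a smoke test, not taken from your branch:

-------------------
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.LogisticGradient

// (label, features) pairs from the LabeledPoints loaded above.
val data = training.map(lp => (lp.label, lp.features))

// Placeholder values for the free names in the snippet.
val gradient = new LogisticGradient()
val weights = Vectors.dense(new Array[Double](1354731))  // zero-initialized
val miniBatchFraction = 1.0
val i = 1
-------------------

I assume treeAggregate itself would still have to come from your branch,
e.g. through an implicit conversion such as mllib's RDDFunctions (just my
guess), which is exactly the part I am unsure about.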
Rebuilding Spark just to run this evaluation is a lot of work.
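If rebuilding turns out to be unavoidable, I might first try approximating
the tree-shaped aggregation with the stock 1.0 RDD API, along these lines.
This is only a sketch of my understanding of the idea; depth = 2 and the
fan-in scheme are my guesses, not your implementation:

-------------------
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.0
import org.apache.spark.rdd.RDD

// Tree-shaped aggregate: fold inside each partition first, then merge the
// per-partition results in rounds instead of all at once on the driver.
def treeAggregateSketch[T, U: ClassTag](rdd: RDD[T], zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2): U = {
  var partials: RDD[U] =
    rdd.mapPartitions(it => Iterator(it.foldLeft(zero)(seqOp)))
  var parts = partials.partitions.length
  // Fan-in per round so that `depth` rounds shrink to a handful of values.
  val scale = math.max(math.ceil(math.pow(parts, 1.0 / depth)).toInt, 2)
  while (parts > scale) {
    parts = math.max(parts / scale, 1)
    partials = partials
      .mapPartitionsWithIndex((idx, it) => it.map(u => (idx % parts, u)))
      .reduceByKey(new HashPartitioner(parts), combOp)
      .values
  }
  partials.reduce(combOp)
}
-------------------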
Thanks,
Makoto