Hi Xiangrui,

(2014/06/18 4:58), Xiangrui Meng wrote:
> How many partitions did you set? If there are too many partitions,
> please do a coalesce before calling ML algorithms.

The training data "news20.random.1000" is small, so only 2 partitions are used by default.

val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/dataset/news20-binary/news20.random.1000", multiclass = false)
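
For reference, the partition count can be checked on the REPL with the standard RDD API:

    training.partitions.size   // => 2 for this file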

We also tried 32 partitions as follows, but the aggregate never finishes.

val training = MLUtils.loadLibSVMFile(sc, "hdfs://host:8020/dataset/news20-binary/news20.random.1000", multiclass = false, numFeatures = 1354731, minPartitions = 32)
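
For the record, I take the suggested coalesce to be something like the following, with the target of 8 partitions only as an example:

    val coalesced = training.coalesce(8)  // merge the 32 input partitions down to 8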

> Btw, could you try the tree branch in my repo?
> https://github.com/mengxr/spark/tree/tree
>
> I used tree aggregate in this branch. It should help with the scalability.

Is treeAggregate itself available in Spark 1.0?
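
As far as I can see, stock 1.0 has only the flat RDD.aggregate, which merges every partition's partial result on the driver in a single step. The same update written with it would look like this (my own sketch, not code from your branch):

-------------------
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vectors

// Flat aggregate: all partial (gradient, loss) pairs go straight to the driver.
val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
  .aggregate((BDV.zeros[Double](weights.size), 0.0))(
    seqOp = (c, v) => (c, v) match {
      case ((grad, loss), (label, features)) =>
        val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
        (grad, loss + l)
    },
    combOp = (c1, c2) => (c1, c2) match {
      case ((grad1, loss1), (grad2, loss2)) =>
        (grad1 += grad2, loss1 + loss2)
    })
-------------------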

I wonder: could I test your modification just by running the following code in the REPL?

-------------------
// (imports as in the sketch above: breeze's BDV alias and mllib's Vectors)
val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](weights.size), 0.0))(
    // seqOp: fold one example's gradient and loss into the running pair
    seqOp = (c, v) => (c, v) match {
      case ((grad, loss), (label, features)) =>
        val l = gradient.compute(features, label, weights, Vectors.fromBreeze(grad))
        (grad, loss + l)
    },
    // combOp: merge two partial (gradient, loss) pairs
    combOp = (c1, c2) => (c1, c2) match {
      case ((grad1, loss1), (grad2, loss2)) =>
        (grad1 += grad2, loss1 + loss2)
    }, depth = 2)  // 2-level tree reduction; named so it can follow the named args
-------------------
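
If treeAggregate is not available on stock 1.0, I imagine a two-level aggregation could be emulated with only the stock API. A rough, untested sketch (twoLevelAggregate and scale are made-up names):

-------------------
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._   // pair-RDD implicits in 1.0
import org.apache.spark.rdd.RDD

// Fold each partition locally, then merge the partials through a small
// shuffle stage so the driver receives `scale` values instead of one per
// partition (zero must be serializable).
def twoLevelAggregate[T, U: ClassTag](rdd: RDD[T], zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U, scale: Int = 8): U = {
  val partials = rdd.mapPartitions(it => Iterator(it.foldLeft(zero)(seqOp)))
  partials
    .mapPartitionsWithIndex { case (i, it) => it.map(v => (i % scale, v)) }
    .reduceByKey(combOp, scale)
    .values
    .reduce(combOp)
}
-------------------

The same seqOp/combOp as above would then be passed to it in place of treeAggregate.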

Rebuilding Spark just to run an evaluation is quite a chore.

Thanks,
Makoto
