Hi Xiangrui,
(2014/06/18 4:58), Xiangrui Meng wrote:
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
The training data "news20.random.1000" is small and thus only 2
partitions are used by the default.
import org.apache.spark.mllib.util.MLUtils

val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false)
We also tried 32 partitions as follows, but the aggregate never finishes.
val training = MLUtils.loadLibSVMFile(sc,
  "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
  multiclass = false, numFeatures = 1354731, minPartitions = 32)
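Just to confirm I am applying the coalesce suggestion correctly, this is
what I understand it to mean (the target of 8 partitions below is an
arbitrary placeholder I chose, not a recommendation from you):

-------------------
// Check how many partitions the loaded RDD actually has.
println(training.partitions.length)

// Merge down to fewer partitions before calling the ML algorithm;
// coalesce avoids a full shuffle, unlike repartition.
val coalesced = training.coalesce(8)  // 8 is an arbitrary placeholder
-------------------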
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree
I used tree aggregate in this branch. It should help with scalability.
Is treeAggregate itself available in Spark 1.0?
I wonder: could I test your modification just by running the following
code in the REPL?
-------------------
val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
  .treeAggregate((BDV.zeros[Double](weights.size), 0.0))(
    // seqOp: accumulate one example's gradient (in place) and loss
    seqOp = (c, v) => (c, v) match {
      case ((grad, loss), (label, features)) =>
        val l = gradient.compute(features, label, weights,
          Vectors.fromBreeze(grad))
        (grad, loss + l)
    },
    // combOp: merge two partial (gradient, loss) accumulators
    combOp = (c1, c2) => (c1, c2) match {
      case ((grad1, loss1), (grad2, loss2)) =>
        (grad1 += grad2, loss1 + loss2)
    },
    2)  // depth of the aggregation tree
-------------------
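For reference, this is roughly how I would set up the surrounding names
in the REPL before pasting that snippet. The concrete choices here
(LogisticGradient, zero initial weights, miniBatchFraction = 1.0, i = 1)
are placeholders I picked for a smoke test, not taken from your branch:

-------------------
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.LogisticGradient

// (label, features) pairs from the LabeledPoints loaded above.
val data = training.map(lp => (lp.label, lp.features))

// Placeholder values for the free names in the snippet.
val gradient = new LogisticGradient()
val weights = Vectors.dense(new Array[Double](1354731))  // zero-initialized
val miniBatchFraction = 1.0
val i = 1
-------------------

I assume treeAggregate itself would still have to come from your branch,
e.g. through an implicit conversion such as mllib's RDDFunctions (just my
guess), which is exactly the part I am unsure about.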
Rebuilding Spark just to run this evaluation is a lot of work.
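If rebuilding turns out to be unavoidable, I might first try approximating
the tree-shaped aggregation with the stock 1.0 RDD API, along these lines.
This is only a sketch of my understanding of the idea; depth = 2 and the
fan-in scheme are my guesses, not your implementation:

-------------------
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.0
import org.apache.spark.rdd.RDD

// Tree-shaped aggregate: fold inside each partition first, then merge the
// per-partition results in rounds instead of all at once on the driver.
def treeAggregateSketch[T, U: ClassTag](rdd: RDD[T], zero: U)(
    seqOp: (U, T) => U, combOp: (U, U) => U, depth: Int = 2): U = {
  var partials: RDD[U] =
    rdd.mapPartitions(it => Iterator(it.foldLeft(zero)(seqOp)))
  var parts = partials.partitions.length
  // Fan-in per round so that `depth` rounds shrink to a handful of values.
  val scale = math.max(math.ceil(math.pow(parts, 1.0 / depth)).toInt, 2)
  while (parts > scale) {
    parts = math.max(parts / scale, 1)
    partials = partials
      .mapPartitionsWithIndex((idx, it) => it.map(u => (idx % parts, u)))
      .reduceByKey(new HashPartitioner(parts), combOp)
      .values
  }
  partials.reduce(combOp)
}
-------------------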
Thanks,
Makoto