rcv1.binary is too sparse (0.15% non-zero elements), so dense format will not run due to out of memory. But sparse format runs really well.
Sincerely, DB Tsai ------------------------------------------------------- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote: > I'm doing the timer in runMiniBatchSGD after val numExamples = > data.count() > > See the following. Running rcv1 dataset now, and will update soon. > > val startTime = System.nanoTime() > for (i <- 1 to numIterations) { > // Sample a subset (fraction miniBatchFraction) of the total data > // compute and sum up the subgradients on this subset (this is one > map-reduce) > val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, > 42 + i) > .aggregate((BDV.zeros[Double](weights.size), 0.0))( > seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, > features)) => > val l = gradient.compute(features, label, weights, > Vectors.fromBreeze(grad)) > (grad, loss + l) > }, > combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), > (grad2, loss2)) => > (grad1 += grad2, loss1 + loss2) > }) > > /** > * NOTE(Xinghao): lossSum is computed using the weights from the > previous iteration > * and regVal is the regularization value computed in the previous > iteration as well. > */ > stochasticLossHistory.append(lossSum / miniBatchSize + regVal) > val update = updater.compute( > weights, Vectors.fromBreeze(gradientSum / miniBatchSize), > stepSize, i, regParam) > weights = update._1 > regVal = update._2 > timeStamp.append(System.nanoTime() - startTime) > } > > > > > > > Sincerely, > > DB Tsai > ------------------------------------------------------- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote: > >> I don't understand why sparse falls behind dense so much at the very >> first iteration. I didn't see count() is called in >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala >> . Maybe you have local uncommitted changes. >> >> Best, >> Xiangrui >> >> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote: >> > Hi Xiangrui, >> > >> > Yes, I'm using yarn-cluster mode, and I did check # of executors I >> specified >> > are the same as the actual running executors. >> > >> > For caching and materialization, I've the timer in optimizer after >> calling >> > count(); as a result, the time for materialization in cache isn't in the >> > benchmark. >> > >> > The difference you saw is actually from dense feature or sparse feature >> > vector. For LBFGS and GD dense feature, you can see the first iteration >> > takes the same time. It's true for GD. >> > >> > I'm going to run rcv1.binary which only has 0.15% non-zero elements to >> > verify the hypothesis. >> > >> > >> > Sincerely, >> > >> > DB Tsai >> > ------------------------------------------------------- >> > My Blog: https://www.dbtsai.com >> > LinkedIn: https://www.linkedin.com/in/dbtsai >> > >> > >> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> >> wrote: >> >> >> >> Hi DB, >> >> >> >> I saw you are using yarn-cluster mode for the benchmark. I tested the >> >> yarn-cluster mode and found that YARN does not always give you the >> >> exact number of executors requested. Just want to confirm that you've >> >> checked the number of executors. >> >> >> >> The second thing to check is that in the benchmark code, after you >> >> call cache, you should also call count() to materialize the RDD. I saw >> >> in the result, the real difference is actually at the first step. >> >> Adding intercept is not a cheap operation for sparse vectors. >> >> >> >> Best, >> >> Xiangrui >> >> >> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> >> wrote: >> >> > I don't think it is easy to make sparse faster than dense with this >> >> > sparsity and feature dimension. You can try rcv1.binary, which should >> >> > show the difference easily. >> >> > >> >> > David, the breeze operators used here are >> >> > >> >> > 1. DenseVector dot SparseVector >> >> > 2. axpy DenseVector SparseVector >> >> > >> >> > However, the SparseVector is passed in as Vector[Double] instead of >> >> > SparseVector[Double]. It might use the axpy impl of [DenseVector, >> >> > Vector] and call activeIterator. I didn't check whether you used >> >> > multimethods on axpy. >> >> > >> >> > Best, >> >> > Xiangrui >> >> > >> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> >> wrote: >> >> >> The figure showing the Log-Likelihood vs Time can be found here. >> >> >> >> >> >> >> >> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf >> >> >> >> >> >> Let me know if you can not open it. Thanks. >> >> >> >> >> >> Sincerely, >> >> >> >> >> >> DB Tsai >> >> >> ------------------------------------------------------- >> >> >> My Blog: https://www.dbtsai.com >> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai >> >> >> >> >> >> >> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman >> >> >> <shiva...@eecs.berkeley.edu> wrote: >> >> >>> I don't think the attachment came through in the list. Could you >> >> >>> upload the >> >> >>> results somewhere and link to them ? >> >> >>> >> >> >>> >> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> >> wrote: >> >> >>>> >> >> >>>> 123 features per rows, and in average, 89% are zeros. >> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> >> wrote: >> >> >>>> >> >> >>>> > What is the number of non zeroes per row (and number of >> features) >> >> >>>> > in the >> >> >>>> > sparse case? We've hit some issues with breeze sparse support in >> >> >>>> > the >> >> >>>> > past >> >> >>>> > but for sufficiently sparse data it's still pretty good. >> >> >>>> > >> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> >> wrote: >> >> >>>> > > >> >> >>>> > > Hi all, >> >> >>>> > > >> >> >>>> > > I'm benchmarking Logistic Regression in MLlib using the newly >> >> >>>> > > added >> >> >>>> > optimizer LBFGS and GD. I'm using the same dataset and the same >> >> >>>> > methodology >> >> >>>> > in this paper, http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf >> >> >>>> > > >> >> >>>> > > I want to know how Spark scale while adding workers, and how >> >> >>>> > > optimizers >> >> >>>> > and input format (sparse or dense) impact performance. >> >> >>>> > > >> >> >>>> > > The benchmark code can be found here, >> >> >>>> > https://github.com/dbtsai/spark-lbfgs-benchmark >> >> >>>> > > >> >> >>>> > > The first dataset I benchmarked is a9a which only has 2.2MB. I >> >> >>>> > duplicated the dataset, and made it 762MB to have 11M rows. This >> >> >>>> > dataset >> >> >>>> > has 123 features and 11% of the data are non-zero elements. >> >> >>>> > > >> >> >>>> > > In this benchmark, all the dataset is cached in memory. >> >> >>>> > > >> >> >>>> > > As we expect, LBFGS converges faster than GD, and at some >> point, >> >> >>>> > > no >> >> >>>> > matter how we push GD, it will converge slower and slower. >> >> >>>> > > >> >> >>>> > > However, it's surprising that sparse format runs slower than >> >> >>>> > > dense >> >> >>>> > format. I did see that sparse format takes significantly smaller >> >> >>>> > amount >> >> >>>> > of >> >> >>>> > memory in caching RDD, but sparse is 40% slower than dense. I >> think >> >> >>>> > sparse >> >> >>>> > should be fast since when we compute x wT, since x is sparse, we >> >> >>>> > can do >> >> >>>> > it >> >> >>>> > faster. I wonder if there is anything I'm doing wrong. >> >> >>>> > > >> >> >>>> > > The attachment is the benchmark result. >> >> >>>> > > >> >> >>>> > > Thanks. >> >> >>>> > > >> >> >>>> > > Sincerely, >> >> >>>> > > >> >> >>>> > > DB Tsai >> >> >>>> > > ------------------------------------------------------- >> >> >>>> > > My Blog: https://www.dbtsai.com >> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai >> >> >>>> > >> >> >>> >> >> >>> >> > >> > >> > >