Also, how many rejected steps will it take to terminate the optimization
process? How is that related to "numberOfImprovementFailures"?

Thanks.

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
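For illustration, a minimal sketch of driving Breeze's LBFGS through its
"iterations" method so that only accepted steps contribute to the loss
history. The toy objective is made up, and the reading of the termination
rule in the comments (the state's numImprovementFailures counter reaching
the minimizer's numberOfImprovementFailures parameter) is an assumption
from the Breeze source of the time, not a documented contract:

import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

object LossHistorySketch extends App {
  // A simple convex objective: f(x) = ||x - 3||^2, gradient 2 * (x - 3).
  val f = new DiffFunction[DenseVector[Double]] {
    def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
      val diff = x - 3.0
      (diff dot diff, diff * 2.0)
    }
  }

  val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)

  // Each State yielded by `iterations` is an accepted step; trial steps
  // rejected by the line search never appear here, so this history should
  // be non-increasing for a convex objective. Iteration ends once the
  // state's numImprovementFailures counter (an assumption, see above)
  // reaches the configured numberOfImprovementFailures, or another
  // stopping rule fires first.
  val lossHistory = lbfgs
    .iterations(f, DenseVector.zeros[Double](5))
    .map(_.value)
    .toVector

  println(lossHistory.mkString("\n"))
}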
On Sun, Apr 27, 2014 at 11:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Hi David,
>
> I'm recording the loss history in the DiffFunction implementation, and
> that's why the rejected steps are also recorded in my loss history.
>
> Is there any API in Breeze LBFGS to get a history that already excludes
> the rejected steps? Or should I just call the "iterations" method and
> check "iteratingShouldStop" instead?
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
> On Fri, Apr 25, 2014 at 3:10 PM, David Hall <d...@cs.berkeley.edu> wrote:
>> LBFGS will not take a step that sends the objective value up. It might
>> try a step that is "too big" and reject it, so if you're just logging
>> everything that gets tried by LBFGS, you could see that. The
>> "iterations" method of the minimizer should never return an increasing
>> objective value. If you're regularizing, are you including the
>> regularizer in the objective value computation?
>>
>> GD is almost never worth your time.
>>
>> -- David
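To make David's regularization point concrete: if only the data loss is
logged while the optimizer minimizes data loss plus regularizer, the
logged curve can look non-monotone even though the full objective
decreases. A hedged sketch for L2 regularization (the names adjusted and
lambda are illustrative, not MLlib's):

import breeze.linalg.{norm, DenseVector}

object AdjustedObjective {
  // The logged objective should be dataLoss + regVal, with both terms
  // computed from the same weight vector.
  def adjusted(dataLoss: Double, w: DenseVector[Double], lambda: Double): Double = {
    val regVal = 0.5 * lambda * math.pow(norm(w), 2)
    dataLoss + regVal
  }
}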
>> On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>> Another interesting benchmark.
>>>
>>> *News20 dataset - 0.14M rows, 1,355,191 features, 0.034% non-zero
>>> elements.*
>>>
>>> LBFGS converges in 70 seconds, while GD seems to make no progress.
>>>
>>> The dense feature vectors would be too big to fit in memory, so I only
>>> ran the sparse benchmark.
>>>
>>> I saw that the loss sometimes bumps up, which seems weird to me. Since
>>> the cost function of logistic regression is convex, it should be
>>> monotonically decreasing. David, any suggestion?
>>>
>>> The detailed figure:
>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/news20.pdf
>>>
>>> *Rcv1 dataset - 6.8M rows, 677,399 features, 0.15% non-zero elements.*
>>>
>>> LBFGS converges in 25 seconds, while GD also seems to make no progress.
>>>
>>> I only ran the sparse benchmark, for the same reason. I also saw the
>>> loss bump up, for unknown reasons.
>>>
>>> The detailed figure:
>>> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/0b774682e398b4f7e0ce01a69c44000eb0e73454/result/rcv1.pdf
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> -------------------------------------------------------
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>> On Thu, Apr 24, 2014 at 2:36 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> rcv1.binary is too sparse (0.15% non-zero elements), so the dense
>>>> format will not run; it runs out of memory. But the sparse format
>>>> runs really well.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> -------------------------------------------------------
>>>> My Blog: https://www.dbtsai.com
>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>
>>>> On Thu, Apr 24, 2014 at 1:54 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>> I start the timer in runMiniBatchSGD right after
>>>>> val numExamples = data.count().
>>>>>
>>>>> See the following. Running the rcv1 dataset now; I will update soon.
>>>>>
>>>>> val startTime = System.nanoTime()
>>>>> for (i <- 1 to numIterations) {
>>>>>   // Sample a subset (fraction miniBatchFraction) of the total data;
>>>>>   // compute and sum up the subgradients on this subset (this is one
>>>>>   // map-reduce).
>>>>>   val (gradientSum, lossSum) =
>>>>>     data.sample(false, miniBatchFraction, 42 + i)
>>>>>       .aggregate((BDV.zeros[Double](weights.size), 0.0))(
>>>>>         seqOp = (c, v) => (c, v) match {
>>>>>           case ((grad, loss), (label, features)) =>
>>>>>             val l = gradient.compute(features, label, weights,
>>>>>               Vectors.fromBreeze(grad))
>>>>>             (grad, loss + l)
>>>>>         },
>>>>>         combOp = (c1, c2) => (c1, c2) match {
>>>>>           case ((grad1, loss1), (grad2, loss2)) =>
>>>>>             (grad1 += grad2, loss1 + loss2)
>>>>>         })
>>>>>
>>>>>   /**
>>>>>    * NOTE(Xinghao): lossSum is computed using the weights from the
>>>>>    * previous iteration, and regVal is the regularization value
>>>>>    * computed in the previous iteration as well.
>>>>>    */
>>>>>   stochasticLossHistory.append(lossSum / miniBatchSize + regVal)
>>>>>   val update = updater.compute(
>>>>>     weights, Vectors.fromBreeze(gradientSum / miniBatchSize),
>>>>>     stepSize, i, regParam)
>>>>>   weights = update._1
>>>>>   regVal = update._2
>>>>>   timeStamp.append(System.nanoTime() - startTime)
>>>>> }
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> DB Tsai
>>>>> -------------------------------------------------------
>>>>> My Blog: https://www.dbtsai.com
>>>>> LinkedIn: https://www.linkedin.com/in/dbtsai
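An aside on the timing methodology that comes up below: cache() is lazy,
so an action such as count() must run before the timer starts, or the
first timed iteration also pays the cost of materializing the RDD. A
minimal sketch, where parse is a hypothetical stand-in for the
benchmark's input parsing:

import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TimedSetupSketch {
  // Returns the cached RDD plus the timer start, taken only after the
  // cache has been materialized by count().
  def timedSetup[T: ClassTag](sc: SparkContext, path: String,
                              parse: String => T): (RDD[T], Long) = {
    val data = sc.textFile(path).map(parse).cache()
    data.count()               // force materialization before timing
    (data, System.nanoTime())  // later iterations then exclude caching cost
  }
}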
>>>>> On Thu, Apr 24, 2014 at 1:44 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>> I don't understand why sparse falls so far behind dense at the very
>>>>>> first iteration. I didn't see count() called in
>>>>>> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala.
>>>>>> Maybe you have local uncommitted changes.
>>>>>>
>>>>>> Best,
>>>>>> Xiangrui
>>>>>>
>>>>>> On Thu, Apr 24, 2014 at 11:26 AM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> > Hi Xiangrui,
>>>>>> >
>>>>>> > Yes, I'm using yarn-cluster mode, and I did check that the number
>>>>>> > of executors I specified matches the number actually running.
>>>>>> >
>>>>>> > For caching and materialization, the timer in the optimizer starts
>>>>>> > after calling count(); as a result, the time to materialize the
>>>>>> > cache isn't included in the benchmark.
>>>>>> >
>>>>>> > The difference you saw is actually between dense and sparse
>>>>>> > feature vectors. With dense features, you can see that the first
>>>>>> > iteration takes the same time for LBFGS and GD.
>>>>>> >
>>>>>> > I'm going to run rcv1.binary, which has only 0.15% non-zero
>>>>>> > elements, to verify the hypothesis.
>>>>>> >
>>>>>> > Sincerely,
>>>>>> >
>>>>>> > DB Tsai
>>>>>> > -------------------------------------------------------
>>>>>> > My Blog: https://www.dbtsai.com
>>>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>> >
>>>>>> > On Thu, Apr 24, 2014 at 1:09 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>> >> Hi DB,
>>>>>> >>
>>>>>> >> I saw you are using yarn-cluster mode for the benchmark. I tested
>>>>>> >> yarn-cluster mode and found that YARN does not always give you
>>>>>> >> the exact number of executors requested. Just want to confirm
>>>>>> >> that you've checked the number of executors.
>>>>>> >>
>>>>>> >> The second thing to check is that in the benchmark code, after
>>>>>> >> you call cache(), you should also call count() to materialize the
>>>>>> >> RDD. I saw in the results that the real difference is actually at
>>>>>> >> the first step. Adding the intercept is not a cheap operation for
>>>>>> >> sparse vectors.
>>>>>> >>
>>>>>> >> Best,
>>>>>> >> Xiangrui
>>>>>> >>
>>>>>> >> On Thu, Apr 24, 2014 at 12:53 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>> >> > I don't think it is easy to make sparse faster than dense at
>>>>>> >> > this sparsity and feature dimension. You can try rcv1.binary,
>>>>>> >> > which should show the difference easily.
>>>>>> >> >
>>>>>> >> > David, the Breeze operators used here are
>>>>>> >> >
>>>>>> >> > 1. DenseVector dot SparseVector
>>>>>> >> > 2. axpy DenseVector SparseVector
>>>>>> >> >
>>>>>> >> > However, the SparseVector is passed in as Vector[Double]
>>>>>> >> > instead of SparseVector[Double]. It might use the axpy impl of
>>>>>> >> > [DenseVector, Vector] and call activeIterator. I didn't check
>>>>>> >> > whether you used multimethods on axpy.
>>>>>> >> >
>>>>>> >> > Best,
>>>>>> >> > Xiangrui
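A hedged sketch of the dispatch issue Xiangrui raises: Breeze picks the
operator implementation from the static type of the argument at compile
time (unless a runtime multimethod recovers the fast path, which is
exactly the open question above), so the same SparseVector value can go
down a sparse-aware path or a generic one depending on how it is
declared:

import breeze.linalg.{axpy, DenseVector, SparseVector, Vector}

object AxpyDispatchSketch extends App {
  val n = 1000000
  val w = DenseVector.zeros[Double](n)

  val x: SparseVector[Double] = SparseVector(n)(3 -> 1.0, 999 -> 2.0)
  val xv: Vector[Double] = x // the same value, with a wider static type

  // Statically a SparseVector: eligible for the [DenseVector,
  // SparseVector] implementation, which touches only active entries.
  axpy(0.5, x, w)

  // Statically a Vector: the implicit chosen at compile time may be the
  // generic [DenseVector, Vector] implementation instead.
  axpy(0.5, xv, w)
}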
>>>>>> >> > On Wed, Apr 23, 2014 at 10:35 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> >> >> The figure showing the log-likelihood vs. time can be found
>>>>>> >> >> here:
>>>>>> >> >>
>>>>>> >> >> https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result/a9a11M.pdf
>>>>>> >> >>
>>>>>> >> >> Let me know if you cannot open it. Thanks.
>>>>>> >> >>
>>>>>> >> >> Sincerely,
>>>>>> >> >>
>>>>>> >> >> DB Tsai
>>>>>> >> >> -------------------------------------------------------
>>>>>> >> >> My Blog: https://www.dbtsai.com
>>>>>> >> >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>>>> >> >>
>>>>>> >> >> On Wed, Apr 23, 2014 at 9:34 PM, Shivaram Venkataraman
>>>>>> >> >> <shiva...@eecs.berkeley.edu> wrote:
>>>>>> >> >>> I don't think the attachment came through on the list. Could
>>>>>> >> >>> you upload the results somewhere and link to them?
>>>>>> >> >>>
>>>>>> >> >>> On Wed, Apr 23, 2014 at 9:32 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>> >> >>>> 123 features per row, and on average 89% are zeros.
>>>>>> >> >>>>
>>>>>> >> >>>> On Apr 23, 2014 9:31 PM, "Evan Sparks" <evan.spa...@gmail.com> wrote:
>>>>>> >> >>>> > What is the number of non-zeros per row (and the number of
>>>>>> >> >>>> > features) in the sparse case? We've hit some issues with
>>>>>> >> >>>> > Breeze sparse support in the past, but for sufficiently
>>>>>> >> >>>> > sparse data it's still pretty good.
>>>>>> >> >>>> >
>>>>>> >> >>>> > > On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > Hi all,
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > I'm benchmarking logistic regression in MLlib using the
>>>>>> >> >>>> > > newly added optimizers LBFGS and GD. I'm using the same
>>>>>> >> >>>> > > dataset and the same methodology as in this paper,
>>>>>> >> >>>> > > http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > I want to know how Spark scales while adding workers, and
>>>>>> >> >>>> > > how the optimizers and the input format (sparse or dense)
>>>>>> >> >>>> > > impact performance.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > The benchmark code can be found here:
>>>>>> >> >>>> > > https://github.com/dbtsai/spark-lbfgs-benchmark
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > The first dataset I benchmarked is a9a, which is only
>>>>>> >> >>>> > > 2.2MB. I duplicated the dataset to 762MB so that it has
>>>>>> >> >>>> > > 11M rows. This dataset has 123 features, and 11% of the
>>>>>> >> >>>> > > entries are non-zero.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > In this benchmark, the whole dataset is cached in memory.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > As we expected, LBFGS converges faster than GD, and at
>>>>>> >> >>>> > > some point, no matter how we push GD, it converges slower
>>>>>> >> >>>> > > and slower.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > However, it's surprising that the sparse format runs
>>>>>> >> >>>> > > slower than the dense format. I did see that the sparse
>>>>>> >> >>>> > > format takes a significantly smaller amount of memory
>>>>>> >> >>>> > > when caching the RDD, but sparse is 40% slower than
>>>>>> >> >>>> > > dense. I think sparse should be fast: when we compute
>>>>>> >> >>>> > > x . w^T, since x is sparse, we can do it faster. I wonder
>>>>>> >> >>>> > > if there is anything I'm doing wrong.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > The attachment is the benchmark result.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > Thanks.
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > Sincerely,
>>>>>> >> >>>> > >
>>>>>> >> >>>> > > DB Tsai
>>>>>> >> >>>> > > -------------------------------------------------------
>>>>>> >> >>>> > > My Blog: https://www.dbtsai.com
>>>>>> >> >>>> > > LinkedIn: https://www.linkedin.com/in/dbtsai
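Finally, to make the "compute x . w^T faster when x is sparse" remark
concrete, a hedged sketch of a sparse-aware logistic loss and gradient:
both the margin and the gradient update walk only the non-zero entries
of x, so each example costs O(nnz(x)) rather than O(#features). The
names are illustrative, not MLlib's:

import breeze.linalg.{DenseVector, SparseVector}

object SparseLogisticGradientSketch {
  // Adds this example's gradient into cumGradient and returns its
  // log-loss. Both passes cost O(nnz(x)), not O(dim(x)).
  def compute(x: SparseVector[Double],
              label: Double, // 0.0 or 1.0
              w: DenseVector[Double],
              cumGradient: DenseVector[Double]): Double = {
    // margin = w . x, over the active entries of x only
    var margin = 0.0
    x.activeIterator.foreach { case (i, v) => margin += w(i) * v }

    // d(loss)/d(margin) = sigmoid(margin) - label
    val multiplier = 1.0 / (1.0 + math.exp(-margin)) - label

    // cumGradient += multiplier * x, again over active entries only
    x.activeIterator.foreach { case (i, v) => cumGradient(i) += multiplier * v }

    // log-loss of this example
    if (label > 0) math.log1p(math.exp(-margin))
    else math.log1p(math.exp(-margin)) + margin
  }
}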