Sorry - just saw the 11% number. That is around the spot where dense data is usually faster (blocking, cache coherence, etc.). Is there any chance you have a 1% (or so) sparse dataset to experiment with?
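To make that intuition concrete, here is a minimal, self-contained sketch (plain Scala, no Spark dependency; the object name SparseVsDenseDot is made up, and the 123-feature / ~11% figures are just lifted from the benchmark description quoted below). It shows the structural reason a sparse dot product can lose to a dense one at that density: the sparse loop pays an index indirection and a non-contiguous read for every non-zero, while the dense loop streams through memory sequentially.

```scala
object SparseVsDenseDot {
  // Dense dot product: contiguous arrays, sequential reads, easy to prefetch.
  def denseDot(x: Array[Double], w: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < x.length) { sum += x(i) * w(i); i += 1 }
    sum
  }

  // Sparse dot product over (indices, values): fewer multiplies, but every
  // term costs an extra lookup w(indices(k)) that jumps around in memory.
  def sparseDot(indices: Array[Int], values: Array[Double], w: Array[Double]): Double = {
    var sum = 0.0
    var k = 0
    while (k < indices.length) { sum += values(k) * w(indices(k)); k += 1 }
    sum
  }

  def main(args: Array[String]): Unit = {
    val n = 123                                   // feature count of a9a
    val rng = new scala.util.Random(42)
    val w = Array.fill(n)(rng.nextDouble())
    // ~11% non-zeros, roughly matching the benchmark's dataset.
    val dense = Array.tabulate(n)(_ => if (rng.nextDouble() < 0.11) rng.nextDouble() else 0.0)
    val indices = dense.indices.filter(i => dense(i) != 0.0).toArray
    val values  = indices.map(i => dense(i))
    println(f"denseDot  = ${denseDot(dense, w)}%.6f")
    println(f"sparseDot = ${sparseDot(indices, values, w)}%.6f")
  }
}
```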
> On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
>
> Hi all,
>
> I'm benchmarking Logistic Regression in MLlib using the newly added optimizers,
> LBFGS and GD. I'm using the same dataset and the same methodology as in this
> paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
>
> I want to know how Spark scales while adding workers, and how the optimizers
> and input format (sparse or dense) impact performance.
>
> The benchmark code can be found here:
> https://github.com/dbtsai/spark-lbfgs-benchmark
>
> The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated the
> dataset to make it 762MB with 11M rows. This dataset has 123 features, and 11%
> of the data are non-zero elements.
>
> In this benchmark, the whole dataset is cached in memory.
>
> As we expect, LBFGS converges faster than GD, and at some point, no matter
> how we push GD, it converges slower and slower.
>
> However, it's surprising that the sparse format runs slower than the dense
> format. I did see that the sparse format takes a significantly smaller amount
> of memory when caching the RDD, but sparse is 40% slower than dense. I thought
> sparse should be fast: when we compute x^T w, since x is sparse, we can do it
> faster. I wonder if there is anything I'm doing wrong.
>
> The attachment is the benchmark result.
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
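In case it helps with the 1% experiment, here is a rough sketch (plain Scala; the object name MakeSparseDataset, the output file name, and the row count are arbitrary placeholders) that writes a synthetic binary-classification dataset with ~1% non-zero features in LibSVM text format ("label index:value ..."), the same plain-text format a9a ships in, so the existing benchmark could be re-run at a much lower density.

```scala
import java.io.PrintWriter
import scala.util.Random

object MakeSparseDataset {
  def main(args: Array[String]): Unit = {
    val numRows     = 100000        // placeholder row count
    val numFeatures = 123           // same dimensionality as a9a, for comparability
    val density     = 0.01          // ~1% non-zeros per row
    val rng         = new Random(7)

    val out = new PrintWriter("synthetic_1pct.libsvm")
    try {
      for (_ <- 0 until numRows) {
        // Label encoding here is 0/1; adjust to whatever the benchmark's loader expects.
        val label = if (rng.nextBoolean()) 1 else 0
        // LibSVM feature indices are conventionally 1-based and strictly increasing.
        val feats = (1 to numFeatures)
          .filter(_ => rng.nextDouble() < density)
          .map(i => s"$i:${rng.nextDouble()}")
        out.println((label.toString +: feats).mkString(" "))
      }
    } finally {
      out.close()
    }
  }
}
```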