Sorry - I just saw the 11% number. That is around the point where dense data is 
usually faster (blocking, cache coherence, etc.). Is there any chance you have a 
1% (or so) sparse dataset to experiment with?
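
In case a quick check helps, here is a minimal sketch (my own, assuming the data 
is in LibSVM format and that the path and app name are placeholders) for measuring 
the non-zero fraction of a dataset before rerunning:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext("local[4]", "sparsity-check")

// Load the LibSVM-formatted data; the path is a placeholder.
val points = MLUtils.loadLibSVMFile(sc, "data/a9a")

// Count non-zero entries vs. total entries across the whole dataset.
val (nnz, total) = points.map { lp =>
  (lp.features.toArray.count(_ != 0.0).toLong, lp.features.size.toLong)
}.reduce { case ((n1, t1), (n2, t2)) => (n1 + n2, t1 + t2) }

println("non-zero fraction = " + nnz.toDouble / total)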

> On Apr 23, 2014, at 9:21 PM, DB Tsai <dbt...@stanford.edu> wrote:
> 
> Hi all,
> 
> I'm benchmarking Logistic Regression in MLlib using the newly added LBFGS 
> optimizer and GD. I'm using the same dataset and the same methodology as in this 
> paper: http://www.csie.ntu.edu.tw/~cjlin/papers/l1.pdf
> 
> I want to know how Spark scales as workers are added, and how the optimizers and 
> the input format (sparse or dense) impact performance. 
> 
> The benchmark code can be found here, 
> https://github.com/dbtsai/spark-lbfgs-benchmark
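> 
> For reference, this is roughly how the two optimizers are invoked in MLlib (a 
> sketch with placeholder parameter values, not the exact benchmark code):
> 
> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> import org.apache.spark.mllib.optimization.{GradientDescent, LBFGS, LogisticGradient, SquaredL2Updater}
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.rdd.RDD
> 
> def runBoth(points: RDD[LabeledPoint], numFeatures: Int): Unit = {
>   // Both optimizers take (label, features) pairs.
>   val data = points.map(lp => (lp.label, lp.features)).cache()
>   val initialWeights: Vector = Vectors.dense(Array.fill(numFeatures)(0.0))
> 
>   // L-BFGS: quasi-Newton, keeps the last numCorrections gradient/step pairs.
>   val (weightsLBFGS, lossHistoryLBFGS) = LBFGS.runLBFGS(
>     data, new LogisticGradient(), new SquaredL2Updater(),
>     10,    // numCorrections
>     1e-4,  // convergenceTol
>     50,    // maxNumIterations
>     0.0,   // regParam
>     initialWeights)
> 
>   // GD: mini-batch SGD; miniBatchFraction = 1.0 makes it full-batch gradient descent.
>   val (weightsGD, lossHistoryGD) = GradientDescent.runMiniBatchSGD(
>     data, new LogisticGradient(), new SquaredL2Updater(),
>     1.0,   // stepSize
>     50,    // numIterations
>     0.0,   // regParam
>     1.0,   // miniBatchFraction
>     initialWeights)
> }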
> 
> The first dataset I benchmarked is a9a, which is only 2.2MB. I duplicated the 
> dataset to 762MB so that it has 11M rows. This dataset has 123 features, and 
> 11% of the elements are non-zero. 
> 
> In this benchmark, the entire dataset is cached in memory.
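> 
> (For the dense runs, the sketch below shows roughly what I mean by "dense 
> format": the same LibSVM data with every feature vector materialized as a full 
> array before caching. It is a simplified sketch assuming spark-shell, where sc 
> is already defined, and a placeholder path; it is not the exact benchmark code.)
> 
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.util.MLUtils
> 
> // Sparse format: loadLibSVMFile keeps the feature vectors sparse.
> val sparsePoints = MLUtils.loadLibSVMFile(sc, "data/a9a-replicated").cache()
> 
> // Dense format: expand every feature vector into a full Array[Double] of length 123.
> val densePoints = sparsePoints.map { lp =>
>   LabeledPoint(lp.label, Vectors.dense(lp.features.toArray))
> }.cache()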
> 
> As expected, LBFGS converges faster than GD, and beyond a certain point, no 
> matter how we tune GD, it converges more and more slowly. 
> 
> However, it's surprising that the sparse format runs slower than the dense 
> format. I did see that the sparse format takes a significantly smaller amount 
> of memory when caching the RDD, but sparse is 40% slower than dense. I think 
> sparse should be faster: when we compute w^T x, x is sparse, so we only need to 
> touch its non-zero elements. I wonder if there is anything I'm doing wrong. 
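> 
> To be concrete about why I expect sparse to win on the dot products, here is a 
> simplified sketch of a sparse vs. dense w^T x (not MLlib's actual kernel): the 
> sparse version only touches the ~11% non-zero entries of each row.
> 
> import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}
> 
> // Dense dot product: touches all 123 features of every row.
> def dotDense(w: Array[Double], x: DenseVector): Double = {
>   var sum = 0.0
>   var i = 0
>   while (i < x.values.length) {
>     sum += w(i) * x.values(i)
>     i += 1
>   }
>   sum
> }
> 
> // Sparse dot product: touches only the stored (index, value) pairs.
> def dotSparse(w: Array[Double], x: SparseVector): Double = {
>   var sum = 0.0
>   var k = 0
>   while (k < x.indices.length) {
>     sum += w(x.indices(k)) * x.values(k)
>     k += 1
>   }
>   sum
> }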
> 
> The attachment is the benchmark result.
> 
> Thanks.  
> 
> Sincerely,
> 
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
