Re: Associating user objects with SparkContext/SparkStreamingContext

2016-06-24 Thread Evan Sparks
I would actually think about this the other way around. Move the functions you are passing to the streaming jobs out to their own object if possible. Spark's closure capture rules are necessarily far-reaching and serialize the object that contains these methods, which is a common cause of the pr
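A minimal Python sketch of the idea above, not Spark code: a bound method carries a reference to its whole owning object, while a standalone function carries only its arguments. The names `StreamingJob`, `tag_record`, and `captured_owner` are hypothetical, and the lock merely stands in for any resource that cannot be serialized.

```python
import threading


class StreamingJob:
    """Hypothetical driver-side object that owns a non-serializable resource."""

    def __init__(self):
        self.lock = threading.Lock()  # stand-in for a connection or context
        self.prefix = "evt:"

    def tag(self, record):
        # Bound method: shipping it to a worker means shipping `self` as well.
        return self.prefix + record


def tag_record(prefix, record):
    """Standalone function: depends only on the values passed in."""
    return prefix + record


def captured_owner(fn):
    """Return the object a callable would drag along with it, or None."""
    return getattr(fn, "__self__", None)


job = StreamingJob()
print(captured_owner(job.tag) is job)   # the method holds the whole job
print(captured_owner(tag_record))       # the plain function holds nothing
```

Passing `tag_record` together with the small values it needs keeps the enclosing object, and everything it references, out of the serialized closure.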

Re: Gradient Descent with large model size

2015-10-17 Thread Evan Sparks
Yes, remember that your bandwidth is the maximum number of bytes per second that can be shipped to the driver. So if you've got 5 blocks that size, then it looks like you're basically saturating the network. Aggregation trees help for many partitions/nodes and butterfly mixing can help use all
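A rough back-of-the-envelope sketch of the bandwidth argument above. The block size and link speed in the usage lines are made-up numbers, and `driver_inbound_blocks` is a simplified model of a fan-in tree, not Spark's actual `treeAggregate` scheduling.

```python
import math


def transfer_seconds(num_blocks, block_bytes, bandwidth_bytes_per_sec):
    """Time to ship `num_blocks` blocks over one link at the given bandwidth."""
    return num_blocks * block_bytes / bandwidth_bytes_per_sec


def driver_inbound_blocks(num_partitions, fan_in):
    """Blocks the final node receives when each tree node merges `fan_in` inputs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    remaining = num_partitions
    # Merge level by level until the last level fits into a single node.
    while remaining > fan_in:
        remaining = math.ceil(remaining / fan_in)
    return remaining


# Assumed figures: 5 blocks of 100 MB over a link of roughly 125 MB/s (1 Gbit/s).
print(transfer_seconds(5, 100_000_000, 125_000_000))

# Flat aggregation sends all 64 partial results to the driver;
# under this model a fan-in of 4 leaves only the last level's results.
print(driver_inbound_blocks(64, 4))
```

Under this model the total bytes moved do not shrink, but the traffic is spread over many links instead of converging on the driver's single link.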

Re: Decision forests don't work with non-trivial categorical features

2014-10-12 Thread Evan Sparks
I was under the impression that we were using the usual sort by average response value heuristic when storing histogram bins (and searching for optimal splits) in the tree code. Maybe Manish or Joseph can clarify? > On Oct 12, 2014, at 2:50 PM, Sean Owen wrote: > > I'm having trouble getting
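For readers unfamiliar with the heuristic mentioned above, here is a small sketch in plain Python. It is not the MLlib tree code; the function names are hypothetical. Categories are ordered by their mean label, and only the k - 1 splits along that ordering are considered instead of every subset.

```python
from collections import defaultdict


def order_categories_by_mean_response(categories, labels):
    """Sort the distinct category values by the average label observed for each."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1
    return sorted(sums, key=lambda c: sums[c] / counts[c])


def candidate_splits(ordered_categories):
    """Yield the k - 1 (left, right) partitions along the given ordering."""
    for i in range(1, len(ordered_categories)):
        yield ordered_categories[:i], ordered_categories[i:]


cats = ["a", "b", "c", "a", "b", "c"]
ys = [1.0, 0.0, 0.5, 1.0, 0.0, 0.5]
ordering = order_categories_by_mean_response(cats, ys)
for left, right in candidate_splits(ordering):
    print(left, "|", right)
```

For regression and binary classification, this ordering is known to contain the optimal binary partition, which is why it is the usual way to avoid enumerating all 2^(k-1) - 1 subsets.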

Re: What is the best way to build my developing Spark for testing on EC2?

2014-10-02 Thread Evan Sparks
I recommend using the data generators provided with MLlib to generate synthetic data for your scalability tests - provided they're well suited for your algorithms. They let you control things like number of examples and dimensionality of your dataset, as well as number of partitions. As far as
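To make the knobs mentioned above concrete, here is a plain-Python stand-in for a synthetic data generator. It is not the MLlib generator API; `generate_linear_data` and its parameters are hypothetical and only mirror the controls named in the message (number of examples, dimensionality, number of partitions).

```python
import random


def generate_linear_data(num_examples, num_features, num_partitions,
                         noise=0.1, seed=42):
    """Return (true_weights, partitions) for a noisy linear model.

    Each partition is a list of (label, features) pairs; examples are
    dealt round-robin so partitions differ in size by at most one.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(num_features)]
    partitions = [[] for _ in range(num_partitions)]
    for i in range(num_examples):
        features = [rng.gauss(0.0, 1.0) for _ in range(num_features)]
        label = sum(w * x for w, x in zip(weights, features))
        label += rng.gauss(0.0, noise)  # additive Gaussian noise
        partitions[i % num_partitions].append((label, features))
    return weights, partitions


weights, parts = generate_linear_data(num_examples=1000, num_features=20,
                                      num_partitions=8)
print(len(parts), [len(p) for p in parts])
```

Scaling one knob at a time (examples, features, or partitions) while holding the others fixed makes it easier to see which dimension an algorithm is sensitive to.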

Re: Linear CG solver

2014-06-28 Thread Evan Sparks
Hey, we're actually working on similar ideas in the AMPLab with Spark - for example we've got some image classification pipelines built on this idea - http://www.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf Approximating kernel methods via random projections hit with a nonlinearity. Add
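A minimal sketch of the random-features idea from the linked paper, in pure Python. It assumes an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) and a cosine nonlinearity; `make_random_features` and its parameters are illustrative names, and this is not the AMPLab pipeline code.

```python
import math
import random


def make_random_features(dim, num_features, gamma, seed=0):
    """Build a feature map whose inner products approximate an RBF kernel."""
    rng = random.Random(seed)
    # For exp(-gamma * ||x - y||^2), projections are drawn from N(0, 2 * gamma).
    scale = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(num_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def featurize(x):
        # Random projection followed by a cosine nonlinearity.
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return featurize


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, used here only for comparison."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


featurize = make_random_features(dim=3, num_features=2000, gamma=0.5)
x, y = [0.1, 0.2, 0.3], [0.4, 0.0, -0.1]
approx = sum(a * b for a, b in zip(featurize(x), featurize(y)))
print(approx, rbf_kernel(x, y, 0.5))
```

Once the data is mapped through `featurize`, any linear solver can be run on the result, which is what makes the approach attractive at scale.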

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread Evan Sparks
Sorry - just saw the 11% number. That is around the spot where dense data is usually faster (blocking, cache coherence, etc.). Is there any chance you have a 1% (or so) sparse dataset to experiment with? > On Apr 23, 2014, at 9:21 PM, DB Tsai wrote: > > Hi all, > > I'm benchmarking Logistic Reg

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread Evan Sparks
What is the number of nonzeros per row (and number of features) in the sparse case? We've hit some issues with Breeze sparse support in the past, but for sufficiently sparse data it's still pretty good. > On Apr 23, 2014, at 9:21 PM, DB Tsai wrote: > > Hi all, > > I'm benchmarking Logistic