Re: MLBase status

2014-08-27 Thread Ameet Talwalkar
Hi Sameer, MLbase started out as a set of three ML components on top of Spark. The lowest level, MLlib, is now a rapidly growing component within Spark and is maintained by the Spark community. The two higher-level components (MLI and MLOpt) are experimental components that serve as testbeds for

Re: Spark MLlib vs BIDMach Benchmark

2014-07-27 Thread Ameet Talwalkar
To add to the last point, multimodel training is something we've explored as part of the MLbase Optimizer, and we've seen some nice speedups. This feature will be added to MLlib soon (not sure if it'll make it into the 1.1 release though). On Sat, Jul 26, 2014 at 11:27 PM, Matei Zaharia wrote:

Re: Gradient Boosting Decision Trees

2014-07-16 Thread Ameet Talwalkar
Hi Pedro, Yes, although they will probably not be included in the next release (since the code freeze is ~2 weeks away), GBM (and other ensembles of decision trees) are currently under active development. We're hoping they'll make it into the subsequent release. -Ameet On Wed, Jul 16, 2014 at

Re: MLlib feature request

2014-07-11 Thread Ameet Talwalkar
Hi Joseph, Thanks for your email. Many users are requesting this functionality. While it would be a stretch for them to appear in Spark 1.1, various people (including Manish Amde and folks at the AMPLab, Databricks and Alpine Labs) are actively working on developing ensembles of decision trees (rand

Re: KMeans code is rubbish

2014-07-11 Thread Ameet Talwalkar
Hi Wanda, As Sean mentioned, K-means is not guaranteed to find an optimal answer, even for seemingly simple toy examples. A common heuristic to deal with this issue is to run K-means multiple times and choose the best answer. You can do this by changing the runs parameter from the default value (1
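
The restart heuristic described in this message can be sketched in a few lines. What follows is a plain-Python illustration, not MLlib's implementation or API: the names kmeans_once and kmeans_best_of, the random initialization, and the toy data are all assumptions made for the example. It runs Lloyd's algorithm several times from different random starting centers and keeps the run with the lowest within-cluster sum of squared distances.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random initialization (hypothetical helper)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[idx].append(p)
        # Move each center to the mean of its cluster; keep the old center if the cluster is empty.
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Cost: sum of squared distances from each point to its nearest center.
    cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, cost


def kmeans_best_of(points, k, runs=10, seed=0):
    """Run k-means `runs` times and return the (centers, cost) pair with the lowest cost."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(runs)), key=lambda r: r[1])


# Toy data (an assumption for illustration): two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, cost = kmeans_best_of(points, k=2, runs=10)
```

Because each run starts from a different random initialization, a single run can settle in a poor local optimum. Taking the minimum cost over several runs can never do worse than the first run alone, at the price of proportionally more computation.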