On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira <h...@inesctec.pt> wrote:
> Hi,
>
> I am currently experimenting with linear regression (SGD) (Spark + MLlib,
> ver. 1.2). At this point I need to fine-tune the hyper-parameters, which I
> do (for now) by an exhaustive grid search over the step size and the number
> of iterations. Currently I am on a dual core that acts as a master (local
> mode for now, but I will be adding Spark workers later). In order to
> maximize throughput I need to run each execution of the linear regression
> algorithm in parallel.
>
How big is your dataset? If it is small or medium-sized, you might get
better performance by broadcasting the entire dataset and using a
single-machine solver on each worker.

> According to the documentation it seems like parallel jobs may be scheduled
> if they are executed in separate threads [1]. So this brings me to my first
> question: does this mean I am CPU-bound by the Spark master? In other
> words, is the maximum number of jobs equal to the maximum number of threads
> of the OS?
>

We use the driver to collect model updates. Increasing the number of
parallel jobs also increases the driver load, for both communication and
computation. I don't think you need to worry much about the max number of
threads, which is usually much larger than the number of parallel jobs we
can actually run.

> I searched the mailing list but did not find anything regarding MLlib
> itself. I even peeked into the new MLlib API that uses pipelines and has
> support for parameter tuning. However, it looks like each job (instance of
> the learning algorithm) is executed in sequence. Can anyone confirm this?
> This brings me to my 2nd question: is there any example that shows how one
> can execute MLlib algorithms as parallel jobs?
>

The new API is not optimized for performance yet. There is an example here
for k-means:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L393

> Finally, is there any general technique I can use to execute an algorithm
> in a distributed manner using Spark? More specifically, I would like to
> have several MLlib algorithms run in parallel. Can anyone show me an
> example of sorts to do this?
>
> TIA.
> Hugo F.
>
> [1] https://spark.apache.org/docs/1.2.0/job-scheduling.html
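
To make this concrete, here is a rough sketch of the thread-based approach
described in [1] (not an official example; the input path and the grid
values are placeholders I made up). Each hyper-parameter combination is
trained inside its own future, so each training run is submitted to the
scheduler as a separate job from a separate thread:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

object ParallelGridSearch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("grid-search"))

    // Placeholder path; any RDD[LabeledPoint] works. Cache it so the
    // concurrent jobs do not re-read the input.
    val data = MLUtils.loadLibSVMFile(sc, "data/regression.txt").cache()

    // Made-up grid values, for illustration only.
    val grid = for {
      stepSize <- Seq(0.01, 0.1, 1.0)
      numIters <- Seq(50, 100, 200)
    } yield (stepSize, numIters)

    // One future per combination. Each train() call submits ordinary Spark
    // jobs, so the scheduler sees concurrent jobs from separate threads
    // (see [1]; spark.scheduler.mode=FAIR may help share resources).
    val jobs = grid.map { case (stepSize, numIters) =>
      Future {
        val model = LinearRegressionWithSGD.train(data, numIters, stepSize)
        // Training MSE, just to pick a winner; use a held-out set in
        // practice.
        val mse = data.map { p =>
          val e = model.predict(p.features) - p.label
          e * e
        }.mean()
        ((stepSize, numIters), mse)
      }
    }

    val results = Await.result(Future.sequence(jobs), Duration.Inf)
    println("best (stepSize, numIters): " + results.minBy(_._2))

    sc.stop()
  }
}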
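
And a sketch of the broadcast idea from the top of this mail, assuming the
dataset fits in a single machine's memory. The local solver below (localSGD)
is a naive dense SGD written inline for illustration; it is not an MLlib
API. Each grid point becomes one task doing one purely local solve, so the
driver does no per-iteration aggregation at all:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Serializable so the task closure below may reference localSGD.
object BroadcastGridSearch extends Serializable {
  // Naive dense-feature linear SGD, for illustration only.
  def localSGD(data: Array[LabeledPoint], numIters: Int,
      stepSize: Double): Array[Double] = {
    val n = data.head.features.size
    val w = Array.fill(n)(0.0)
    for (_ <- 1 to numIters; p <- data) {
      val x = p.features.toArray
      var dot = 0.0
      var i = 0
      while (i < n) { dot += w(i) * x(i); i += 1 }
      val err = dot - p.label
      i = 0
      while (i < n) { w(i) -= stepSize * err * x(i); i += 1 }
    }
    w
  }

  def run(sc: SparkContext, data: RDD[LabeledPoint],
      grid: Seq[(Double, Int)]): Array[((Double, Int), Array[Double])] = {
    // Ship the whole dataset to every executor once.
    val bc = sc.broadcast(data.collect())
    // One partition per grid point => one local solve per task.
    sc.parallelize(grid, grid.size).map { case (stepSize, numIters) =>
      ((stepSize, numIters), localSGD(bc.value, numIters, stepSize))
    }.collect()
  }
}

Whether this beats submitting distributed jobs depends on the data size;
past some point the broadcast itself dominates, so it only pays off for
small or medium-sized datasets.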