On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira <h...@inesctec.pt> wrote:
> Hi,
>
> I am currently experimenting with linear regression (SGD) (Spark + MLlib,
> ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I
> do this (for now) by an exhaustive grid search of the step size and the
> number of iterations. Currently I am on a dual-core machine that acts as the
> master (local mode for now, but I will be adding Spark workers later). In
> order to maximize throughput I need to run the individual executions of the
> linear regression algorithm in parallel.
>

How big is your dataset? If it is small or medium-sized, you might get better
performance by broadcasting the entire dataset and using a single-machine
solver on each worker.
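
A minimal sketch of that pattern in Scala, assuming the whole dataset fits
comfortably in memory on every executor. Note that `localSolve` below is a
toy stand-in for whatever single-machine solver you supply (e.g. one built
on Breeze); it is not an MLlib API:

    import org.apache.spark.SparkContext

    // Toy in-memory SGD for least squares -- a placeholder for any
    // single-machine solver; not part of MLlib.
    def localSolve(data: Array[(Double, Array[Double])],
                   step: Double, iters: Int): Array[Double] = {
      var w = Array.fill(data.head._2.length)(0.0)
      for (_ <- 1 to iters; (y, x) <- data) {
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y
        w = w.zip(x).map { case (wi, xi) => wi - step * err * xi }
      }
      w
    }

    def gridSearch(sc: SparkContext,
                   localData: Array[(Double, Array[Double])]) = {
      // Ship the full dataset to every executor once.
      val bcData = sc.broadcast(localData)
      // Distribute the hyper-parameter grid instead of the data; each
      // task fits one model locally against the broadcast copy.
      val grid = for {
        step  <- Seq(0.01, 0.1, 1.0)
        iters <- Seq(50, 100)
      } yield (step, iters)
      sc.parallelize(grid, grid.size).map { case (step, iters) =>
        ((step, iters), localSolve(bcData.value, step, iters))
      }.collect()
    }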

> According to the documentation it seems like parallel jobs may be scheduled
> if they are executed in separate threads [1]. So this brings me to my first
> question: does this mean I am CPU-bound by the Spark master? In other words,
> is the maximum number of jobs equal to the maximum number of threads of the OS?
>

We use the driver to collect model updates. Increasing the number of parallel
jobs also increases the driver load, both for communication and for
computation. I don't think you need to worry much about the maximum number of
threads, which is usually much larger than the number of parallel jobs we can
actually run.
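
For reference, submitting jobs from separate threads is just a matter of
wrapping each train() call in a Future. A rough sketch (the grid values are
made up; enabling the FAIR scheduler lets the jobs share the cluster):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    def fitGrid(training: RDD[LabeledPoint]) = {
      training.cache() // every job reuses the same cached data
      val grid = for {
        step  <- Seq(0.01, 0.1, 1.0)
        iters <- Seq(50, 100)
      } yield (step, iters)
      // Each Future calls train() on its own thread, so each model is
      // a separate Spark job and the scheduler may run them concurrently.
      val futures = grid.map { case (step, iters) =>
        Future(((step, iters),
          LinearRegressionWithSGD.train(training, iters, step)))
      }
      futures.map(Await.result(_, Duration.Inf))
    }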

> I searched the mailing list but did not find anything regarding MLlib
> itself. I even peeked into the new MLlib API that uses pipelines and has
> support for parameter tuning. However, it looks like each job (instance of
> the learning algorithm) is executed in sequence. Can anyone confirm this?
> This brings me to my 2nd question: is there any example that shows how one
> can execute MLlib algorithms as parallel jobs?
>

The new API is not optimized for performance yet. There is an example
here for k-means:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L393
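
From the caller's side, the parallelism there is exposed through the `runs`
argument of KMeans.train, which carries several random initializations
through the same passes over the data instead of launching one job per run.
A quick sketch (the path and parameter values are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assume `sc` is your SparkContext.
    val data = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Arguments: data, k = 3, maxIterations = 20, runs = 5.
    val model = KMeans.train(data, 3, 20, 5)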

> Finally, is there any general technique I can use to execute an algorithm in
> a distributed manner using Spark? More specifically, I would like to have
> several MLlib algorithms run in parallel. Can anyone show me an example of
> how to do this?
>
> TIA.
> Hugo F.
>
> [1] https://spark.apache.org/docs/1.2.0/job-scheduling.html
