I think I understand what you're saying, but whether you're
"over-provisioning" or not depends on the nature of your workload, your
system's resources, and how Spark decides to spawn task threads inside
executor processes.
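
For reference, these are the settings that determine how many task threads
Spark itself runs concurrently inside each executor JVM (the values below
are purely illustrative, e.g. for a spark-shell session, and are not from
my job):

// Illustrative values only: the knobs that set the number of concurrent
// task slots (threads) per executor.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("task-slot-example")
  .set("spark.executor.cores", "4")  // 4 cores' worth of task slots per executor
  .set("spark.task.cpus", "1")       // each task claims 1 of those cores
// => up to 4 tasks run concurrently per executor; threads that a task
//    spawns on its own (e.g. a thread pool) are outside this accounting.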

As I concluded in the post, if you're doing CPU-bound work, then stick with
.map and don't bother with .mapPartitions.  If you have an asynchronous
API, you might be able to use the .mapPartitions strategy to parallelize
(in combination with core.async/FRP/etc.).  And if you're doing I/O-bound
work as I was, then a thread pool inside .mapPartitions can work well, and
it shouldn't be too problematic because those threads spend most of their
time blocked on I/O rather than competing for CPU time.  A rough sketch of
that last case is below.
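
This is not the code from my blog post (that's in Clojure); it's just a
minimal Scala sketch of the same idea, where fetchRecord, the pool size of
3, and the dummy record IDs are all placeholders:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}

object IoBoundExample {
  // Hypothetical stand-in for a blocking I/O call (HTTP request, DB lookup, ...)
  def fetchRecord(id: String): String = {
    Thread.sleep(100)
    s"result-for-$id"
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("io-bound-example"))
    val ids = sc.parallelize(1 to 100).map(_.toString)

    val results = ids.mapPartitions { iter =>
      // One small pool per partition task, sized for I/O concurrency,
      // not for the executor's CPU cores.
      val pool = Executors.newFixedThreadPool(3)
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
      // Materialize all the futures first so the pool stays busy,
      // then block for the results.
      val futures = iter.map(id => Future(fetchRecord(id))).toList
      val out = futures.map(f => Await.result(f, Duration.Inf))
      pool.shutdown()
      out.iterator
    }

    results.collect().foreach(println)
    sc.stop()
  }
}

The detail that matters is that the futures are all created up front so
the pool's threads stay busy, and the pool is sized for the downstream I/O,
not for the executor's core count.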

Unless I'm missing something about how Spark provisions threads inside an
executor, I'm not sure how other tasks would get preempted by threads doing
blocking I/O (those threads are I/O-bound, not CPU-bound, so they spend
most of their time waiting rather than running).

What I can attest to is that an ETL job whose runtime was dominated by the
I/O work used to consistently take 2h 45m with .map, and after switching to
this multi-threaded approach with a pool of 3 threads, it consistently took
approx. 45-50 mins.  That lines up with the naive expectation: 3 threads =>
roughly 1/3 the runtime (165 min / 3 = 55 min).  Increasing the size of the
thread pool also led to successful runs without any performance problems.

-- Elango

On Mon, Jan 25, 2016 at 11:12 PM, Igor Berman <igor.ber...@gmail.com> wrote:

> IMHO, you are making a mistake.
> Spark manages tasks and cores internally. When you open new threads inside
> an executor, you are "over-provisioning" the executor (e.g. tasks on other
> cores will be preempted)
>
>
>
> On 26 January 2016 at 07:59, Elango Cheran <elango.che...@gmail.com>
> wrote:
>
>> Hi everyone,
>> I've gone through the effort of figuring out how to modify a Spark job to
>> have an operation become multi-threaded inside an executor.  I've written
>> up an explanation of what worked, what didn't work, and why:
>>
>>
>> http://www.elangocheran.com/blog/2016/01/using-clojure-to-create-multi-threaded-spark-jobs/
>>
>> I think the ideas there should be applicable generally -- which would
>> include Scala and Java since the JVM is genuinely multi-threaded -- and
>> therefore may be of interest to others.  I will need to convert this code
>> to Scala for personal requirements in the near future, anyway.
>>
>> I hope this helps.
>>
>> -- Elango
>>
>
>
