The number of parallel tasks depends entirely on the number of partitions you
have. If you don't have enough partitions (ideally # of partitions > total #
of cores), try a .repartition().
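
For example, here is a minimal sketch in Scala (the host, port, batch
interval, and partition count below are placeholders, not taken from your
setup):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setAppName("FlumeRepartitionExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Blocks from a single Flume receiver land on the executor hosting
    // that receiver; repartition() shuffles them across the cluster so
    // the downstream maps/filters/reduceByKeyAndWindow use every core.
    val flumeStream = FlumeUtils.createStream(ssc, "n1-host", 41414)
    val spread = flumeStream.repartition(28) // e.g. total cores: 2x6 + 2x8

Repartitioning right after the receiver is usually where it pays off, since
everything downstream inherits the wider partitioning.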

Thanks
Best Regards

On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nos...@gmail.com> wrote:

> Hello all,
>
> I have a Spark streaming application that reads from a Flume stream, does
> quite a few maps/filters in addition to a few reduceByKeyAndWindow and join
> operations before writing the analyzed output to Elasticsearch inside a
> foreachRDD()...
>
> I recently started to run this on a 2 node cluster (Standalone) with the
> driver program directly submitting to Spark master on the same host. The
> way I have divided the resources is as follows:
>
> N1: spark Master + driver + flume + 2 spark workers (16gb + 6 cores each
> worker)
> N2: 2 spark workers (16 gb + 8 cores each worker).
>
> The application works just fine but it is completely underusing N2. It
> seems to use N1 (note that both executors on N1 get used) for all the
> analytics, but when it comes to writing to Elasticsearch, it does distribute
> the data across all 4 executors, which then write to ES on a separate host.
>
> I am puzzled as to why the data is not being distributed evenly into all
> 4 executors from the get-go, and why it would only do so in the final step
> of the pipeline, which seems counterproductive as well.
>
> CPU usage on N1 is near its peak, while on N2 it is < 10% of overall
> capacity.
>
> Any help in getting the resources more evenly utilized on N1 and N2 is
> welcome.
>
> Thanks in advance,
> Nikunj
>
>