The number of parallel tasks depends entirely on the number of partitions you have. If you aren't getting enough partitions (you want at least as many partitions as the total # of cores), then try doing a .repartition().
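Something along these lines (a rough sketch; I'm assuming the receiver-based FlumeUtils.createStream here, with a placeholder host/port and 28 partitions to match the 12 + 16 cores in your setup):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("FlumeRepartitionExample")
val ssc = new StreamingContext(conf, Seconds(10))

// A receiver-based stream lands on a single executor, so its blocks all
// start out on the receiver's node. Repartitioning right after the receive
// spreads them across the whole cluster before the heavy transformations run.
val flumeStream = FlumeUtils.createStream(ssc, "localhost", 41414)
val repartitioned = flumeStream.repartition(28)

// ... maps/filters/reduceByKeyAndWindow/joins on `repartitioned` ...

ssc.start()
ssc.awaitTermination()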
Thanks
Best Regards

On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nos...@gmail.com> wrote:

> Hello all,
>
> I have a Spark streaming application that reads from a Flume stream, does
> quite a few maps/filters in addition to a few reduceByKeyAndWindow and join
> operations before writing the analyzed output to Elasticsearch inside a
> foreachRDD()...
>
> I recently started to run this on a 2-node cluster (Standalone) with the
> driver program directly submitting to the Spark master on the same host.
> The way I have divided the resources is as follows:
>
> N1: Spark master + driver + Flume + 2 Spark workers (16 GB + 6 cores each
> worker)
> N2: 2 Spark workers (16 GB + 8 cores each worker).
>
> The application works just fine, but it is completely underusing N2. It
> seems to use N1 (note that both executors on N1 get used) for all the
> analytics, but when it comes to writing to Elasticsearch, it does divide
> the data among all 4 executors, which then write to ES on a separate host.
>
> I am puzzled as to why the data is not being distributed evenly into all 4
> executors from the get-go, and why it would only do so in the final step
> of the pipeline, which seems counterproductive as well.
>
> CPU usage on N1 is near the peak while on N2 it is < 10% of overall
> capacity.
>
> Any help in getting the resources more evenly utilized on N1 and N2 is
> welcome.
>
> Thanks in advance,
> Nikunj