That means only

Thanks
Best Regards
On Sun, Sep 27, 2015 at 12:07 AM, N B <nb.nos...@gmail.com> wrote:

> Hello,
>
> Does anyone have an insight into what could be the issue here?
>
> Thanks
> Nikunj
>
>
> On Fri, Sep 25, 2015 at 10:44 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Hi Akhil,
>>
>> I do have 25 partitions being created. I have set the
>> spark.default.parallelism property to 25. The batch size is 30 seconds
>> and the block interval is 1200 ms, which also gives us roughly 25
>> partitions from the input stream. I can see the 25 partitions being
>> created and used in the Spark UI as well. It's just that those tasks
>> wait for cores on N1 to free up before being scheduled, while N2 sits
>> idle.
>>
>> The cluster configuration is:
>>
>> N1: 2 workers, 6 cores and 16 GB memory each => 12 cores on the first node.
>> N2: 2 workers, 8 cores and 16 GB memory each => 16 cores on the second node.
>>
>> That is a grand total of 28 cores. But it still does most of the
>> processing on N1 (divided among the 2 workers running there) while
>> almost completely disregarding N2 until the final stage, where the data
>> is written out to Elasticsearch. I don't understand why it isn't
>> distributing more partitions to N2 from the start and using it
>> effectively. Since there are only 12 cores on N1 and 25 total
>> partitions, shouldn't it send some of those partitions to N2 as well?
>>
>> Thanks
>> Nikunj
>>
>>
>> On Fri, Sep 25, 2015 at 5:28 AM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> The number of parallel tasks depends entirely on the number of
>>> partitions you have. If you are not getting enough partitions (you
>>> want partitions > total # of cores), then try doing a .repartition.
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Fri, Sep 25, 2015 at 1:44 PM, N B <nb.nos...@gmail.com> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I have a Spark Streaming application that reads from a Flume stream,
>>>> does quite a few maps/filters in addition to a few
>>>> reduceByKeyAndWindow and join operations, and then writes the
>>>> analyzed output to Elasticsearch inside a foreachRDD()...
>>>>
>>>> I recently started to run this on a 2-node cluster (Standalone) with
>>>> the driver program submitting directly to the Spark master on the
>>>> same host. The way I have divided the resources is as follows:
>>>>
>>>> N1: Spark master + driver + Flume + 2 Spark workers (16 GB + 6 cores
>>>> each worker)
>>>> N2: 2 Spark workers (16 GB + 8 cores each worker)
>>>>
>>>> The application works just fine, but it completely underuses N2. It
>>>> seems to use N1 (note that both executors on N1 get used) for all the
>>>> analytics, but when it comes to writing to Elasticsearch, it does
>>>> divide the data among all 4 executors, which then write to ES on a
>>>> separate host.
>>>>
>>>> I am puzzled as to why the data is not being distributed evenly
>>>> across all 4 executors from the get-go, and why it would only do so
>>>> in the final step of the pipeline, which seems counterproductive as
>>>> well.
>>>>
>>>> CPU usage on N1 is near its peak, while on N2 it is < 10% of overall
>>>> capacity.
>>>>
>>>> Any help in getting the resources more evenly utilized on N1 and N2
>>>> is welcome.
>>>>
>>>> Thanks in advance,
>>>> Nikunj
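---

For reference, a minimal sketch of the .repartition approach Akhil suggests above, applied to a Flume-backed stream like the one Nikunj describes. This is only an illustrative outline, not the actual application: the app name, host, port, and the bodies of the map/foreachRDD calls are hypothetical placeholders, and it assumes the Spark 1.x streaming APIs current at the time of this thread.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeToEsSketch {
  def main(args: Array[String]): Unit = {
    // 30-second batches, spark.default.parallelism = 25 and a 1200 ms
    // block interval, as described in the thread.
    val conf = new SparkConf()
      .setAppName("FlumeToEsSketch")                  // placeholder name
      .set("spark.default.parallelism", "25")
      .set("spark.streaming.blockInterval", "1200ms")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Push-based Flume receiver; host and port are placeholders.
    val flumeStream = FlumeUtils.createStream(ssc, "n1.example.com", 41414)

    // Spread the received blocks across every executor before the heavy
    // work, so that N2's 16 cores also get tasks. 28 = total cores on N1 + N2.
    val repartitioned = flumeStream.repartition(28)

    repartitioned
      .map(event => new String(event.event.getBody.array()))
      // ... maps/filters, reduceByKeyAndWindow and joins would go here ...
      .foreachRDD { rdd =>
        // ... write the analyzed output to Elasticsearch here ...
      }

    ssc.start()
    ssc.awaitTermination()
  }
}

Whether 28 is the right number is workload-dependent; Akhil's rule of thumb above is simply to keep partitions greater than the total core count so every executor has tasks to run.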
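And a small, equally hypothetical snippet for checking the per-batch partition count from the driver log instead of the Spark UI, assuming the same stream variable as in the sketch above:

    repartitioned.foreachRDD { rdd =>
      // Logs how many partitions each batch RDD actually has; compare this
      // against the number of cores you expect to keep busy.
      println(s"Batch RDD has ${rdd.partitions.length} partitions")
    }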