The quest is dual:
- Increase utilisation - because cores cost money and I want to make sure that I fully utilise what I pay for. This is very blunt of course, because there is always I/O and at least some degree of skew. Bottom line: do the same thing in the same time, but with fewer (and better utilised) resources.
- Reduce runtime by increasing parallelism.

While not the same, I am looking at these as two sides of the same coin.

On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:

> For that little data, I find spark.sql.shuffle.partitions = 30000 to be very high.
> Any reason for that high value?
>
> Do you have a baseline observation with the default value?
>
> Also, enabling the job group and job description through the API and observing them in the Spark UI will help you identify which code snippets are running when you have low utilization.
>
> Finally, high utilization does not equate to high efficiency.
> It's very likely that for your workload, you may only need 16-128 executors.
> I would suggest getting the partition count for the various datasets/dataframes/RDDs in your code by using
>
> dataset.rdd.getNumPartitions
>
> I would also suggest doing a number of tests with different numbers of executors too.
>
> But coming back to the objective behind your quest – are you trying to maximize utilization, hoping that high parallelism will reduce your total runtime?
>
> *From:* Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> *Date:* Thursday, November 15, 2018 at 10:07 AM
> *To:* <jthak...@conversantmedia.com>
> *Cc:* user <user@spark.apache.org>, David Markovitz <dudu.markov...@microsoft.com>
> *Subject:* Re: How to address seemingly low core utilization on a spark workload?
>
> I am working with parquets, and the metadata reading there is quite fast as there are at most 16 files (a couple of gigs each).
>
> I find it very hard to answer the question "how many partitions do you have?": many spark operations do not preserve partitioning, and I have a lot of filtering and grouping going on.
> What I *can* say is that I set spark.sql.shuffle.partitions to 30,000.
>
> I am not worried that there are not enough partitions to keep the cores working. Having said that, I do see that high utilisation correlates heavily with shuffle read/write, whereas low utilisation correlates with no shuffling.
> This leads me to the conclusion that, compared to the amount of shuffling, the cluster is doing very little work.
>
> The question is what I can do about it.
>
> On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
> Can you shed more light on what kind of processing you are doing?
>
> One common pattern that I have seen for active core/executor utilization dropping to zero is while reading ORC data, where the driver seems (I think) to be doing schema validation.
> In my case I would have hundreds of thousands of ORC data files, and there is dead silence for about 1-2 hours.
> I have tried providing a schema and disabling schema validation while reading the ORC data, but that does not seem to help (Spark 2.2.1).
>
> And as you know, in most cases there is a linear relationship between the number of partitions in your data and the concurrently active executors.
>
> Another thing I would suggest is to use the following two API calls/methods – they will annotate the Spark stages and jobs with what is being executed in the Spark UI.
>
> SparkContext.setJobGroup(….)
> SparkContext.setJobDescription(….)
>
> *From:* Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> *Date:* Thursday, November 15, 2018 at 8:51 AM
> *To:* user <user@spark.apache.org>
> *Cc:* David Markovitz <dudu.markov...@microsoft.com>
> *Subject:* How to address seemingly low core utilization on a spark workload?
>
> I have a workload that runs on a cluster of 300 cores.
> Below is a plot of the number of active tasks over time during the execution of this workload:
>
> [image: image.png]
>
> What I deduce is that there are substantial intervals where the cores are heavily under-utilised.
>
> What actions can I take to:
> - Increase the efficiency (== core utilisation) of the cluster?
> - Understand the root causes behind the drops in core utilisation?
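
A minimal sketch of how the pieces suggested in this thread fit together (Scala, Spark 2.x): labelling work with setJobGroup/setJobDescription so it shows up clearly in the Spark UI, checking partition counts with getNumPartitions, and starting from the default spark.sql.shuffle.partitions (200) as a baseline before tuning. The input path and column names below are placeholders, not anything from the original workload.

    import org.apache.spark.sql.SparkSession

    object UtilizationProbe {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("utilization-probe").getOrCreate()
        val sc = spark.sparkContext

        // Start from the default shuffle parallelism (200) as a baseline,
        // then tune up or down, rather than jumping straight to 30,000.
        spark.conf.set("spark.sql.shuffle.partitions", "200")

        // Label the work so the corresponding jobs/stages are easy to find in the Spark UI.
        sc.setJobGroup("load-parquet", "Read input parquet files")
        val input = spark.read.parquet("/path/to/input")   // placeholder path
        println(s"input partitions: ${input.rdd.getNumPartitions}")

        sc.setJobGroup("aggregate", "Filter and group the input")
        val aggregated = input
          .filter("some_flag = true")                      // placeholder columns
          .groupBy("some_key")
          .count()
        println(s"post-shuffle partitions: ${aggregated.rdd.getNumPartitions}")

        aggregated.write.mode("overwrite").parquet("/path/to/output")
        sc.clearJobGroup()
        spark.stop()
      }
    }

With the job group and description set, a drop in active tasks on the timeline can be tied back to a named piece of the pipeline, which is a first step towards answering the "root causes" question in the original post.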