Oh, regarding spark.sql.shuffle.partitions being 30k: I don't know. I inherited the workload from an engineer who is no longer around, and I am trying to make sense of things in general.
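For reference, a minimal sketch (Scala, assuming a running SparkSession such as the one in spark-shell; the parquet path is a placeholder) of checking the effective shuffle-partition setting, trying a lower baseline, and getting the partition counts suggested below:

    import org.apache.spark.sql.SparkSession

    // Assumes Spark 2.x; in spark-shell the `spark` session already exists.
    val spark = SparkSession.builder().getOrCreate()

    // What shuffle-partition setting is actually in effect (inherited value: 30000)?
    println(spark.conf.get("spark.sql.shuffle.partitions"))

    // Baseline experiment: fall back to the default of 200 and compare runtimes.
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    // Partition count of any given DataFrame/Dataset.
    val df = spark.read.parquet("/path/to/parquet")   // placeholder path
    println(df.rdd.getNumPartitions)

Note that spark.sql.shuffle.partitions only controls the number of partitions produced by shuffles (joins, aggregations), not the partitioning of the initial parquet read.
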
On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev <vitaliy.pisa...@biocatch.com> wrote:

> The quest is dual:
>
> - Increase utilisation, because cores cost money and I want to make sure that I fully utilise what I pay for. This is very blunt of course, because there is always I/O and at least some degree of skew. The bottom line is to do the same thing in the same time, but with fewer (and better utilised) resources.
> - Reduce runtime by increasing parallelism.
>
> While not the same, I am looking at these as two sides of the same coin.
>
> On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>
>> For that little data, I find spark.sql.shuffle.partitions = 30000 to be very high.
>>
>> Any reason for that high value?
>>
>> Do you have a baseline observation with the default value?
>>
>> Also, setting the job group and job description through the API and observing them in the UI will help you see which code snippets are running when utilization is low.
>>
>> Finally, high utilization does not equate to high efficiency.
>>
>> It's very likely that for your workload you may only need 16-128 executors.
>>
>> I would suggest getting the partition count for the various datasets/dataframes/RDDs in your code by using
>>
>> dataset.rdd.getNumPartitions
>>
>> I would also suggest doing a number of tests with different numbers of executors.
>>
>> But coming back to the objective behind your quest: are you trying to maximize utilization, hoping that high parallelism will reduce your total runtime?
>>
>> *From:* Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
>> *Date:* Thursday, November 15, 2018 at 10:07 AM
>> *To:* <jthak...@conversantmedia.com>
>> *Cc:* user <user@spark.apache.org>, David Markovitz <dudu.markov...@microsoft.com>
>> *Subject:* Re: How to address seemingly low core utilization on a spark workload?
>>
>> I am working with parquets, and the metadata reading there is quite fast as there are at most 16 files (a couple of gigs each).
>>
>> I find it very hard to answer the question "how many partitions do you have?": many spark operations do not preserve partitioning, and I have a lot of filtering and grouping going on.
>>
>> What I *can* say is that I set spark.sql.shuffle.partitions to 30,000.
>>
>> I am not worried that there are not enough partitions to keep the cores working. Having said that, I do see that high utilisation correlates heavily with shuffle read/write, whereas low utilisation correlates with no shuffling.
>>
>> This leads me to the conclusion that, compared to the amount of shuffling, the cluster is doing very little work.
>>
>> The question is what I can do about it.
>>
>> On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <jthak...@conversantmedia.com> wrote:
>>
>> Can you shed more light on what kind of processing you are doing?
>>
>> One common pattern where I have seen active core/executor utilization drop to zero is while reading ORC data, where the driver seems (I think) to be doing schema validation.
>>
>> In my case I would have hundreds of thousands of ORC data files, and there is dead silence for about 1-2 hours.
>>
>> I have tried providing a schema and disabling schema validation while reading the ORC data, but that does not seem to help (Spark 2.2.1).
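For what it's worth, a minimal sketch of what supplying an explicit schema on an ORC read can look like (Scala, assuming an existing SparkSession named spark; the path and the field names are placeholders, not the actual dataset):

    import org.apache.spark.sql.types._

    // Explicit schema so the driver does not have to infer one from the files.
    val orcSchema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("event_time", TimestampType, nullable = true),
      StructField("payload", StringType, nullable = true)
    ))

    val orcDf = spark.read
      .schema(orcSchema)          // skip schema inference
      .orc("/path/to/orc/data")   // placeholder path

As noted above, this did not seem to help on Spark 2.2.1, so whether it removes the long silent period likely depends on where the driver actually spends its time (listing files vs. reading footers).
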
>> And as you know, in most cases there is a linear relationship between the number of partitions in your data and the number of concurrently active executors.
>>
>> Another thing I would suggest is to use the following two API calls/methods: they will annotate the Spark stages and jobs in the Spark UI with what is being executed.
>>
>> SparkContext.setJobGroup(….)
>>
>> SparkContext.setJobDescription(….)
>>
>> *From:* Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
>> *Date:* Thursday, November 15, 2018 at 8:51 AM
>> *To:* user <user@spark.apache.org>
>> *Cc:* David Markovitz <dudu.markov...@microsoft.com>
>> *Subject:* How to address seemingly low core utilization on a spark workload?
>>
>> I have a workload that runs on a cluster of 300 cores.
>>
>> Below is a plot of the amount of active tasks over time during the execution of this workload:
>>
>> [image: plot of active tasks over time]
>>
>> What I deduce is that there are substantial intervals where the cores are heavily under-utilised.
>>
>> What actions can I take to:
>>
>> - Increase the efficiency (== core utilisation) of the cluster?
>> - Understand the root causes behind the drops in core utilisation?
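
As an aside, a minimal sketch of the job-annotation calls mentioned above (Scala, assuming an existing SparkSession named spark; the group id, descriptions, column names and paths are all made up for illustration):

    // Tag sections of the workload so jobs/stages show readable names in the Spark UI.
    val sc = spark.sparkContext

    sc.setJobGroup("load-and-filter", "Read parquet inputs and filter nulls")
    val filtered = spark.read.parquet("/path/to/parquet")
      .filter("some_column IS NOT NULL")

    sc.setJobDescription("Aggregate by key and write results")
    filtered.groupBy("some_column")
      .count()
      .write.mode("overwrite")
      .parquet("/path/to/output")

    sc.clearJobGroup()

With these set, the Spark UI shows the description next to each job, which makes it easier to line up the drops in active tasks with the code that was running at the time.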