The quest is dual:

   - Increase utilisation - because cores cost money and I want to make sure
   that I fully utilise what I pay for. This is very blunt of course,
   because there is always I/O and at least some degree of skew. The bottom
   line is to do the same work in the same time, but with fewer (and better
   utilised) resources.
   - Reduce runtime by increasing parallelism.

While not the same, I am looking at these as two sides of the same coin.





On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> For that little data, I find spark.sql.shuffle.partitions = 30000 to be
> very high.
>
> Any reason for that high value?
>
>
>
> Do you have a baseline observation with the default value?
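>
> For comparison, the value can be changed per session, e.g. (a minimal
> sketch, assuming a SparkSession named spark):
>
> spark.conf.set("spark.sql.shuffle.partitions", "200")   // the default, for a baseline run
> spark.conf.set("spark.sql.shuffle.partitions", "2000")  // an intermediate value to test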
>
>
>
> Also, setting the job group and job description through the API and
> observing them in the Spark UI will help you see which code snippets are
> running when you have low utilization.
>
>
>
> Finally, high utilization does not equate to high efficiency.
>
> It's very likely that for your workload, you may only need 16-128 executors.
>
> I would suggest getting the partition count for the various
> datasets/dataframes/rdds in your code by using
>
>
>
> dataset.rdd.getNumPartitions
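>
> For example, the partition count can be logged at a few points in the job
> (a minimal sketch, assuming a SparkSession named spark; the path and
> column name below are placeholders):
>
> val df = spark.read.parquet("/path/to/parquet")
> println(s"input partitions: ${df.rdd.getNumPartitions}")
> val grouped = df.groupBy("someKey").count()
> println(s"post-shuffle partitions: ${grouped.rdd.getNumPartitions}")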
>
>
>
> I would also suggest running a number of tests with different numbers of
> executors.
>
>
>
> But coming back to the objective behind your quest: are you trying to
> maximize utilization in the hope that higher parallelism will reduce your
> total runtime?
>
>
>
>
>
> *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> *Date: *Thursday, November 15, 2018 at 10:07 AM
> *To: *<jthak...@conversantmedia.com>
> *Cc: *user <user@spark.apache.org>, David Markovitz <
> dudu.markov...@microsoft.com>
> *Subject: *Re: How to address seemingly low core utilization on a spark
> workload?
>
>
>
> I am working with Parquet files, and the metadata reading there is quite
> fast, as there are at most 16 files (a couple of GB each).
>
>
>
> I find it very hard to answer the question "how many partitions do you
> have?", since many Spark operations do not preserve partitioning and I have
> a lot of filtering and grouping going on.
>
> What I *can* say is that I set spark.sql.shuffle.partitions to 30,000.
>
>
>
> I am not worried that there are not enough partitions to keep the cores
> working. Having said that, I do see that high utilisation correlates
> heavily with shuffle read/write, whereas low utilisation correlates with no
> shuffling.
>
> This leads me to the conclusion that compared to the amount of shuffling,
> the cluster is doing very little work.
>
>
>
> The question is what I can do about it.
>
>
>
> On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
> Can you shed more light on what kind of processing you are doing?
>
>
>
> One common pattern I have seen where active core/executor utilization
> drops to zero is while reading ORC data, when the driver seems (I think)
> to be doing schema validation.
>
> In my case I would have hundreds of thousands of ORC data files, and there
> would be dead silence for about 1-2 hours.
>
> I have tried providing a schema and disabling schema validation while
> reading the ORC data, but that does not seem to help (Spark 2.2.1).
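>
> For reference, supplying an explicit schema looks roughly like this (a
> sketch, assuming a SparkSession named spark; the fields and path are
> placeholders, and whether it avoids the driver-side pause depends on the
> Spark version):
>
> import org.apache.spark.sql.types._
>
> val orcSchema = StructType(Seq(        // placeholder schema
>   StructField("id", LongType),
>   StructField("value", StringType)))
>
> val df = spark.read.schema(orcSchema).orc("/path/to/orc")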
>
>
>
> And as you know, in most cases there is a linear relationship between the
> number of partitions in your data and the number of concurrently active
> executors.
>
>
>
> Another thing I would suggest is using the following two API calls/methods;
> they will annotate the Spark stages and jobs in the Spark UI with what is
> being executed.
>
> SparkContext.setJobGroup(...)
>
> SparkContext.setJobDescription(...)
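>
> For example (a sketch, assuming a SparkSession named spark; the group id,
> description, path and column names below are placeholders):
>
> // annotations apply to jobs triggered after these calls
> spark.sparkContext.setJobGroup("load-and-filter", "read parquet and filter")
> val filtered = spark.read.parquet("/path/to/parquet").filter("score > 0.5")
> spark.sparkContext.setJobDescription("aggregate by user")
> val aggregated = filtered.groupBy("userId").count()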
>
>
>
> *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
> *Date: *Thursday, November 15, 2018 at 8:51 AM
> *To: *user <user@spark.apache.org>
> *Cc: *David Markovitz <dudu.markov...@microsoft.com>
> *Subject: *How to address seemingly low core utilization on a spark
> workload?
>
>
>
> I have a workload that runs on a cluster of 300 cores.
>
> Below is a plot of the number of active tasks over time during the
> execution of this workload:
>
>
>
> [image: image.png]
>
>
>
> What I deduce is that there are substantial intervals where the cores are
> heavily under-utilised.
>
>
>
> What actions can I take to:
>
>    - Increase the efficiency (== core utilisation) of the cluster?
>    - Understand the root causes behind the drops in core utilisation?
>
>
