Oh, regarding shuffle.partitions being 30k, I don't know. I inherited the
workload from an engineer who is no longer around and am trying to make
sense of things in general.

On Thu, Nov 15, 2018 at 7:26 PM Vitaliy Pisarev <
vitaliy.pisa...@biocatch.com> wrote:

> The quest is twofold:
>
>
>    - Increase utilisation, because cores cost money and I want to make
>    sure that I fully utilise what I pay for. This is very blunt of course,
>    because there is always I/O and at least some degree of skew. The bottom
>    line is to do the same work in the same time but with fewer (but better
>    utilised) resources.
>    - Reduce runtime by increasing parallelism.
>
> While not the same, I am looking at these as two sides of the same coin.
>
>
>
>
>
> On Thu, Nov 15, 2018 at 6:58 PM Thakrar, Jayesh <
> jthak...@conversantmedia.com> wrote:
>
>> For that little data, I find spark.sql.shuffle.partitions = 30000 to be
>> very high.
>>
>> Any reason for that high value?
>>
>>
>>
>> Do you have a baseline observation with the default value?
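>>
>> For a baseline run you could set it back to the default, something along
>> these lines (just a sketch; 200 is the Spark default):
>>
>>     // reset shuffle parallelism to the default before running the job
>>     spark.conf.set("spark.sql.shuffle.partitions", "200")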
>>
>>
>>
>> Also, setting the job group and job description through the API and
>> observing them in the Spark UI will help you see which parts of your code
>> are running during the periods of low utilization.
>>
>>
>>
>> Finally, high utilization does not equate to high efficiency.
>>
>> It's very likely that for your workload, you may only need 16-128
>> executors.
>>
>> I would suggest getting the partition count for the various
>> Datasets/DataFrames/RDDs in your code by using
>>
>>
>>
>> dataset.rdd.getNumPartitions
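>>
>> For example, something like this around the intermediate datasets (a
>> sketch; events, score and userId are placeholders for your own names):
>>
>>     import spark.implicits._   // for the $"col" syntax
>>
>>     // hypothetical DataFrames -- substitute the ones from your own job
>>     val filtered = events.filter($"score" > 0.5)
>>     println(s"filtered: ${filtered.rdd.getNumPartitions} partitions")
>>
>>     val grouped = filtered.groupBy($"userId").count()
>>     println(s"grouped:  ${grouped.rdd.getNumPartitions} partitions")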
>>
>>
>>
>> I would also suggest running a number of tests with different numbers of
>> executors.
>>
>>
>>
>> But coming back to the objective behind your quest: are you trying to
>> maximize utilization in the hope that high parallelism will reduce your
>> total runtime?
>>
>>
>>
>>
>>
>> *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
>> *Date: *Thursday, November 15, 2018 at 10:07 AM
>> *To: *<jthak...@conversantmedia.com>
>> *Cc: *user <user@spark.apache.org>, David Markovitz <
>> dudu.markov...@microsoft.com>
>> *Subject: *Re: How to address seemingly low core utilization on a spark
>> workload?
>>
>>
>>
>> I am working with Parquet files, and the metadata reading there is quite
>> fast since there are at most 16 files (a couple of gigs each).
>>
>>
>>
>> I find it very hard to answer the question "how many partitions do you
>> have?", since many Spark operations do not preserve partitioning and I have
>> a lot of filtering and grouping going on.
>>
>> What I *can* say is that I set spark.sql.shuffle.partitions to
>> 30,000.
>>
>>
>>
>> I am not worried that there are not enough partitions to keep the cores
>> working. Having said that, I do see that high utilisation correlates
>> heavily with shuffle read/write, whereas low utilisation correlates with no
>> shuffling.
>>
>> This leads me to the conclusion that compared to the amount of shuffling,
>> the cluster is doing very little work.
>>
>>
>>
>> The question is what I can do about it.
>>
>>
>>
>> On Thu, Nov 15, 2018 at 5:29 PM Thakrar, Jayesh <
>> jthak...@conversantmedia.com> wrote:
>>
>> Can you shed more light on what kind of processing you are doing?
>>
>>
>>
>> One common pattern I have seen where active core/executor utilization
>> drops to zero is while reading ORC data, when the driver seems (I think)
>> to be doing schema validation.
>>
>> In my case I would have hundreds of thousands of ORC data files and there
>> is dead silence for about 1-2 hours.
>>
>> I have tried providing a schema and disabling schema validation while
>> reading the ORC data, but that does not seem to help (Spark 2.2.1).
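>>
>> Roughly what I mean by providing a schema up front (a sketch, with a
>> made-up schema and path):
>>
>>     import org.apache.spark.sql.types._
>>
>>     // made-up field names -- replace with the real ORC schema
>>     val orcSchema = StructType(Seq(
>>       StructField("id", LongType),
>>       StructField("eventTime", TimestampType),
>>       StructField("payload", StringType)))
>>
>>     // supplying the schema avoids inferring it from the files themselves
>>     val df = spark.read.schema(orcSchema).orc("/path/to/orc")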
>>
>>
>>
>> And as you know, in most cases there is a linear relationship between the
>> number of partitions in your data and the number of concurrently active executors.
>>
>>
>>
>> Another thing I would suggest is to use the following two API calls/methods;
>> they will annotate the Spark stages and jobs in the Spark UI with what is
>> being executed.
>>
>> SparkContext.setJobGroup(….)
>>
>> SparkContext.setJobDescription(….)
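>>
>> For example (a sketch; the group id, descriptions and paths are
>> placeholders):
>>
>>     val sc = spark.sparkContext
>>
>>     // jobs submitted after these calls are tagged in the Spark UI
>>     sc.setJobGroup("nightly-etl", "load and aggregate events")
>>
>>     sc.setJobDescription("read parquet and count by user")
>>     spark.read.parquet("/data/events")
>>       .groupBy("userId")
>>       .count()
>>       .write.parquet("/data/user_counts")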
>>
>>
>>
>> *From: *Vitaliy Pisarev <vitaliy.pisa...@biocatch.com>
>> *Date: *Thursday, November 15, 2018 at 8:51 AM
>> *To: *user <user@spark.apache.org>
>> *Cc: *David Markovitz <dudu.markov...@microsoft.com>
>> *Subject: *How to address seemingly low core utilization on a spark
>> workload?
>>
>>
>>
>> I have a workload that runs on a cluster of 300 cores.
>>
>> Below is a plot of the amount of active tasks over time during the
>> execution of this workload:
>>
>>
>>
>> [image: image.png]
>>
>>
>>
>> What I deduce is that there are substantial intervals where the cores are
>> heavily under-utilised.
>>
>>
>>
>> What actions can I take to:
>>
>>    - Increase the efficiency (== core utilisation) of the cluster?
>>    - Understand the root causes behind the drops in core utilisation?
>>
>>
