One potential cause is the optimizer being a little overzealous in deciding whether a table can be broadcast. Have you checked the UI or the query plan to see whether any steps include a BroadcastHashJoin? It's possible the optimizer thinks the table should fit in memory based on its size on disk, when in fact it cannot. In that case you might want to look at tuning spark.sql.autoBroadcastJoinThreshold.
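For example (just a sketch, assuming a SparkSession called "spark" and two DataFrames joined on a hypothetical "id" column; the paths are placeholders, not your actual data):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-check").getOrCreate()

    // Placeholder paths -- substitute your real Azure blob locations.
    val bigDf = spark.read.parquet("wasbs://container@account.blob.core.windows.net/big")
    val dimDf = spark.read.parquet("wasbs://container@account.blob.core.windows.net/dim")

    // Look for BroadcastHashJoin vs. SortMergeJoin in the physical plan.
    bigDf.join(dimDf, "id").explain()

    // Lower the threshold (in bytes), or set it to -1 to disable automatic
    // broadcast joins entirely while you rule this out. The default is 10MB.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

If disabling automatic broadcasts makes the stalls go away, you'll know where to focus.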
Another potential cause: at the step where the driver appears to be "hanging", it may be attempting to load a data source backed by a very large number of files. Spark maintains a cache of file paths for each data source in order to determine task splits, and we've seen the driver appear to hang and/or crash when loading thousands (or more) of tiny files per partition across a large number of partitions. There's a small sketch at the end of this mail showing how to check for that and compact the files.

Hope this helps.

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Thu, May 23, 2019 at 7:36 AM Ashic Mahtab <as...@live.com> wrote:

> Hi,
> We have a quite long-winded Spark application we inherited with many
> stages. When we run it on our Spark cluster, things start off well enough.
> Workers are busy, lots of progress is made, etc. However, 30 minutes into
> processing, we see CPU usage of the workers drop drastically. At this time,
> we also see that the driver is maxing out exactly one core (though we've
> given it more than one), and its RAM usage is creeping up. At this time,
> there are no logs coming out on the driver. Everything seems to stop, and
> then it suddenly starts working, and the workers start working again. The
> driver RAM doesn't go down, but flatlines. A few minutes later, the same
> thing happens again - the world seems to stop. However, the driver soon
> crashes with an out of memory exception.
>
> What could be causing this sort of behaviour on the driver? We don't have
> any collect() or similar functions in the code. We're reading in from Azure
> blobs, processing, and writing back to Azure blobs. Where should we start
> in trying to get to the bottom of this? We're running Spark 2.4.1 in a
> stand-alone cluster.
>
> Thanks,
> Ashic.
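P.S. Here's the sketch I mentioned for the second point: checking whether a source is backed by a huge number of tiny files and, if so, compacting it. The path and the coalesce count are placeholders, not recommendations for your specific data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("small-file-check").getOrCreate()

    // Placeholder path -- substitute the source the driver stalls on.
    val df = spark.read.parquet("wasbs://container@account.blob.core.windows.net/input")
    println(s"Backing files: ${df.inputFiles.length}")
    println(s"Input partitions: ${df.rdd.getNumPartitions}")

    // If the source is backed by tens of thousands of tiny files, compacting them
    // into fewer, larger files takes the file-listing pressure off the driver.
    df.coalesce(200)  // pick a count that yields roughly 128MB-1GB per output file
      .write
      .mode("overwrite")
      .parquet("wasbs://container@account.blob.core.windows.net/input-compacted")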