One potential cause is the optimizer being a little overzealous in deciding whether a table can be broadcast. Have you checked the UI or the query plan to see whether any steps include a BroadcastHashJoin? It's possible the optimizer thinks the table should fit in memory based on its size on disk, when in fact it cannot. In that case you might want to look at tuning spark.sql.autoBroadcastJoinThreshold.
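For example (just a sketch, assuming a SparkSession called "spark" and two DataFrames joined on a hypothetical "id" column; the paths are placeholders, not your actual data):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-check").getOrCreate()

    // Placeholder paths -- substitute your real Azure blob locations.
    val bigDf = spark.read.parquet("wasbs://container@account.blob.core.windows.net/big")
    val dimDf = spark.read.parquet("wasbs://container@account.blob.core.windows.net/dim")

    // Look for BroadcastHashJoin vs. SortMergeJoin in the physical plan.
    bigDf.join(dimDf, "id").explain()

    // Lower the threshold (in bytes), or set it to -1 to disable automatic
    // broadcast joins entirely while you rule this out. The default is 10MB.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

If disabling automatic broadcasts makes the stalls go away, you'll know where to focus.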
Another potential cause: at the step where the driver appears to be "hanging", it may be attempting to load a data source backed by a very large number of files. Spark maintains a cache of file paths for each data source in order to determine task splits, and we've seen the driver appear to hang and/or crash when loading thousands (or more) of tiny files per partition across a large number of partitions. There's a small sketch at the end of this mail showing how to check for that and compact the files.

Hope this helps.

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com


On Thu, May 23, 2019 at 7:36 AM Ashic Mahtab <as...@live.com> wrote:

> Hi,
> We have a quite long-winded Spark application we inherited with many
> stages. When we run it on our Spark cluster, things start off well enough.
> Workers are busy, lots of progress is made, etc. However, 30 minutes into
> processing, we see CPU usage of the workers drop drastically. At this time,
> we also see that the driver is maxing out exactly one core (though we've
> given it more than one), and its RAM usage is creeping up. At this time,
> there are no logs coming out on the driver. Everything seems to stop, and
> then it suddenly starts working, and the workers start working again. The
> driver RAM doesn't go down, but flatlines. A few minutes later, the same
> thing happens again - the world seems to stop. However, the driver soon
> crashes with an out of memory exception.
>
> What could be causing this sort of behaviour on the driver? We don't have
> any collect() or similar functions in the code. We're reading in from Azure
> blobs, processing, and writing back to Azure blobs. Where should we start
> in trying to get to the bottom of this? We're running Spark 2.4.1 in a
> stand-alone cluster.
>
> Thanks,
> Ashic.
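P.S. Here's the sketch I mentioned for the second point: checking whether a source is backed by a huge number of tiny files and, if so, compacting it. The path and the coalesce count are placeholders, not recommendations for your specific data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("small-file-check").getOrCreate()

    // Placeholder path -- substitute the source the driver stalls on.
    val df = spark.read.parquet("wasbs://container@account.blob.core.windows.net/input")
    println(s"Backing files: ${df.inputFiles.length}")
    println(s"Input partitions: ${df.rdd.getNumPartitions}")

    // If the source is backed by tens of thousands of tiny files, compacting them
    // into fewer, larger files takes the file-listing pressure off the driver.
    df.coalesce(200)  // pick a count that yields roughly 128MB-1GB per output file
      .write
      .mode("overwrite")
      .parquet("wasbs://container@account.blob.core.windows.net/input-compacted")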