I'm having trouble with a Spark SQL job in which I run a series of SQL
transformations on data loaded from HDFS.

The first two stages load the data from HDFS without issues, but later
stages that require shuffles cause the driver memory to keep rising until
it is exhausted. At that point the driver stalls, the Spark UI stops
responding, and I can't even kill the driver with ^C; I have to forcibly
kill the process.

I think I'm allocating enough memory to the driver: spark.driver.memory is
44 GB and spark.driver.memoryOverhead is 4.5 GB. Before the shuffle starts,
the driver's memory usage sits at about 2.4 GB (the virtual memory size of
the driver process is about 50 GB). Once the stages that require a shuffle
begin, I can see the driver memory rise quickly to about 47 GB, and then
everything stops responding.
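
For concreteness, these are the driver memory settings, written here as
they would appear in spark-defaults.conf (equivalent to passing them as
--conf flags to spark-submit; all other settings omitted):

    spark.driver.memory            44g
    spark.driver.memoryOverhead    4608m

(4608m is just the 4.5 GB overhead expressed in MiB.)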

I'm not invoking any output operation that collects data at the driver. I
just call .cache() on a couple of DataFrames, since they are used more than
once in the SQL transformations, but those should be cached on the
executors, not on the driver. Then I write the final result to a Parquet
file, but the job never reaches that final stage.
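
To give a sense of the job's shape, here is a stripped-down sketch of what
the code does. The paths, view names, input format, and SQL are simplified
placeholders, not the real queries:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("sql-transformations")  // placeholder app name
      .getOrCreate()

    // First stages: load the inputs from HDFS (placeholder paths/format)
    val events = spark.read.parquet("hdfs:///data/events")
    val users  = spark.read.parquet("hdfs:///data/users")
    events.createOrReplaceTempView("events")
    users.createOrReplaceTempView("users")

    // Intermediate result reused by more than one later query,
    // so it gets cached (on the executors, as far as I understand)
    val joined = spark.sql(
      """SELECT e.*, u.country
        |FROM events e JOIN users u ON e.user_id = u.id""".stripMargin)
    joined.cache()
    joined.createOrReplaceTempView("joined")

    // Further SQL transformations; these are the shuffle-heavy stages
    // where the driver memory blows up
    val result = spark.sql(
      "SELECT country, COUNT(*) AS cnt FROM joined GROUP BY country")

    // Final step: write the result out (the job never gets this far)
    result.write.mode("overwrite").parquet("hdfs:///output/result")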

What could possibly be causing the driver memory to rise that fast when no
data is being collected at the driver?

Thanks,
Khaled
