From: Khaled Hammouda [mailto:khaled.hammo...@kik.com]
Sent: Thursday, June 16, 2016 11:45 AM
To: Mohammed Guller
Cc: user
Subject: Re: Spark SQL driver memory keeps rising
I'm using pyspark and running in YARN client mode. I managed to anonymize
the code a bit and pasted it below.
You'll notice that I don't collect any output in the driver; instead, the
data is written to Parquet directly. Also notice that I increased
spark.driver.maxResultSize to 10g because the job
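For reference, submitting a job in the mode described above might look like the following; this is a hedged sketch, and the script name is a placeholder, not the sender's actual file:

```shell
# Hypothetical submission matching the setup described: pyspark job in
# YARN client mode, with the driver result-size limit raised to 10g.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.driver.maxResultSize=10g \
  etl_job.py
```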
You will need to be more specific about how you are using these parameters.
Have you looked at the Spark web UI (default port 4040) to see the jobs and
stages? The amount of shuffle will also be shown there.
It also helps if you run jps on the OS and send the output of ps aux|grep <PID>
as well.
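Concretely, the diagnostic steps suggested above would run something like this, where <PID> is a placeholder for the driver's process id as reported by jps:

```shell
# List running JVMs with main class and arguments; a pyspark driver
# typically shows up as org.apache.spark.deploy.SparkSubmit.
jps -lm
# Inspect the driver process (the RSS column shows resident memory);
# substitute the real PID from the jps output.
ps aux | grep <PID>
```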
It would be hard to guess what could be going on without looking at the code.
It looks like the driver program goes into a long stop-the-world GC pause. This
should not happen on the machine running the driver program if all you are
doing is reading data from HDFS, performing a bunch of transformations
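One way to confirm whether the driver is really stuck in a stop-the-world pause is to enable GC logging on the driver JVM. A sketch, assuming the HotSpot 8-era GC flags that were standard with Spark at the time (the script name is again a placeholder):

```shell
# Hypothetical: turn on verbose GC logging for the driver so long
# stop-the-world pauses appear in gc.log with their durations.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --driver-java-options "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log" \
  etl_job.py
```

If gc.log shows back-to-back full GCs with little memory reclaimed, the driver heap itself is filling up, which points at something being accumulated on the driver despite the output going straight to Parquet.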