Dear James,
- Check the Spark documentation for the actions that return data to the
driver. collect() returns the entire dataset, which is usually what
causes the problem; take(n) and reduce() are also actions, but they send
back far less data (n elements and a single aggregated value,
respectively). Before executing collect(), find out the size of your
RDD/DataFrame (see the sketch below).
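For example, a rough sketch of such a check (the path, threshold, and
app name are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("size-check").getOrCreate()

    // hypothetical input path
    val df = spark.read.parquet("hdfs:///user/james/input")

    // count() is also an action, but it returns only a single Long to the driver
    val n = df.count()

    if (n < 100000L) {
      val all = df.collect()     // acceptable only when the whole result fits in driver memory
    } else {
      val sample = df.take(100)  // take(100) ships just 100 rows back to the driver
    }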
- I do not understand the phrase "hdfs directly from the executor". You
can specify an HDFS file as your input, and you can also use HDFS to
store your output (see the sketch below).
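For example, a rough sketch of HDFS input and output (the paths and the
column name are hypothetical). Note that when you write a DataFrame out,
each executor writes its own partitions to HDFS in parallel; nothing is
gathered on the driver:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-io").getOrCreate()

    // read input directly from HDFS (hypothetical path)
    val df = spark.read.parquet("hdfs:///user/james/input")

    // the executors write the partition files straight to HDFS;
    // no collect() is involved, so driver memory is not a bottleneck
    df.filter("value > 0").write.parquet("hdfs:///user/james/output")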
regards,
Apostolos
On 07/09/2018 05:04 PM, James Starks wrote:
I have a Spark job that reads data from a database. By increasing the
submit parameter '--driver-memory 25g' the job works without a problem
locally, but not in the prod environment because the prod master does
not have enough capacity.
So I have a few questions:
- What functions, such as collect(), would cause the data to be sent
back to the driver program?
My job so far merely uses `as`, `filter`, and `map`.
- Is it possible to write data (in parquet format, for instance) to
hdfs directly from the executor? If so, how can I do that (any code
snippet, doc for reference, or keyword to search for, since I can't
find anything with e.g. `spark direct executor hdfs write`)?
Thanks
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org