Dear James,
- Check the Spark documentation for the actions that return data to the
driver. collect() returns the entire dataset, which is usually what
causes the problem; take(n) and reduce() are also actions, but they send
back far less data (n elements and a single aggregated value,
respectively). Before executing collect(), find out the size of your
RDD/DataFrame (see the sketch below).
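For example, a rough sketch of such a check (the path, threshold, and
app name are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("size-check").getOrCreate()

    // hypothetical input path
    val df = spark.read.parquet("hdfs:///user/james/input")

    // count() is also an action, but it returns only a single Long to the driver
    val n = df.count()

    if (n < 100000L) {
      val all = df.collect()     // acceptable only when the whole result fits in driver memory
    } else {
      val sample = df.take(100)  // take(100) ships just 100 rows back to the driver
    }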
- I do not understand the phrase "hdfs directly from the executor". You
can specify an HDFS file as your input, and you can also use HDFS to
store your output (see the sketch below).
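For example, a rough sketch of HDFS input and output (the paths and the
column name are hypothetical). Note that when you write a DataFrame out,
each executor writes its own partitions to HDFS in parallel; nothing is
gathered on the driver:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-io").getOrCreate()

    // read input directly from HDFS (hypothetical path)
    val df = spark.read.parquet("hdfs:///user/james/input")

    // the executors write the partition files straight to HDFS;
    // no collect() is involved, so driver memory is not a bottleneck
    df.filter("value > 0").write.parquet("hdfs:///user/james/output")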
regards,
Apostolos
On 07/09/2018 05:04 PM, James Starks wrote:
I have a Spark job that reads data from a database. By increasing the
submit parameter '--driver-memory 25g' the job works without a problem
locally, but not in the prod environment because the prod master does
not have enough capacity.
So I have a few questions:
- What functions, such as collect(), would cause the data to be sent
back to the driver program?
My job so far merely uses `as`, `filter`, and `map`.
- Is it possible to write data (in parquet format, for instance) to
hdfs directly from the executor? If so, how can I do that (any code
snippet, doc for reference, or keyword to search for, since I can't
find anything with e.g. `spark direct executor hdfs write`)?
Thanks
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org