Hello All,
We are trying to use a custom appender for the Spark driver and executor pods.
However, changes to the log4j.properties file in the Spark container image
are not taking effect. We even tried simpler changes, like switching the
logging level to DEBUG.
Has anyone run into similar issues, or su
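For reference, a minimal sketch of one common way to force a specific log4j.properties to be picked up, by pointing both JVMs at the file explicitly at submit time (the path below is an assumption about the image layout, not something confirmed in this thread, and it assumes a Spark version that still uses log4j 1.x):

  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties"
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///opt/spark/conf/log4j.properties"

With log4j 1.x, the -Dlog4j.configuration system property takes precedence over whichever log4j.properties would otherwise be found on the classpath.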
I understand the error occurs because the number of partitions is very high,
yet when processing 40 TB (and this number is expected to grow) it seems
reasonable: 40 TB / 300,000 gives a partition size of ~130 MB (assuming the
data is evenly distributed).
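Spelling out that arithmetic (decimal units; nothing here beyond the numbers already quoted above):

  // 40 TB spread evenly over 300,000 partitions
  val totalBytes    = 40e12                        // 40 TB
  val numPartitions = 300000
  val perPartition  = totalBytes / numPartitions   // ~1.33e8 bytes, i.e. the ~130 MB mentioned above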
On Fri, Sep 7, 2018 at 6:28 PM Vadim
You have too many partitions, so when the driver tries to gather the statuses
of all map outputs and send them back to the executors, it chokes on the size
of the structure that needs to be GZipped, and since it is bigger than 2 GiB,
it produces an OOM.
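For what it's worth, a minimal sketch of the mitigation this implies (pairRdd, zeroValue, seqOp, combOp and the partition count are placeholders, not values from this thread): ask for far fewer, larger reduce partitions so the map-output status structure the driver has to serialize stays well under the 2 GiB limit.

  // Placeholder names; the point is simply a numPartitions much smaller than 300,000.
  val fewerPartitions = 40000
  val result = pairRdd.aggregateByKey(zeroValue, fewerPartitions)(seqOp, combOp)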
On Fri, Sep 7, 2018 at 10:35 AM Harel Gliksman wrote:
Yes, I think I am confused: my original thought was that if an executor only
requires 10g, then the driver should ideally not need to consume more than 10g,
or at least not more than 20g. But this is not the case. My configuration sets
--driver-memory to 25g and --executor-memory to 10g. And my prog
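For clarity, the corresponding submit flags look like this (the values are the poster's own configuration, not a recommendation):

  spark-submit \
    --driver-memory 25g \
    --executor-memory 10g \
    ...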
You are mixing things together, and this does not make sense. Writing data to
HDFS does not require that all of the data be transferred back to the driver
and THEN saved to HDFS.
That would be a disaster and it would never scale. I suggest checking the
documentation more carefully, because I believ
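A minimal sketch of the distinction being drawn here (df and the HDFS path are illustrative, reusing names that appear elsewhere in the thread):

  // Pulls every row of df into the driver JVM; this is what requires a large --driver-memory.
  val allRows = df.collect()

  // Runs entirely on the executors: each task writes its own partition of Parquet files
  // straight to HDFS, so nothing is funnelled through the driver.
  df.write.mode("overwrite").parquet("hdfs://path/to/parquet-file")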
Is df.write.mode(...).parquet("hdfs://..") also an action? Checking the docs
shows that my Spark job doesn't use those action functions. But the save
functions look similar to the call
df.write.mode(overwrite).parquet("hdfs://path/to/parquet-file") that my Spark
job uses. Therefore I a
Hi,
We are running a Spark (2.3.1) job on an EMR cluster of 500 r3.2xlarge
instances (60 GB RAM, 8 vCores, 160 GB SSD each). Driver memory is set to 25 GB.
It processes ~40 TB of data using aggregateByKey, in which we specify
numPartitions = 300,000.
Map-side tasks succeed, but reduce-side tasks all fail.
We noti
Dear James,
- Check the Spark documentation to see which actions return a lot of data to
the driver. One of these actions is collect(); take(x) and reduce() are also
actions (see the short sketch after this message).
Before executing collect(), find out the size of your RDD/DataFrame.
- I cannot understan
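A short sketch of the point about actions above (df stands in for the poster's DataFrame):

  val n      = df.count()    // an action, but only a single Long comes back to the driver
  val sample = df.take(10)   // an action with a bounded result: at most 10 rows reach the driver
  val all    = df.collect()  // an action with an unbounded result: the whole dataset lands in driver memory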
I have a Spark job that reads data from a database. By increasing the submit
parameter '--driver-memory 25g' the job works without a problem locally, but
not in the prod environment, because the prod master does not have enough
capacity.
So I have a few questions:
- What functions, such as collect(), would cause the
Got the root cause eventually: it throws java.lang.OutOfMemoryError: Java heap
space. Increasing --driver-memory temporarily fixes the problem. Thanks.
--- Original Message ---
On 7 September 2018 12:32 PM, Femi Anthony
wrote:
> One way I would go about this would be to try running a
One way I would go about this would be to try running a newdf.show(n,
truncate=False) on a few rows before you try writing to parquet, to force
computation of newdf and see whether the hang occurs at that point or during
the write. You may also try doing a newdf.count() as well.
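In code, that suggestion looks roughly like this (a sketch reusing the newdf name from the thread; the row count and output path are placeholders):

  newdf.show(20, truncate = false)  // forces computation of a small prefix of newdf; does it hang here?
  val rows = newdf.count()          // forces a full pass over the data without writing anything
  newdf.write.mode("overwrite").parquet("hdfs://path/to/parquet-file")  // or only here?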
Hey all,
I would like to unsubscribe from this mailing list. Thank you.
Unsubscribe
I am sure everything is as written in my first post.
So this is very confusing to me.
I have a Spark job that reads from a PostgreSQL (v9.5) table and writes the
result to parquet. The code flow is not complicated; basically:
case class MyCaseClass(field1: String, field2: String)
val df = spark.read.format("jdbc")...load()
df.createOrReplaceTempView(...)
val newdf = spa
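For context, a self-contained sketch of the flow described above (the JDBC options, table name, and SQL are placeholders; the poster's actual code is cut off above):

  import org.apache.spark.sql.SparkSession

  object JdbcToParquet {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder().appName("jdbc-to-parquet").getOrCreate()

      // Read the PostgreSQL table over JDBC (connection details are placeholders).
      val df = spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/mydb")
        .option("dbtable", "public.my_table")
        .option("user", "db_user")
        .option("password", "db_password")
        .load()

      // Register a temp view, derive the result with SQL, and write it out as Parquet.
      df.createOrReplaceTempView("my_table")
      val newdf = spark.sql("SELECT field1, field2 FROM my_table")
      newdf.write.mode("overwrite").parquet("hdfs://path/to/parquet-file")

      spark.stop()
    }
  }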