Hi
What is the proper configuration for saving a partitioned Parquet dataset when
the partition column contains a large number of repeated keys?
In the code below I load 500 million rows of data and partition them on a
column that has relatively few distinct values.
I am using spark-shell with 30g for each executor and the driver, and 3 executor cores.
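Roughly, the shell is launched like this (the values mirror the numbers above; the exact invocation may have differed slightly):

  spark-shell --driver-memory 30g --executor-memory 30g --executor-cores 3

The whole job is just the single write below: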
sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
The job failed because the executors ran out of memory:
WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on datanode2.babar.poc: Container killed by YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
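If boosting the overhead as the warning suggests is the right fix, I assume it would be passed at launch time, something like the following (the 4096 MB value is only an illustrative guess on my part):

  spark-shell --driver-memory 30g --executor-memory 30g --executor-cores 3 \
    --conf spark.yarn.executor.memoryOverhead=4096

Is that the right knob here, or is there a better way to configure partitionBy for a low-cardinality column?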