Hi Michael,

Thanks for sharing the tip. It will help with the write path of the partitioned table. Do you have a similar suggestion for reading the partitioned table back when there are millions of distinct values in the partition field (for example, user id)? Last time I had trouble reading a partitioned table because it took a very long time (hours on S3) just to execute sqlContext.read.parquet("partitioned_table").
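The only workaround I can think of so far is to skip full partition discovery and point the reader only at the partition directories I actually need, roughly like the sketch below (the bucket path and the user_id column name are placeholders, and the basePath option assumes Spark 1.6+):

// Sketch only: "s3://bucket/partitioned_table" and "user_id" are placeholder
// names, not from this thread. Requires Spark 1.6+ for the basePath option.
val wantedUserIds = Seq("1001", "1002", "1003")

// Build explicit partition paths so Spark does not have to list every
// user_id=... directory under the table root (the slow step on S3).
val paths = wantedUserIds.map(id => s"s3://bucket/partitioned_table/user_id=$id")

val df = sqlContext.read
  .option("basePath", "s3://bucket/partitioned_table") // keep user_id as a column
  .option("mergeSchema", "false")                      // skip schema merging across files
  .parquet(paths: _*)

But that only works when I know up front which partitions I want, which is why I'm asking whether there is something better.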
Best Regards,
Jerry

Sent from my iPhone

> On 15 Jan, 2016, at 3:59 pm, Michael Armbrust <mich...@databricks.com> wrote:
>
> See here for some workarounds:
> https://issues.apache.org/jira/browse/SPARK-12546
>
>> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> Hi Arkadiusz,
>>
>> partitionBy was not designed to handle many distinct values, the last time I
>> used it. If you search the mailing list, I think a couple of other people
>> have faced similar issues. For example, in my case, it won't work with over
>> a million distinct user ids. It requires a lot of memory and a very long
>> time to read the table back.
>>
>> Best Regards,
>>
>> Jerry
>>
>>> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
>>> wrote:
>>> Hi,
>>>
>>> What is the proper configuration for saving a parquet partition with a
>>> large number of repeated keys?
>>>
>>> In the code below I load 500 million rows of data and partition it on a
>>> column with not many distinct values.
>>>
>>> Using spark-shell with 30g per executor and driver, and 3 executor cores:
>>>
>>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>>
>>> The job failed because the executor ran out of memory:
>>>
>>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>>> spark.yarn.executor.memoryOverhead.
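P.S. For the write-side OOM in the quoted thread, one commonly suggested mitigation (not necessarily the workaround Michael's JIRA link describes) is to repartition by the partition column before writing, so each task keeps only a few parquet writers open. A rough sketch, reusing the paths and column name from the quoted code and assuming Spark 1.6+ for expression-based repartition:

// Rough sketch of the repartition-before-partitionBy workaround.
import org.apache.spark.sql.functions.col

val input = sqlContext.read.load("hdfs://notpartitioneddata")

input.repartition(col("columnname"))   // co-locate each partition key in one task
  .write
  .partitionBy("columnname")           // one output directory per distinct value
  .parquet("partitioneddata")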