Hi Michael,

Thanks for sharing the tip. It will help with the write path of the partitioned table. Do you have a similar suggestion for reading the partitioned table back when there are millions of distinct values in the partition field (for example, user id)? Last time I had trouble reading a partitioned table because it took a very long time (hours on S3) just to execute sqlContext.read.parquet("partitioned_table").
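The only workaround I can think of so far is to skip full partition discovery and point the reader only at the partition directories I actually need, roughly like the sketch below (the bucket path and the user_id column name are placeholders, and the basePath option assumes Spark 1.6+):

// Sketch only: "s3://bucket/partitioned_table" and "user_id" are placeholder
// names, not from this thread. Requires Spark 1.6+ for the basePath option.
val wantedUserIds = Seq("1001", "1002", "1003")

// Build explicit partition paths so Spark does not have to list every
// user_id=... directory under the table root (the slow step on S3).
val paths = wantedUserIds.map(id => s"s3://bucket/partitioned_table/user_id=$id")

val df = sqlContext.read
  .option("basePath", "s3://bucket/partitioned_table") // keep user_id as a column
  .option("mergeSchema", "false")                      // skip schema merging across files
  .parquet(paths: _*)

But that only works when I know up front which partitions I want, which is why I'm asking whether there is something better.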
Best Regards,
Jerry

Sent from my iPhone

> On 15 Jan, 2016, at 3:59 pm, Michael Armbrust <mich...@databricks.com> wrote:
>
> See here for some workarounds:
> https://issues.apache.org/jira/browse/SPARK-12546
>
>> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <chiling...@gmail.com> wrote:
>> Hi Arkadiusz,
>>
>> partitionBy was not designed to handle many distinct values, the last time I
>> used it. If you search the mailing list, I think a couple of other people
>> have faced similar issues. For example, in my case, it won't work with over
>> a million distinct user ids. It requires a lot of memory and a very long
>> time to read the table back.
>>
>> Best Regards,
>>
>> Jerry
>>
>>> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
>>> wrote:
>>> Hi,
>>>
>>> What is the proper configuration for saving a parquet partition with a
>>> large number of repeated keys?
>>>
>>> In the code below I load 500 million rows of data and partition it on a
>>> column with not many distinct values.
>>>
>>> Using spark-shell with 30g per executor and driver, and 3 executor cores:
>>>
>>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>>
>>> The job failed because the executor ran out of memory:
>>>
>>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>>> spark.yarn.executor.memoryOverhead.
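P.S. For the write-side OOM in the quoted thread, one commonly suggested mitigation (not necessarily the workaround Michael's JIRA link describes) is to repartition by the partition column before writing, so each task keeps only a few parquet writers open. A rough sketch, reusing the paths and column name from the quoted code and assuming Spark 1.6+ for expression-based repartition:

// Rough sketch of the repartition-before-partitionBy workaround.
import org.apache.spark.sql.functions.col

val input = sqlContext.read.load("hdfs://notpartitioneddata")

input.repartition(col("columnname"))   // co-locate each partition key in one task
  .write
  .partitionBy("columnname")           // one output directory per distinct value
  .parquet("partitioneddata")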