See here for some workarounds: https://issues.apache.org/jira/browse/SPARK-12546
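A workaround that has helped in similar cases (not necessarily the exact one described in the ticket) is to cluster the data by the partition column before writing, so each task keeps only a few Parquet writers and their column buffers open at a time instead of one per distinct value it happens to see. A minimal sketch, assuming Spark 1.6's repartition-by-column overload and reusing the column and path names from the example below:

  // Read the unpartitioned data, then shuffle rows so that all rows with the
  // same value of "columnname" land in the same task before writing.
  val df = sqlContext.read.load("hdfs://notpartitioneddata")

  df.repartition(df("columnname"))
    .write
    .partitionBy("columnname")
    .parquet("partitioneddata")

The trade-off is an extra full shuffle, and if a few values are very hot, the tasks that own them can become skewed.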
On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Arkadiusz,
>
> partitionBy was not designed to handle many distinct values the last time
> I used it. If you search the mailing list, I think a couple of other
> people have faced similar issues. For example, in my case it would not
> work with over a million distinct user ids. It requires a lot of memory
> and a very long time to read the table back.
>
> Best Regards,
>
> Jerry
>
> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
> wrote:
>
>> Hi
>>
>> What is the proper configuration for saving a parquet partition with a
>> large number of repeated keys?
>>
>> In the code below I load 500 million rows of data and partition it on a
>> column with not that many different values.
>>
>> Using spark-shell with 30g per executor and driver and 3 executor cores:
>>
>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>
>> The job failed because there was not enough memory in the executor:
>>
>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>> spark.yarn.executor.memoryOverhead.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
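For the container-killed error itself, the log already names the knob to try: spark.yarn.executor.memoryOverhead (the off-heap allowance YARN counts against the container, in MB). A sketch of the launch, assuming spark-shell on YARN with the resources from the quoted run; the overhead value is only an illustrative bump over the default, not a tuned recommendation:

  spark-shell \
    --executor-memory 30g \
    --driver-memory 30g \
    --executor-cores 3 \
    --conf spark.yarn.executor.memoryOverhead=4096

This only buys headroom, though; if the write keeps one Parquet writer open per distinct partition value, reducing the number of concurrent writers per task (as in the repartition sketch above) tends to matter more than raising the limit.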