See here for some workarounds:
https://issues.apache.org/jira/browse/SPARK-12546
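
A minimal sketch of one commonly suggested workaround (my own sketch, not taken verbatim from the JIRA), assuming Spark 1.6+ where repartition accepts column expressions, and reusing the paths and column name from the code quoted below. Repartitioning by the partition column first means each task only writes a single output partition, so it keeps far fewer Parquet writers (and their buffers) open at once:

    val df = sqlContext.read.load("hdfs://notpartitioneddata")

    // Shuffle so all rows with the same "columnname" value land in one task;
    // this bounds the number of concurrent Parquet writers per executor.
    // repartition(Column*) requires Spark 1.6 or later.
    df.repartition(df("columnname"))
      .write
      .partitionBy("columnname")
      .parquet("partitioneddata")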

On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Arkadiusz,
>
> partitionBy is not designed to handle many distinct values, at least as of
> the last time I used it. If you search the mailing list, I think a couple
> of people have faced similar issues. For example, in my case, it did not
> work with over a million distinct user ids. It required a lot of memory
> and a very long time to read the table back.
>
> Best Regards,
>
> Jerry
>
> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
> wrote:
>
>> Hi
>>
>> What is the proper configuration for saving partitioned Parquet data
>> with a large number of repeated keys?
>>
>> In the code below I load 500 million rows of data and partition them on
>> a column with not so many distinct values.
>>
>> I am using spark-shell with 30g per executor and driver, and 3 executor
>> cores:
>>
>>
>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>
>>
>> The job failed because there was not enough memory in the executor:
>>
>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>> spark.yarn.executor.memoryOverhead.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
