Hi Michael,
Thanks for sharing the tip. It will help to the write path of the partitioned
table.
Do you have similar suggestion on reading the partitioned table back when there
is a million of distinct values on the partition field (for example on user
id)? Last time I have trouble to read a partitioned table because it takes very
long (over hours on s3) to execute the
sqlcontext.read.parquet("partitioned_table").
Best Regards,
Jerry
Sent from my iPhone
> On 15 Jan, 2016, at 3:59 pm, Michael Armbrust <[email protected]> wrote:
>
> See here for some workarounds:
> https://issues.apache.org/jira/browse/SPARK-12546
>
>> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <[email protected]> wrote:
>> Hi Arkadiusz,
>>
>> the partitionBy is not designed to have many distinct value the last time I
>> used it. If you search in the mailing list, I think there are couple of
>> people also face similar issues. For example, in my case, it won't work over
>> a million distinct user ids. It will require a lot of memory and very long
>> time to read the table back.
>>
>> Best Regards,
>>
>> Jerry
>>
>>> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <[email protected]>
>>> wrote:
>>> Hi
>>>
>>> What is the proper configuration for saving parquet partition with
>>> large number of repeated keys?
>>>
>>> On bellow code I load 500 milion rows of data and partition it on
>>> column with not so many different values.
>>>
>>> Using spark-shell with 30g per executor and driver and 3 executor cores
>>>
>>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>>
>>>
>>> Job failed because not enough memory in executor :
>>>
>>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>>> spark.yarn.executor.memoryOverhead.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>