Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Jerry Lam
Hi Michael, Thanks for sharing the tip. It will help to the write path of the partitioned table. Do you have similar suggestion on reading the partitioned table back when there is a million of distinct values on the partition field (for example on user id)? Last time I have trouble to read a p

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-15 Thread Michael Armbrust
See here for some workarounds: https://issues.apache.org/jira/browse/SPARK-12546 On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam wrote: > Hi Arkadiusz, > > the partitionBy is not designed to have many distinct value the last time > I used it. If you search in the mailing list, I think there are coupl

Re: DataFrameWriter on partitionBy for parquet eat all RAM

2016-01-14 Thread Jerry Lam
Hi Arkadiusz, the partitionBy is not designed to have many distinct value the last time I used it. If you search in the mailing list, I think there are couple of people also face similar issues. For example, in my case, it won't work over a million distinct user ids. It will require a lot of memor