Re: Preserving partitioning with dataframe select

2016-02-09 Thread Michael Armbrust
RDD-level partitioning information is not used to decide when to shuffle for queries planned using Catalyst (since we have better information about distribution from the query plan itself). Instead you should be looking at the logic in EnsureRequirements
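
To make the point above concrete, here is a toy Python model of the idea: the planner compares the partitioning a plan node reports against the distribution the next operator requires, and adds a shuffle only when the two do not match. This is a sketch, not Spark's code. The class and function names are borrowed loosely from Catalyst for readability, and the subset rule and the default of 200 shuffle partitions are assumptions of this model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HashPartitioning:
    """How a plan node's output is physically laid out (toy stand-in)."""
    keys: Tuple[str, ...]
    num_partitions: int


@dataclass(frozen=True)
class ClusteredDistribution:
    """What an operator needs: rows with equal `keys` must be co-located."""
    keys: Tuple[str, ...]


def satisfies(output: Optional[HashPartitioning],
              required: ClusteredDistribution) -> bool:
    # An unknown layout never satisfies a requirement.
    if output is None or not output.keys:
        return False
    # Assumption of this model: hashing on a subset of the required keys
    # already co-locates rows that agree on all required keys.
    return set(output.keys) <= set(required.keys)


def ensure_requirements(output: Optional[HashPartitioning],
                        required: ClusteredDistribution,
                        shuffle_partitions: int = 200):
    """Return (partitioning after this step, whether a shuffle was added)."""
    if satisfies(output, required):
        return output, False
    return HashPartitioning(required.keys, shuffle_partitions), True


by_key = HashPartitioning(("key",), 8)
# A projection that keeps "key" leaves the plan-level partitioning intact,
# so a following aggregation on "key" needs no extra shuffle in this model.
print(ensure_requirements(by_key, ClusteredDistribution(("key",))))
# No known partitioning: the model inserts a shuffle.
print(ensure_requirements(None, ClusteredDistribution(("key",))))
```

The decision depends only on what the plan nodes report, which is why partitioning set on an underlying RDD does not influence it.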

Re: Preserving partitioning with dataframe select

2016-02-08 Thread Matt Cheah
be appreciated!
-Matt Cheah
From: Reynold Xin
Date: Sunday, February 7, 2016 at 11:11 PM
To: Matt Cheah
Cc: "dev@spark.apache.org", Mingyu Kim
Subject: Re: Preserving partitioning with dataframe select
Matt, Thanks for the email. Are you just asking whether it shoul

Re: Preserving partitioning with dataframe select

2016-02-07 Thread Reynold Xin
Matt, Thanks for the email. Are you just asking whether it should work, or reporting that it doesn't work? Internally, the way we track physical data distribution should make the scenarios you described work. If it doesn't, we should fix that.
On Sat, Feb 6, 2016 at 6:49 AM, Matt Cheah wrote: >