Re: Preserving partitioning with dataframe select

2016-02-09 Thread Michael Armbrust
RDD-level partitioning information is not used to decide when to shuffle for queries planned using Catalyst (since we have better information about distribution from the query plan itself). Instead, you should be looking at the logic in EnsureRequirements
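
For readers unfamiliar with the idea, here is a minimal toy sketch of the kind of decision described above: the planner compares what a child operator already provides with what its parent requires, and adds a shuffle only when they differ. This is not Spark's code. `PlanNode`, `HashPartitioning`, `satisfies`, and `ensure_requirements` are hypothetical names for illustration, and the exact-match check is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class HashPartitioning:
    """Toy description of how a plan node's output is distributed."""
    keys: Tuple[str, ...]
    num_partitions: int


@dataclass
class PlanNode:
    """Toy physical plan node (hypothetical, not a Spark class)."""
    name: str
    output_partitioning: Optional[HashPartitioning] = None
    children: List["PlanNode"] = field(default_factory=list)


def satisfies(actual: Optional[HashPartitioning], required_keys: Sequence[str]) -> bool:
    # Simplifying assumption: the child must already be hashed on exactly
    # the required keys. A real planner applies more permissive rules.
    return actual is not None and actual.keys == tuple(required_keys)


def ensure_requirements(child: PlanNode, required_keys: Sequence[str],
                        num_partitions: int = 200) -> PlanNode:
    """Return the child unchanged, or wrap it in a toy Exchange (shuffle) node."""
    if satisfies(child.output_partitioning, required_keys):
        return child  # distribution already matches: no shuffle added
    return PlanNode(
        name="Exchange",
        output_partitioning=HashPartitioning(tuple(required_keys), num_partitions),
        children=[child],
    )


if __name__ == "__main__":
    already_hashed = PlanNode("Scan", HashPartitioning(("id",), 8))
    print(ensure_requirements(already_hashed, ["id"]).name)
    print(ensure_requirements(PlanNode("Scan"), ["id"]).name)
```

The point of the sketch is that the decision is driven by the plan node's declared output partitioning, not by any partitioner attached to an underlying RDD.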

Re: Preserving partitioning with dataframe select

2016-02-08 Thread Matt Cheah
be appreciated! -Matt Cheah From: Reynold Xin Date: Sunday, February 7, 2016 at 11:11 PM To: Matt Cheah Cc: "dev@spark.apache.org" , Mingyu Kim Subject: Re: Preserving partitioning with dataframe select Matt, Thanks for the email. Are you just asking whether it shoul

Re: Preserving partitioning with dataframe select

2016-02-07 Thread Reynold Xin
Matt, Thanks for the email. Are you just asking whether it should work, or reporting that it doesn't work? Internally, the way we track physical data distribution should make the scenarios described work. If it doesn't, we should make them work. On Sat, Feb 6, 2016 at 6:49 AM, Matt Cheah wrote: >

Preserving partitioning with dataframe select

2016-02-05 Thread Matt Cheah
Hi everyone, When using raw RDDs, it is possible to have a map() operation indicate that the partitioning for the RDD would be preserved by the map operation. This makes it easier to reduce the overhead of shuffles by ensuring that RDDs are co-partitioned when they are joined. When I'm using D
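
As a rough illustration of the RDD behaviour described in the message above, the following toy model (plain Python, not the Spark API) shows how a map-style operation can either keep or drop the partitioner, and how that affects whether a join of two datasets would need a shuffle. Spark's RDD API exposes a similar `preservesPartitioning` flag on operations such as `mapPartitions`. Here `ToyPairRDD`, `HashPartitioner`, and `join_needs_shuffle` are hypothetical names, and records are not physically moved between partitions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class HashPartitioner:
    """Toy partitioner; two instances with equal sizes compare equal."""
    num_partitions: int

    def partition_for(self, key: Any) -> int:
        return hash(key) % self.num_partitions


class ToyPairRDD:
    """Toy key/value dataset that only tracks partitioner metadata."""

    def __init__(self, records: Iterable[Tuple[Any, Any]],
                 partitioner: Optional[HashPartitioner] = None):
        self.records = list(records)
        self.partitioner = partitioner

    def partition_by(self, partitioner: HashPartitioner) -> "ToyPairRDD":
        # Only the metadata is recorded; the toy does not move records.
        return ToyPairRDD(self.records, partitioner)

    def map(self, fn: Callable, preserves_partitioning: bool = False) -> "ToyPairRDD":
        # An arbitrary function may change keys, so the partitioner is
        # dropped unless the caller promises that keys are untouched.
        kept = self.partitioner if preserves_partitioning else None
        return ToyPairRDD([fn(r) for r in self.records], kept)

    def map_values(self, fn: Callable) -> "ToyPairRDD":
        # Keys cannot change here, so the partitioner is always kept.
        return ToyPairRDD([(k, fn(v)) for k, v in self.records], self.partitioner)


def join_needs_shuffle(left: ToyPairRDD, right: ToyPairRDD) -> bool:
    """True when at least one side would have to be repartitioned."""
    return left.partitioner is None or left.partitioner != right.partitioner


if __name__ == "__main__":
    p = HashPartitioner(4)
    users = ToyPairRDD([(1, "ann"), (2, "bob")]).partition_by(p)
    orders = ToyPairRDD([(1, 9.5), (2, 3.0)]).partition_by(p)
    print(join_needs_shuffle(users.map(lambda kv: kv), orders))
    print(join_needs_shuffle(users.map_values(str.upper), orders))
```

In this sketch, the plain `map` loses the co-partitioning guarantee even though the function leaves keys alone, which is why the explicit flag (or a values-only operation) matters.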