Re: DataSourceV2 write input requirements

2018-03-27 Thread Russell Spitzer
Thanks for the clarification. We would definitely want to require a sort but only recommend partitioning ... I think that would be useful to request based on details about the incoming dataset. On Tue, Mar 27, 2018 at 4:55 PM Ryan Blue wrote: > A required clustering would not, but a required sort would. …
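[Editor's note: a minimal sketch of the "require a sort, recommend a clustering" idea. None of these names exist in the DataSourceV2 API under discussion; they are hypothetical illustrations only.]

```scala
// Hypothetical sketch only -- none of these names are real DataSourceV2 API.
// It illustrates a writer that *requires* a per-partition sort but merely
// *recommends* a clustering, which Spark could honor or ignore based on what
// it knows about the incoming dataset.
trait SupportsWriteRequirements {
  // Spark must sort the rows of each input partition by these columns.
  def requiredOrdering(): Array[String]
  // Spark may, but need not, cluster the input rows by these columns first.
  def recommendedClustering(): Array[String]
}
```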

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
A required clustering would not, but a required sort would. Clustering is asking for the input dataframe's partitioning, and sorting would be how each partition is sorted. On Tue, Mar 27, 2018 at 4:53 PM, Russell Spitzer wrote: > I forgot since it's been a while, but does Clustering support allow …
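[Editor's note: a sketch of the two kinds of requirement being distinguished here. The names are hypothetical, chosen only to make the contrast concrete.]

```scala
// Hypothetical names, used only to illustrate the distinction above.
sealed trait WriteRequirement

// Clustering: constrains *which* partition each row lands in (the input
// dataframe's partitioning); says nothing about order inside a partition.
case class ClusteredBy(columns: Seq[String]) extends WriteRequirement

// Sort: constrains the order of rows *within* each partition.
case class SortedBy(columns: Seq[String]) extends WriteRequirement
```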

Re: DataSourceV2 write input requirements

2018-03-27 Thread Russell Spitzer
I forgot since it's been a while, but does Clustering support allow requesting that partitions contain elements in order as well? That would be a useful trick for me, i.e. Request/Require(SortedOn(Col1)): Partition 1 -> ((A,1), (A,2), (B,1), (B,2), (C,1), (C,2)) On Tue, Mar 27, 2018 at 4:38 PM Ryan Blue wrote: …
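[Editor's note: the requested layout is achievable today with the DataFrame API; this runnable snippet is an analogy for the Require(SortedOn(Col1)) behavior, not the proposed interface.]

```scala
import org.apache.spark.sql.SparkSession

object SortedPartitionsDemo extends App {
  val spark = SparkSession.builder()
    .master("local[2]").appName("sorted-partitions").getOrCreate()
  import spark.implicits._

  val df = Seq(("C", 1), ("A", 2), ("B", 1), ("A", 1), ("C", 2), ("B", 2))
    .toDF("k", "v")

  // Cluster by k into a single partition, then sort within it by (k, v); the
  // partition then holds (A,1), (A,2), (B,1), (B,2), (C,1), (C,2) in order.
  val laidOut = df.repartition(1, $"k").sortWithinPartitions($"k", $"v")

  // Print each partition's contents to verify the layout.
  laidOut.rdd.glom().collect().foreach(p => println(p.mkString(", ")))
  spark.stop()
}
```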

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
Thanks, it makes sense that the existing interface is for aggregation and not joins. Why are there requirements for the number of partitions that are returned, then? Does it make sense to design the write-side `Requirement` classes and the read-side reporting separately? On Tue, Mar 27, 2018 at 3…
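[Editor's note: for reference, a Scala-fied paraphrase of the read-side reporting shape in Spark 2.3's org.apache.spark.sql.sources.v2.reader.partitioning package; the real definitions are Java interfaces, so treat this as a sketch and check the source for the exact form.]

```scala
// Paraphrased sketch of the read-side reporting interfaces (Spark 2.3).
trait Distribution
case class ClusteredDistribution(clusteredColumns: Array[String]) extends Distribution

trait Partitioning {
  // The number of partitions the source's data is split into -- the
  // partition-count requirement the question above refers to.
  def numPartitions(): Int
  // Whether this partitioning satisfies the given required distribution.
  def satisfy(distribution: Distribution): Boolean
}
```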

Re: DataSourceV2 write input requirements

2018-03-27 Thread Wenchen Fan
Hi Ryan, yeah, you are right that SupportsReportPartitioning doesn't expose a hash function, so Join can't benefit from this interface: Join doesn't require a general ClusteredDistribution, but a more specific one called HashClusteredDistribution. So currently only Aggregate can benefit from SupportsReportPartitioning. …
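[Editor's note: an illustration of the gap described here. A clustered distribution only promises co-location of equal keys within one source, while a join needs both sides to agree on partition numbering. The function below is illustrative, not Spark's internal code.]

```scala
// A ClusteredDistribution-style guarantee says "equal keys share a partition",
// which is enough for Aggregate: each group is processed within one task.
//
// A join needs more: rows with key k from the left AND right side must land
// in the SAME partition number, which requires both sides to agree on one
// hash function and partition count, e.g.:
def joinPartition(key: Any, numPartitions: Int): Int =
  ((key.hashCode % numPartitions) + numPartitions) % numPartitions

// Two sources can each be clustered by `key` yet disagree on partition
// numbering (different hash, different count), so Spark cannot skip the
// shuffle -- this is what HashClusteredDistribution pins down.
```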

Re: [Spark R] Proposal: Exposing RBackend in RRunner

2018-03-27 Thread Jeremy Liu
Spark Dev, on second thought, the topic below seems more appropriate for spark-dev than spark-users: Spark Users, > In SparkR, RBackend is created in RRunner.main(). This in particular makes it difficult to control or use the RBackend. For my use case, I am looking to access the JVMO…
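[Editor's note: for context, RRunner.main() starts the RBackend (the server the R process calls into) as a local detail of launching the R script, which is why outside code cannot reach it. A heavily hedged sketch of the kind of hook being proposed; nothing here is actual SparkR API.]

```scala
// Hypothetical sketch of the proposal: let callers reach the RBackend that
// RRunner creates, instead of it living only inside main(). Not SparkR API.
object RBackendHolder {
  @volatile private var backend: Option[AnyRef] = None
  def set(b: AnyRef): Unit = { backend = Some(b) }  // called once by the runner
  def get: Option[AnyRef] = backend                  // available to user code
}
```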

Re: DataSourceV2 write input requirements

2018-03-27 Thread Ryan Blue
I just took a look at SupportsReportPartitioning and I'm not sure that it will work for real use cases. It doesn't specify, as far as I can tell, a hash function for combining clusters into tasks, or a way to provide Spark with a hash function for the other side of a join. It seems unlikely to me that ma…
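[Editor's note: a hedged sketch of the kind of extension this points at: letting a source expose the function it used to assign keys to partitions, so Spark could shuffle the other join side with the same function. All names are hypothetical.]

```scala
// Hypothetical extension, not part of DataSourceV2: a report that also
// exposes the key-to-partition mapping the source used, so Spark could
// shuffle the other side of a join identically and avoid an extra exchange.
trait HashPartitioned {
  def numPartitions(): Int
  // Must return exactly the partition id the source used for rows with this key.
  def partitionIdFor(key: Seq[Any]): Int
}
```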