Re: rdd.distinct with Partitioner

2016-06-08 Thread 汪洋
Frankly speaking, I think reduceByKey with a Partitioner has the same problem, and it should not be exposed to public users either, because it is hard to fully understand how the partitioner behaves without looking at the actual code. And if there exists a basic contract of a Partitio

Re: rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result.

Why does reduceByKey with a Partitioner exist, then?

On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 wrote:
> Hi Alexander,
>
> I think it does not guarantee to be right if an arbitrary Partitioner is
> passed in.
>
> I have created a note
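The incorrect result can be reproduced without a cluster. The following is a minimal sketch, not Spark's implementation: `reduceByKeySim` is a hypothetical stand-in for a shuffle followed by a per-partition reduce, and its `getPartition` argument plays the role of `Partitioner.getPartition`. It assumes only that each record is routed independently and that values are merged only within a partition.

```scala
import scala.util.Random

object PartitionerContractDemo {
  // Hypothetical, simplified model of reduceByKey (not Spark code):
  // route every record to a partition, then sum values per key inside
  // each partition. Partitions never see each other's records.
  def reduceByKeySim(
      records: Seq[(String, Int)],
      getPartition: String => Int): Seq[(String, Int)] = {
    // "Shuffle": getPartition is called once per record.
    val partitions = records.groupBy { case (k, _) => getPartition(k) }
    // Per-partition reduce with (a, b) => a + b.
    partitions.values.toSeq.flatMap { part =>
      part.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).sum) }
    }
  }

  def main(args: Array[String]): Unit = {
    val n = 4
    val records = Seq.fill(100)(("a", 1)) ++ Seq.fill(100)(("b", 1))

    // Deterministic: the same key always maps to the same partition.
    val hashed = reduceByKeySim(records, k => ((k.hashCode % n) + n) % n)

    // Random: the same key may land in different partitions on each call.
    val rng = new Random(42)
    val random = reduceByKeySim(records, _ => rng.nextInt(n))

    println(s"hash partitioner:   ${hashed.sortBy(_._1)}")
    println(s"random partitioner: ${random.sortBy(_._1)}")
  }
}
```

With the deterministic function each key appears once with its full sum. With the random function the same key is reduced separately in several partitions, so it shows up more than once with partial sums, which is the kind of wrong answer described above.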

Re: rdd.distinct with Partitioner

2016-06-08 Thread Mridul Muralidharan
The example violates the basic contract of a Partitioner.

It does make sense to take a Partitioner as a param to distinct, though it is fairly trivial to simulate that in user code as well ...

Regards
Mridul

On Wednesday, June 8, 2016, 汪洋 wrote:
> Hi Alexander,
>
> I think it does not guarantee
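For illustration, one way the user-code simulation mentioned above might look. This is a sketch that assumes Spark's Scala API with spark-core on the classpath; the helper name `distinctWith` is hypothetical and is not part of Spark or of this thread. It pairs each element with a dummy value, calls the existing `reduceByKey(partitioner, func)` overload, and drops the dummy value again.

```scala
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object DistinctWithPartitioner {
  // Hypothetical helper (not a Spark API): distinct with a caller-supplied
  // Partitioner, built from map / reduceByKey / map.
  // The partitioner must be deterministic: equal elements must always map
  // to the same partition, otherwise duplicates can survive.
  def distinctWith[T: ClassTag](rdd: RDD[T], partitioner: Partitioner): RDD[T] =
    rdd.map(x => (x, null))
      .reduceByKey(partitioner, (x, _) => x)
      .map(_._1)

  def main(args: Array[String]): Unit = {
    // Local master and app name are arbitrary choices for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-with-partitioner")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(Seq(1, 2, 2, 3, 3, 3), numSlices = 3)
      val result = distinctWith(rdd, new HashPartitioner(2)).collect().sorted
      println(result.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Because the helper delegates to reduceByKey, it inherits the same requirement: a partitioner that is not a pure function of the key breaks it in the same way.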

Re: rdd.distinct with Partitioner

2016-06-08 Thread 汪洋
Hi Alexander,

I think the result is not guaranteed to be right if an arbitrary Partitioner is passed in.

I have created a notebook and you can check it out. (https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366