Re: Spark join over sorted columns of dataset.

2017-03-12 Thread Li Jin
I am not an expert on this, but here is what I think: Catalyst maintains information on whether a plan node is ordered. If your DataFrame is the result of an order by, Catalyst will skip the sorting when it does a sort-merge join. If your DataFrame is created from storage, for instance ParquetRelation,
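
To illustrate the idea of a planner that tracks ordering on plan nodes, here is a minimal plain-Python sketch. It is not Catalyst code: `Node`, `ensure_sorted`, and the node names are hypothetical and exist only for this example. The assumption is that each node records the columns its output is sorted by, and a sort is added only when that recorded ordering does not already cover the join keys.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical plan node; `ordering` lists the columns its output is sorted by."""
    name: str
    ordering: Tuple[str, ...] = ()
    child: Optional["Node"] = None


def ensure_sorted(node: Node, keys: Tuple[str, ...]) -> Node:
    """Return `node` unchanged if its known ordering starts with `keys`,
    otherwise wrap it in a (hypothetical) Sort node on `keys`."""
    if node.ordering[:len(keys)] == keys:
        return node  # ordering already satisfies the join, no sort needed
    return Node("Sort", ordering=keys, child=node)


# A node produced by an order by on "id": its ordering is known.
ordered = Node("OrderBy", ordering=("id",))
# A node read from storage: no ordering information is recorded.
scanned = Node("FileScan")

left = ensure_sorted(ordered, ("id",))   # kept as-is
right = ensure_sorted(scanned, ("id",))  # wrapped in a Sort node
```

In this sketch only the side with no recorded ordering pays for a sort, which is the behaviour described above for a DataFrame that comes out of an order by.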

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Koert Kuipers
For RDDs the shuffle is already skipped, but the sort is not. In spark-sorted we track partitioning and sorting within partitions for key-value RDDs and can avoid the sort. See: https://github.com/tresata/spark-sorted For Dataset/DataFrame such optimizations are done automatically, however it's curr
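
As a rough illustration of why known sorting lets the sort be avoided, here is a small plain-Python merge join over two inputs that are assumed to be already sorted by key. It does not use Spark or spark-sorted; `merge_join` is a hypothetical helper written only for this sketch, and it stands in for what happens within one pair of co-partitioned, pre-sorted partitions.

```python
from typing import Iterator, List, Tuple


def merge_join(
    left: List[Tuple[int, str]],
    right: List[Tuple[int, str]],
) -> Iterator[Tuple[int, str, str]]:
    """Inner join of two lists of (key, value) pairs.

    Assumes both lists are already sorted by key, so no sort step is run:
    the two inputs are walked once, in step.
    """
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit every pairing within the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    yield lk, lv, rv
            i, j = i_end, j_end


rows = list(merge_join(
    [(1, "a"), (2, "b"), (2, "c"), (4, "d")],
    [(2, "x"), (3, "y"), (4, "z")],
))
```

If either input were not sorted, this function would silently miss matches, which is why the ordering has to be tracked reliably before the sort can be dropped.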

Re: Spark join over sorted columns of dataset.

2017-03-03 Thread Rohit Verma
Sending this to the dev list. Can you please help by providing some ideas for the question below? Regards, Rohit > On Feb 23, 2017, at 3:47 PM, Rohit Verma wrote: > > Hi > > When joining on columns from two different datasets, how can the join be optimized if both > columns are pre-sorted within their datasets? > So that when sp