Done, thanks.

On Mon, Nov 9, 2015 at 7:23 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
> Yes, we definitely need to think about how to handle this case; it is
> probably even more common than the case where both tables are
> sorted/partitioned. Can you jump to the JIRA and leave a comment there?
>
> *From:* Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
> *Sent:* Tuesday, November 10, 2015 3:03 AM
> *To:* Cheng, Hao
> *Cc:* Reynold Xin; dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
> Thanks for creating that ticket.
>
> Another thing I was thinking of is doing this type of join between
> dataset A, which is already partitioned/sorted on disk, and dataset B,
> which gets generated during the run of the application.
>
> Dataset B would need something like repartitionAndSortWithinPartitions to
> be performed on it, using the same partitioner that was used with dataset
> A. Then dataset B could be joined with dataset A without needing to write
> it to disk first (unless it's too big to fit in memory, in which case it
> would need to be [partially] spilled).
>
> On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>
> Yes, we probably need more changes to the data source API if we need to
> implement it in a generic way.
>
> BTW, I created the JIRA by copying most of the words from Alex. :)
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
> It's not supported yet, and I'm not sure if there is a ticket for it. I
> don't think there is anything fundamentally hard here either.
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
> (this is kind of a cross-post from the user list)
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
> Thanks.
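
For anyone reading this thread in the archive, here is a rough sketch of the idea being discussed: if dataset A and dataset B are split by the same partitioner into the same number of partitions, and each partition is sorted by key, the join can be done partition by partition with a single merge pass and no further shuffle or sort. This is plain Python, not Spark code; the names partition_of, partition_and_sort, merge_join and partitioned_merge_join are made up for illustration, and partition_and_sort only stands in for what something like repartitionAndSortWithinPartitions would do to dataset B.

```python
def partition_of(key, num_partitions):
    # Hypothetical partitioner; A and B must both use this exact function.
    return hash(key) % num_partitions


def partition_and_sort(records, num_partitions):
    """Split (key, value) pairs into partitions and sort each one by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_of(key, num_partitions)].append((key, value))
    return [sorted(p, key=lambda kv: kv[0]) for p in partitions]


def merge_join(left, right):
    """Inner-join two lists of (key, value) pairs that are sorted by key."""
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on each side, then emit all pairs.
            i_end = i
            while i_end < len(left) and left[i_end][0] == left_key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            for _, left_value in left[i:i_end]:
                for _, right_value in right[j:j_end]:
                    yield (left_key, left_value, right_value)
            i, j = i_end, j_end


def partitioned_merge_join(a_parts, b_parts):
    """Join co-partitioned, per-partition-sorted datasets without a shuffle."""
    if len(a_parts) != len(b_parts):
        raise ValueError("both sides need the same number of partitions")
    for a_part, b_part in zip(a_parts, b_parts):
        yield from merge_join(a_part, b_part)


# Dataset A is prepared ahead of time; dataset B is produced later and only
# needs the same partition/sort step before it can be merged against A.
a_parts = partition_and_sort([(3, "a3"), (1, "a1"), (2, "a2"), (1, "a1b")], 2)
b_parts = partition_and_sort([(1, "b1"), (3, "b3"), (4, "b4")], 2)
joined = sorted(partitioned_merge_join(a_parts, b_parts))
```

Matching keys always land in the same partition index on both sides because the partitioner is shared, which is why each pair of partitions can be merged independently.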