Re: Sort Merge Join from the filesystem

Alex Nastetsky Mon, 16 Nov 2015 09:41:47 -0800

Done, thanks.

On Mon, Nov 9, 2015 at 7:23 PM, Cheng, Hao <[email protected]> wrote:


> Yes, we definitely need to think about how to handle this case; it is
> probably even more common than the case where both tables are already
> sorted/partitioned. Could you jump to the JIRA and leave a comment there?
>
>
>
> *From:* Alex Nastetsky [mailto:[email protected]]
> *Sent:* Tuesday, November 10, 2015 3:03 AM
> *To:* Cheng, Hao
> *Cc:* Reynold Xin; [email protected]
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> Thanks for creating that ticket.
>
>
>
> Another thing I was thinking of is doing this type of join between
> dataset A, which is already partitioned/sorted on disk, and dataset B,
> which gets generated during the run of the application.
>
>
>
> Dataset B would need something like repartitionAndSortWithinPartitions to
> be performed on it, using the same partitioner that was used with dataset
> A. Then dataset B could be joined with dataset A without needing to write
> it to disk first (unless it's too big to fit in memory, then it would need
> to be [partially] spilled).
>
>
>
> On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao <[email protected]> wrote:
>
> Yes, we will probably need more changes to the data source API if we want
> to implement it in a generic way.
>
> BTW, I created the JIRA by copying most of the words from Alex. :)
>
>
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
>
>
>
>
> *From:* Reynold Xin [mailto:[email protected]]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* [email protected]
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> It's not supported yet, and I'm not sure if there is a ticket for it. I
> don't think there is anything fundamentally hard here either.
>
>
>
>
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> [email protected]> wrote:
>
> (this is kind of a cross-post from the user list)
>
>
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same way, with the same
> number of partitions, and sorted within each partition, without needing
> to repartition/sort them again?
>
>
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
>
> - Pig (USING 'merge')
>
> - MapReduce (CompositeInputFormat)
>
>
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
>
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
>
>
> Thanks.
>
>
>
>
>
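
The join discussed in this thread can be sketched outside Spark in plain
Python. This is only an illustration under stated assumptions, not Spark's
implementation and not the design proposed in SPARK-11512. The function
names (partition_and_sort, merge_join, co_partitioned_join) are hypothetical,
and a simple hash-modulo partitioner is assumed. partition_and_sort stands in
for what repartitionAndSortWithinPartitions would do to dataset B; dataset A
is assumed to have been prepared the same way ahead of time.

```python
from itertools import groupby
from operator import itemgetter


def partition_and_sort(records, num_partitions, partitioner=hash):
    """Bucket (key, value) pairs by partitioner(key) % num_partitions,
    then sort each bucket by key (hypothetical helper)."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partitioner(key) % num_partitions].append((key, value))
    for part in partitions:
        part.sort(key=itemgetter(0))  # stable sort on the key only
    return partitions


def merge_join(left, right):
    """Inner-join two key-sorted lists of (key, value) pairs in one pass."""
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    lgroup = next(left_groups, None)
    rgroup = next(right_groups, None)
    while lgroup is not None and rgroup is not None:
        lkey, rkey = lgroup[0], rgroup[0]
        if lkey < rkey:
            lgroup = next(left_groups, None)   # left side is behind
        elif lkey > rkey:
            rgroup = next(right_groups, None)  # right side is behind
        else:
            # Matching keys: buffer the right-hand values, emit the pairs.
            right_values = [value for _, value in rgroup[1]]
            for _, left_value in lgroup[1]:
                for right_value in right_values:
                    yield (lkey, left_value, right_value)
            lgroup = next(left_groups, None)
            rgroup = next(right_groups, None)


def co_partitioned_join(parts_a, parts_b):
    """Join partition i of A only with partition i of B; no re-shuffle."""
    if len(parts_a) != len(parts_b):
        raise ValueError("both sides must have the same number of partitions")
    for part_a, part_b in zip(parts_a, parts_b):
        yield from merge_join(part_a, part_b)


if __name__ == "__main__":
    # Dataset A would normally be read back from disk already prepared.
    parts_a = partition_and_sort([(3, "a3"), (1, "a1"), (4, "a4")], 2)
    # Dataset B is produced at run time and given the same layout.
    parts_b = partition_and_sort([(1, "b1"), (4, "b4"), (5, "b5")], 2)
    for row in co_partitioned_join(parts_a, parts_b):
        print(row)
```

The point of the sketch is that once both sides share a partitioner and a
partition count, each pair of partitions can be joined independently with a
single forward pass, so the side that is already on disk never has to be
shuffled or sorted again.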
