Hi,
We are trying to join two data sets. One of them is smaller and fairly
stable; the other is volatile and much larger. Neither fits in memory.

So our idea is to pre-sort the smaller data set and cache it across
multiple partitions. Then we use the same logic to sort and partition the
larger data set. In theory, records in the same key range will land on the
same node after these steps, and we can then join them. This way, we only
need to sort and partition the larger data set each time a new batch
arrives.
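
For concreteness, here is a rough sketch of what we have in mind in Spark
(Scala). The paths, key parsing, and partition count are placeholders; the
point is that partitioning both sides with the same partitioner and caching
the smaller side should, as I understand it, let Spark join them without
re-shuffling the cached side:

import org.apache.spark.{SparkConf, SparkContext, RangePartitioner}
import org.apache.spark.storage.StorageLevel

object CoPartitionedJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("co-partitioned-join"))

    // Smaller, stable data set: keyed records, e.g. tab-delimited "key<TAB>value".
    val small = sc.textFile("hdfs:///data/small")      // placeholder path
      .map { line => val f = line.split('\t'); (f(0), f(1)) }

    // Range-partition by key (mirrors the "same key range on the same node"
    // idea) and cache it; spill to disk since it does not fit in memory.
    val partitioner = new RangePartitioner(64, small)  // placeholder partition count
    val smallPart = small.partitionBy(partitioner)
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Larger, volatile data set: re-partition with the SAME partitioner
    // each time a new batch arrives.
    val large = sc.textFile("hdfs:///data/large")      // placeholder path
      .map { line => val f = line.split('\t'); (f(0), f(1)) }
      .partitionBy(partitioner)

    // Because both sides share one partitioner, the join should be a narrow
    // dependency: only the new batch is shuffled, not the cached small side.
    val joined = large.join(smallPart)
    joined.saveAsTextFile("hdfs:///data/joined")       // placeholder path
    sc.stop()
  }
}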

The question is: does Spark already have an implementation of this? I know
it can be done in Hadoop M/R using CompositeInputFormat, but I am not sure
whether Spark has a counterpart.

Many thanks.


Bill

