Hi,

We are trying to join two sets of data. One of them is smaller and fairly stable; the other is volatile and much larger. Neither can be loaded into memory.
So our idea is to pre-sort the smaller data set and cache it in multiple partitions. Then we use the same logic to sort and partition the larger data set. In theory, after these steps, records in the same key range end up on the same node, and we then join them together. By doing so, we only need to sort and partition the larger data set each time it comes in.

The question is: is there an implementation for this in Spark already? I know this can be done in Hadoop M/R using CompositeInputFormat, but I am not sure whether Spark already has a counterpart.

Many thanks.
Bill
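To make the idea concrete, here is a plain-Python sketch of what we have in mind (helper names like partition_by_key and merge_join_partition are just illustrative, not any Spark API). If I understand correctly, in Spark's RDD API this would correspond to calling partitionBy with the same Partitioner on both pair RDDs and persisting the small, stable side, so that a later join does not re-shuffle the cached side.

```python
def partition_by_key(records, num_partitions):
    """Route (key, value) records to partitions by hash(key); sort each partition.
    Both data sets use this same function, so equal keys land in the
    same partition index on both sides."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash(key) % num_partitions].append((key, value))
    for part in parts:
        part.sort(key=lambda kv: kv[0])
    return parts

def merge_join_partition(small, large):
    """Merge-join two sorted partitions; assumes unique keys on the small side."""
    out, i = [], 0
    for key, large_val in large:
        while i < len(small) and small[i][0] < key:
            i += 1
        if i < len(small) and small[i][0] == key:
            out.append((key, (small[i][1], large_val)))
    return out

NUM_PARTITIONS = 4
small = [("a", 1), ("b", 2), ("c", 3)]
large = [("a", "x"), ("b", "y"), ("a", "z"), ("d", "w")]

small_parts = partition_by_key(small, NUM_PARTITIONS)  # done once, then cached
large_parts = partition_by_key(large, NUM_PARTITIONS)  # redone per incoming batch
joined = [pair
          for p in range(NUM_PARTITIONS)
          for pair in merge_join_partition(small_parts[p], large_parts[p])]
# inner join: keys missing on either side ("c", "d") are dropped
```

This is only meant to show the partition-then-merge shape of the plan, not how we would actually run it at scale.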