Hi,

If I understand correctly: rdd1 contains keys (of type StringDate), rdd2 contains keys and values, and rdd3 should contain all the keys from rdd1 together with the values from rdd2?
I think you should make rdd1 and rdd2 PairRDDs and then use an outer join (a left outer join from rdd1, so that every key in rdd1 is kept and keys missing from rdd2 get an empty value). Does that make sense?

On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu <zchl.j...@yahoo.com> wrote:

> Dear Romi, Priya, Sujt and Shivaram and all,
>
> I have spent many days thinking about this issue, but without finding a
> good enough solution...
> I would appreciate all your kind help.
>
> There is an RDD<StringDate> rdd1, and another RDD<StringDate, float> rdd2
> (rdd2 can be a PairRDD, or a DataFrame with two columns as <StringDate, float>).
> The StringDate column values of rdd1 and rdd2 overlap but are not the same.
>
> I would like to get a new RDD<StringDate, float> rdd3. The StringDate values
> in rdd3 would all be the same as in rdd1, and the float in rdd3 would come
> from rdd2 if its StringDate is in rdd2, or else NULL would be assigned.
> Each row is rdd3[ i ] = <rdd1[ i ].StringDate, rdd2[ i ].float or NULL>:
> where rdd2[ i ].StringDate is the same as rdd1[ i ].StringDate,
> rdd2[ i ].float is assigned to the corresponding rdd3[ i ].
> What kind of API or function should I use...
>
> Thanks very much!
> Zhiliang
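To make the outer-join suggestion above concrete, here is a minimal sketch of the left-outer-join semantics in plain Python, with no Spark dependency. The function name `left_outer_join` and the sample dates and values are made up for illustration, and the sketch assumes each key appears at most once in rdd2. Unlike this sketch, a real Spark join does not guarantee the row order of rdd1, and a key that occurs several times in rdd2 would produce several rows.

```python
def left_outer_join(keys, pairs):
    """Return (key, value_or_None) for every key in `keys`, in input order.

    Hypothetical helper that mimics a left outer join on in-memory lists.
    Assumes each key appears at most once in `pairs`.
    """
    lookup = dict(pairs)
    # dict.get returns None when the key has no match, like NULL in the question.
    return [(k, lookup.get(k)) for k in keys]


# Made-up sample data standing in for rdd1 (keys only) and rdd2 (key, float).
rdd1 = ["2015-09-18", "2015-09-19", "2015-09-20"]
rdd2 = [("2015-09-19", 1.5), ("2015-09-21", 2.5)]

rdd3 = left_outer_join(rdd1, rdd2)
print(rdd3)
# [('2015-09-18', None), ('2015-09-19', 1.5), ('2015-09-20', None)]

# Rough PySpark equivalent (not run here; requires a SparkContext):
#   rdd1.map(lambda k: (k, None)).leftOuterJoin(rdd2).mapValues(lambda v: v[1])
```

The key "2015-09-21" exists only in rdd2, so it is dropped, while every key from rdd1 is kept, with None where rdd2 has no value.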