Hi Ian,
If I understand what you're after, you might find "zip" useful. From the docs:
Zips this RDD with another one, returning key-value pairs with the first
element in each RDD, second element in each RDD, etc. Assumes that the two RDDs
have the *same number of partitions* and the *same number of elements in each
partition* (e.g. one was made through a map on the other).
Here's a toy example:
>> val rdd1 = sc.parallelize(Array("name1", "name2", "name3"), 3)
>> val rdd2 = sc.parallelize(Array("sign1", "sign2", "sign3"), 3)
>> rdd1.collect()
Array[String] = Array(name1, name2, name3)
>> rdd2.collect()
Array[String] = Array(sign1, sign2, sign3)
>> rdd1.zip(rdd2).collect()
Array[(String, String)] = Array((name1,sign1), (name2,sign2), (name3,sign3))
In your case, you could compute each single-column RDD with a map over the same
raw data. That guarantees they have matching partitions and element counts, which
is exactly what zip requires.
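As a rough sketch of that idea (not tested against your data): the code below assumes a hypothetical raw RDD of (name, sign, age) tuples and a local SparkContext. The object name, method name, and sample values are all made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ZipColumns {
  // Builds rddA = {Names, Age} and rddB = {Names, Star Sign} from one parent RDD.
  def columns(sc: SparkContext): (RDD[(String, Int)], RDD[(String, String)]) = {
    // Hypothetical raw records: (name, star sign, age), spread over 3 partitions.
    val raw = sc.parallelize(Seq(
      ("name1", "sign1", 31),
      ("name2", "sign2", 42),
      ("name3", "sign3", 27)), 3)

    // Each "column" is a map over the same parent, so the partition count and
    // the number of elements per partition match across all three.
    val names = raw.map(_._1).cache()
    val signs = raw.map(_._2).cache()
    val ages  = raw.map(_._3).cache()

    // zip pairs elements positionally, partition by partition.
    (names.zip(ages), names.zip(signs))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, just to run the sketch.
    val conf = new SparkConf().setAppName("zip-columns").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (rddA, rddB) = columns(sc)
      rddA.collect().foreach(println)
      rddB.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The cache() calls are only there because you mentioned wanting to avoid recalculation; zip itself doesn't need them.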
-- Jeremy
---------------------
Jeremy Freeman, PhD
Neuroscientist
@thefreemanlab
On Apr 19, 2014, at 12:59 AM, Ian Ferreira <[email protected]> wrote:
>
> This may seem contrived, but suppose I wanted to create a collection of
> "single column" RDDs that contain calculated values, and I want to cache
> these to avoid re-calculating them.
>
> i.e.
>
> rdd1 = {Names}
> rdd2 = {Star Sign}
> rdd3 = {Age}
>
> Then I want to create a new virtual RDD that combines these RDDs
> into a "multi-column" RDD
>
> rddA = {Names, Age}
> rddB = {Names, Star Sign}
>
> I saw that rdd.union() merges rows, but is there anything that can combine columns?
>
> Cheers
> - Ian