Harihar, your question is the opposite of what was asked. In the future,
please start a new thread for new questions.

You want a join in your case. The join function performs an inner join,
which I think is what you want, since you stated the IDs are common to
both RDDs. For other cases, look at leftOuterJoin, rightOuterJoin, and
cogroup (alias groupWith). These are all on PairRDDFunctions (in Scala)
[1]
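For the data in the thread below, an inner join on the ID key followed by a map gives the (ID, name, count) shape you asked for. Here is a minimal sketch using plain Scala collections to mirror the RDD semantics (on pair RDDs the equivalent call is simply x.join(y)); the sample IDs and names are copied from the quoted message:

```scala
// Inner join on a common key, mirrored with plain Scala collections.
// On pair RDDs: x.join(y) gives RDD[(Long, (Int, String))].
val counts = Seq(4407L -> 40, 2064L -> 38, 7815L -> 10)               // (ID, count)
val names  = Map(4407L -> "Jhon", 2064L -> "Maria", 7815L -> "Casto") // (ID, name)

// Keep only IDs present on both sides, then flatten to (ID, name, count)
val joined: Seq[(Long, String, Int)] = for {
  (id, count) <- counts
  name        <- names.get(id)
} yield (id, name, count)
```

On real RDDs you would follow the join with a map to flatten the nested pair, e.g. `x.join(y).map { case (id, (count, name)) => (id, name, count) }`.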

Alternatively, if one of your RDDs is small, you could collect it and
broadcast the collection, then use it in functions on the other RDD. But I
don't think this applies in your case, since the two RDDs have an equal
number of records. [2]
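To sketch that map-side pattern (a hedged example; the variable names are my own): collect the small RDD into a local map, broadcast it with sc.broadcast, and look keys up inside a flatMap on the large RDD. Plain Scala collections below stand in for the RDD operations:

```scala
// Map-side (broadcast) join sketch. In Spark this would be roughly:
//   val bc = sc.broadcast(smallMap)
//   big.flatMap { case (id, c) => bc.value.get(id).map(n => (id, n, c)) }
// Plain collections mirror the same lookup logic:
val smallMap = Map(4407L -> "Jhon", 2064L -> "Maria")    // collected small side
val big      = Seq((4407L, 40), (2064L, 38), (9999L, 7)) // large side records

// IDs missing from the broadcast map are dropped, as in an inner join
val joined = big.flatMap { case (id, count) =>
  smallMap.get(id).map(name => (id, name, count))
}
```

The advantage over a regular join is that no shuffle is needed: each task does a local hash-map lookup.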

[1]
http://spark.apache.org/docs/latest/programming-guide.html#working-with-key-value-pairs
[2]
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

On Thu, Nov 20, 2014 at 4:53 PM, Harihar Nahak <hna...@wynyardgroup.com>
wrote:

> I have a similar type of issue: I want to join two different types of RDD
> into one RDD.
>
> file1.txt content (ID, counts)
> val x: RDD[(Long, Int)] = sc.textFile("file1.txt").map(line =>
> line.split(",")).map(row => (row(0).toLong, row(1).toInt))
> [(4407 ,40),
> (2064, 38),
> (7815 ,10),
> (5736,17),
> (8031,3)]
>
> Second RDD from : file2.txt contains (ID, name)
> val y: RDD[(Long, String)]  // where ID is common to both RDDs
> [(4407 ,Jhon),
> (2064, Maria),
> (7815 ,Casto),
> (5736,Ram),
> (8031,XYZ)]
>
> and I'm expecting the result to look like this: [(ID, Name, Count)]
> [(4407 ,Jhon, 40),
> (2064, Maria, 38),
> (7815 ,Casto, 10),
> (5736,Ram, 17),
> (8031,XYZ, 3)]
>
>
> Any help would be really appreciated. Thanks
>
>
>
>
> On 21 November 2014 09:18, dsiegmann [via Apache Spark User List] wrote:
>
>> You want to use RDD.union (or SparkContext.union for many RDDs). These
>> don't join on a key. Union doesn't really do anything itself, so it is low
>> overhead. Note that the combined RDD will have all the partitions of the
>> original RDDs, so you may want to coalesce after the union.
>>
>> val x = sc.parallelize(Seq( (1, 3), (2, 4) ))
>> val y = sc.parallelize(Seq( (3, 5), (4, 7) ))
>> val z = x.union(y)
>>
>> z.collect
>> res0: Array[(Int, Int)] = Array((1,3), (2,4), (3,5), (4,7))
>>
>>
>> On Thu, Nov 20, 2014 at 3:06 PM, Blind Faith wrote:
>>
>>> Say I have two RDDs with the following values
>>>
>>> x = [(1, 3), (2, 4)]
>>>
>>> and
>>>
>>> y = [(3, 5), (4, 7)]
>>>
>>> and I want to have
>>>
>>> z = [(1, 3), (2, 4), (3, 5), (4, 7)]
>>>
>>> How can I achieve this? I know you can use an outer join followed by a
>>> map to achieve this, but is there a more direct way?
>>>
>>
>>
>>
>> --
>> Daniel Siegmann, Software Developer
>> Velos
>> Accelerating Machine Learning
>>
>> 54 W 40th St, New York, NY 10018
>> E: [hidden email]
>> W: www.velos.io
>>
>>
>>
>
>
>
> --
> Regards,
> Harihar Nahak
> BigData Developer
> Wynyard
> [hidden email] | Extn: 8019
>  --Harihar
>
>



-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
