Re: Efficiently doing an analysis with Cartesian product (pyspark)

Aaron Mon, 23 Jun 2014 15:04:26 -0700

Sorry, I got my sample outputs wrong. The correct values are:

(1,1) -> 400
(1,2) -> 500
(2,2) -> 600
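
As a check on these numbers, here is a minimal plain-Python sketch (no Spark
required) of the calculation described in the quoted message: take the maximum
value per key, then sum the maxima for each key pair. The helper name
pair_max_sums is hypothetical and only for illustration. In PySpark the two
steps would roughly correspond to reduceByKey(max) followed by cartesian() and
map(), but that mapping is an assumption and is not tested here.

```python
from itertools import combinations_with_replacement


def pair_max_sums(pairs):
    """Map each key pair (a, b) with a <= b to max-value(a) + max-value(b)."""
    # Step 1: largest value seen for each key
    # (assumed PySpark analogue: rdd.reduceByKey(max)).
    max_by_key = {}
    for key, value in pairs:
        max_by_key[key] = max(value, max_by_key.get(key, value))

    # Step 2: every unordered key pair, including self-pairs.
    # Note: a full Cartesian product would also contain (2, 1); it is
    # omitted here because it carries the same sum as (1, 2).
    keys = sorted(max_by_key)
    return {
        (a, b): max_by_key[a] + max_by_key[b]
        for a, b in combinations_with_replacement(keys, 2)
    }


# Sample data from the original message.
data = [(1, 100), (1, 200), (2, 100), (2, 300)]
result = pair_max_sums(data)
for pair in sorted(result):
    print(pair, "->", result[pair])
```

Because the per-key maxima are computed once up front, nothing that refers to
the original data set has to be handed to the per-pair step.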


On Jun 23, 2014, at 4:29 PM, "Aaron Dossett [via Apache Spark User List]" 
<ml-node+s1001560n8144...@n3.nabble.com> wrote:

I am relatively new to Spark and am getting stuck trying to do the following:

- My input is integer key, value pairs where the key is not unique.  I'm 
interested in information about all possible distinct key combinations, thus 
the Cartesian product.
- My first attempt was to create a separate RDD of this Cartesian product and 
then use map() to calculate the data.  However, I was passing another RDD to 
the function that map() was calling, which I eventually figured out was causing 
a runtime error, even if the function I called with map() did nothing.  Here's 
a simple code example:

-------
def somefunc(x, y, RDD):
  return 0

input = sc.parallelize([(1,100), (1,200), (2, 100), (2,300)])

#Create all pairs of keys, including self-pairs
itemPairs = input.map(lambda x: x[0]).distinct()
itemPairs = itemPairs.cartesian(itemPairs)

print itemPairs.collect()

TC = itemPairs.map(lambda x: (x, somefunc(x[0], x[1], input)))

print TC.collect()
-------

I'm assuming this isn't working because it isn't a very Spark-like way to do 
things, and I could imagine that passing RDDs into other RDDs' map functions 
might not make sense.  Could someone suggest a way to apply transformations 
and actions to "input" that would produce a mapping of key pairs to some 
information related to the values?

For example, I might want (1, 2) to map to the sum of the maximum values 
found for each key in the input (500 in my sample data above).  Extending that 
example, (1,1) would map to 300 and (2,2) to 400.

Please let me know if I should provide more details or a more robust example.

Thank you, Aaron

