I was able to get a couple of relevant answers to this on StackOverflow:

http://stackoverflow.com/questions/34339300/nesting-parallelizations-in-spark-whats-the-right-approach/34340986#34340986

http://stackoverflow.com/questions/34386086/casting-long-to-double-inside-109-for-loop-really-bad-idea?noredirect=1#comment56515911_34386086

Apparently Scala allows the use of a Range inside parallelize(), and Java 8
should have something similar, but I have not tested it yet.
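As a rough illustration (untested against a real cluster, as noted above), the same idea carries over to PySpark: a lazy range can be passed straight to parallelize(), mirroring Scala's `sc.parallelize(1 to 1000000)`. The point is that range() never materializes the full collection on the driver; the `sc` handle in the comment is assumed, not defined here.

```python
import sys

lazy = range(10**6)          # constant-size object; elements computed on demand
eager = list(range(10**6))   # actually allocates all 10^6 entries in memory

# The lazy range is tiny compared to the materialized list:
print(sys.getsizeof(lazy) < sys.getsizeof(eager))  # True

# With a live SparkContext `sc` (sketch, not run here), the RDD would be:
# rdd = sc.parallelize(range(10**6), numSlices=100)
```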

For the question about "nested for loops" in Spark ... you can't really do
that directly, but there's always a way to think Sparkily and accomplish the
same thing. For example, in my case I wanted to create a 10^6-element RDD
and then iterate over it 1000 times.

Still testing this out, but the advice I got (the answer to the first SO
question above) is to build the RDD to be 1000*10^6 elements long and then
iterate as needed. I could use mapPartitions() instead of map(), and have
each subset of the main set be its own partition.
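A minimal sketch of that flattening trick, in plain Python so it runs without Spark: the nested loop becomes one flat index range, and each flat index k recovers its (iteration, element) pair with divmod. The `sc`, `numSlices`, and `work` names in the trailing comment are hypothetical stand-ins for how this would feed into mapPartitions.

```python
ITERATIONS = 1000
N = 10**6  # elements per "inner loop"

def unflatten(k):
    """Recover (iteration, element) from a flat index in [0, ITERATIONS*N)."""
    return divmod(k, N)

# The nested loop
#   for i in range(1000):
#       for j in range(10**6): ...
# becomes one flat range that Spark can partition freely:
flat = range(ITERATIONS * N)

assert unflatten(0) == (0, 0)                                # first element, iteration 0
assert unflatten(N) == (1, 0)                                # first element, iteration 1
assert unflatten(ITERATIONS * N - 1) == (ITERATIONS - 1, N - 1)  # very last element

# In Spark (sketch, not run): each of the 1000 "iterations" could be one
# partition, processed whole with mapPartitions:
# rdd = sc.parallelize(flat, numSlices=ITERATIONS)
# result = rdd.mapPartitions(lambda part: (work(*unflatten(k)) for k in part))
```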

This also brought up the point that using a List to init an RDD caps you at
MAXINT elements, since that is the maximum size of a Java collection. But if
you NEED more than 2.147*10^9 RDD elements, I have it on good authority that
the RDD itself has no such cap -- the actual RDD size limit is just your
resources.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-meet-nested-loop-on-pairRdd-tp21121p25757.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
