I was having trouble with memory exceptions when broadcasting a large lookup
table, so I've resorted to splitting it up and applying the pieces
iteratively -- but how can I modify an RDD iteratively?

I'm trying something like:

rdd = sc.parallelize(...)
lookup_tables = {...}

for lookup_table in lookup_tables:
    rdd = rdd.map(lambda x: func(x, lookup_table))

If I leave it as is, then only the last lookup_table is applied instead of
stringing together all the maps. However, if I add a .cache() after the
.map, then it seems to work fine.
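
If it helps, here is a minimal standalone version of the pattern (func and
the table contents are just toy stand-ins for my real ones). I'm guessing
the lambdas are closing over the loop variable, which Python binds late, so
by the time the job actually runs they all see the last table:

from pyspark import SparkContext

sc = SparkContext("local", "repro")

def func(x, table):
    # translate x through one table, leaving it unchanged if absent
    return table.get(x, x)

rdd = sc.parallelize([1, 2, 3])
lookup_tables = {"a": {1: 10, 2: 20}, "b": {10: 100}}

for name in lookup_tables:
    table = lookup_tables[name]
    rdd = rdd.map(lambda x: func(x, table))

# I would expect [100, 20, 3] if each map used its own table, but this
# prints [1, 2, 3] -- every lambda ends up using the last table
print(rdd.collect())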

A second problem is that the runtime roughly doubles with each iteration,
so this clearly isn't the right way to do it. What is the preferred way of
doing such repeated modifications to an RDD, and how can the accumulation
of overhead be minimized?
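
Concretely, is the preferred approach something like the sketch below --
pinning each table into its lambda with a default argument, and
checkpointing every few iterations so the lineage doesn't keep growing?
(The checkpoint directory and the interval of 5 are placeholders I made up.)

sc.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

for i, name in enumerate(lookup_tables):
    table = lookup_tables[name]
    # the default argument binds the current table at definition time,
    # so each map keeps its own table instead of the last one
    rdd = rdd.map(lambda x, t=table: func(x, t))
    if (i + 1) % 5 == 0:
        # periodically truncate the lineage so later jobs don't
        # replay the whole chain of maps from the beginning
        rdd.cache()
        rdd.checkpoint()
        rdd.count()  # an action to force the checkpoint to materialize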

Thanks!

Rok


