I was having trouble with memory exceptions when broadcasting a large lookup table, so I've resorted to processing it iteratively -- but how can I modify an RDD iteratively?
I'm trying something like this:

    rdd = sc.parallelize(...)
    lookup_tables = {...}
    for lookup_table in lookup_tables:
        rdd = rdd.map(lambda x: func(x, lookup_table))

If I leave it as is, then only the last lookup_table is applied, instead of all the maps being strung together. However, if I add a .cache() after the .map, it seems to work fine. A second problem is that the runtime roughly doubles at each iteration, so this clearly doesn't seem to be the right way to do it.

What is the preferred way of doing such repeated modifications to an RDD, and how can the accumulation of overhead be minimized?

Thanks!

Rok
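P.S. For concreteness, here is a minimal self-contained sketch of the first problem. func, the table contents, and the offsets are made-up placeholders, and I use a list of tables rather than a dict to keep the example short:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "iterative-map-repro")

    def func(x, table):
        # placeholder transformation: shift x by the table's offset
        return x + table["offset"]

    rdd = sc.parallelize(range(5))
    lookup_tables = [{"offset": 1}, {"offset": 10}, {"offset": 100}]

    for lookup_table in lookup_tables:
        # each lambda closes over the loop *variable*, not its current
        # value, so by the time the job actually runs, every map sees
        # the last table
        rdd = rdd.map(lambda x: func(x, lookup_table))

    # chaining all three tables would give [111, 112, 113, 114, 115],
    # but this prints [300, 301, 302, 303, 304]: all maps used offset=100
    print(rdd.collect())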