If you use mapPartitions to iterate over the lookup_tables, does that improve performance?
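Something like this rough sketch is what I have in mind -- assuming each lookup_table fits in executor memory and func(x, lookup_table) is the per-element transform from your example (names reused from your snippet, untested):

def apply_tables(partition):
    # lookup_tables is captured in the closure and shipped with the task,
    # so all tables are applied in a single pass over each partition
    for x in partition:
        for lookup_table in lookup_tables:
            x = func(x, lookup_table)
        yield x

result = rdd.mapPartitions(apply_tables)

That way you chain one transformation instead of one map per table.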
This link is to the Spark 1.1 docs because both the latest and the 1.2 Python docs give me a 404:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions

On Wed Feb 11 2015 at 1:48:42 PM rok <rokros...@gmail.com> wrote:

> I was having trouble with memory exceptions when broadcasting a large
> lookup table, so I've resorted to processing it iteratively -- but how can
> I modify an RDD iteratively?
>
> I'm trying something like:
>
> rdd = sc.parallelize(...)
> lookup_tables = {...}
>
> for lookup_table in lookup_tables:
>     rdd = rdd.map(lambda x: func(x, lookup_table))
>
> If I leave it as is, then only the last "lookup_table" is applied instead
> of stringing together all the maps. However, if I add a .cache() to the
> .map then it seems to work fine.
>
> A second problem is that the runtime roughly doubles at each iteration,
> so this clearly doesn't seem to be the way to do it. What is the preferred
> way of doing such repeated modifications to an RDD, and how can the
> accumulation of overhead be minimized?
>
> Thanks!
>
> Rok
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-tp21606.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
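As an aside, I believe the reason only the last lookup_table gets applied is that the lambda closes over the loop variable rather than its value, and the maps only run lazily after the loop has finished, so every map sees the final table. Binding the current value per iteration avoids that; a sketch, again reusing your names:

for lookup_table in lookup_tables:
    # the default argument freezes the current table for this particular map
    rdd = rdd.map(lambda x, table=lookup_table: func(x, table))

The per-iteration slowdown is likely the growing chain of maps (each one serializing its own lookup_table into the closure); collapsing the loop into a single map or mapPartitions pass, as in the sketch above, keeps the lineage flat.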