If you use mapPartitions to iterate over the lookup_tables, does that
improve performance?
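
Something along these lines is what I had in mind -- just a rough sketch,
assuming func(record, table) is the same function as in your snippet below,
iterating lookup_tables the same way your loop does, and assuming the tables
are small enough to ship with the tasks:

def apply_tables(partition, tables):
    # apply every lookup table to each record of this partition;
    # tables is captured once here instead of once per chained map()
    for record in partition:
        for lookup_table in tables:
            record = func(record, lookup_table)
        yield record

result = rdd.mapPartitions(lambda part: apply_tables(part, lookup_tables))

That collapses the whole chain of maps into a single pass over each
partition -- whether it also helps on the memory side I'm not sure.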

This link is to the Spark 1.1 docs because both the latest and the 1.2
Python API pages give me a 404:
http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
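
On why only the last "lookup_table" gets applied without the cache(): I
think the lambda in the loop closes over the loop variable, and by the time
the map functions are actually serialized and run the loop has already
finished, so every stage sees the final value (the .cache() presumably
helps because it forces the function to be pickled right away). The usual
Python workaround is to bind the current value as a default argument,
roughly (same names as your snippet):

for lookup_table in lookup_tables:
    rdd = rdd.map(lambda x, lt=lookup_table: func(x, lt))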

On Wed Feb 11 2015 at 1:48:42 PM rok <rokros...@gmail.com> wrote:

> I was having trouble with memory exceptions when broadcasting a large
> lookup table, so I've resorted to processing it iteratively -- but how
> can I modify an RDD iteratively?
>
> I'm trying something like:
>
> rdd = sc.parallelize(...)
> lookup_tables = {...}
>
> for lookup_table in lookup_tables:
>     rdd = rdd.map(lambda x: func(x, lookup_table))
>
> If I leave it as is, then only the last "lookup_table" is applied
> instead of stringing together all the maps. However, if I add a
> .cache() to the .map then it seems to work fine.
>
> A second problem is that the runtime of each iteration roughly doubles
> relative to the previous one, so this clearly doesn't seem to be the way
> to do it. What is the preferred way of doing such repeated modifications
> to an RDD, and how can the accumulation of overhead be minimized?
>
> Thanks!
>
> Rok
>
