You could try using an external key-value store (like HBase or Redis) and
perform lookups/updates inside your mappers. You would need to create the
connection within a mapPartitions code block, so that one connection serves
an entire partition rather than paying setup/teardown overhead per record.
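
Something along these lines, for example (untested; the names are made up,
and it assumes a Redis instance reachable from the executors, the Jedis
client, and entries keyed by a string field):

    entries.mapPartitions { iter =>
      // one connection per partition, reused for every lookup in it
      val jedis = new redis.clients.jedis.Jedis("redis-host", 6379)
      val results = iter.map { entry =>
        (entry, jedis.get(entry.key)) // look each entry up in the external store
      }.toList                        // materialize before closing the connection
      jedis.close()
      results.iterator
    }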

I haven't done this myself though, so I'm just throwing the idea out there.

On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff <j...@atware.co.jp> wrote:

> Hi,
>
> I am working on a Spark application that uses a large (~3 GB)
> broadcast variable as a lookup table. The application refines the data in
> this lookup table iteratively, so this large variable is re-broadcast many
> times during the lifetime of the application process.
>
> From what I have observed, perhaps 60% of the execution time is spent
> waiting for the variable to broadcast in each iteration. My reading of a
> Spark performance article [1] suggests that the time spent broadcasting will
> increase with the number of nodes I add.
>
> My question for the group - what would you suggest as an alternative to
> broadcasting a large variable like this?
>
> One approach I have considered is segmenting my RDD and pairing a copy of
> the lookup table with every X values to process. So, for example, if I have
> a list of 1 million entries to process (e.g., RDD[Entry]), I could split
> this into segments of 100K entries, each with a copy of the lookup table,
> and make that an RDD[(Lookup, Array[Entry])], as sketched below.
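>
> A rough sketch of what I mean (untested; Lookup, Entry, and lookupTable
> are placeholder names for my own types and driver-side table):
>
>     val segmented: RDD[(Lookup, Array[Entry])] = entries
>       .zipWithIndex()
>       .map { case (entry, i) => (i / 100000, entry) } // 100K-entry buckets
>       .groupByKey()
>       .mapValues(_.toArray)
>       .map { case (_, seg) => (lookupTable, seg) } // table rides along in the task closure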
>
> Another solution I am looking at is making the lookup table an RDD
> instead of a broadcast variable. Perhaps I could use an IndexedRDD [2] to
> improve performance. One issue with this approach is that I would have to
> rewrite my application code to use two RDDs, so that I do not reference the
> lookup RDD from within the closure of another RDD.
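>
> In code I imagine something like this (also untested; lookupKey is a
> hypothetical key accessor on Entry, and lookupPairs is the table as a
> driver-side Seq[(String, String)]):
>
>     val lookupRdd = sc.parallelize(lookupPairs) // the table as its own RDD
>     val joined = entries
>       .keyBy(_.lookupKey) // RDD[(String, Entry)]
>       .join(lookupRdd)    // RDD[(String, (Entry, String))]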
>
> Any other recommendations?
>
> Jeff
>
>
> [1] http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf
>
> [2] https://github.com/amplab/spark-indexedrdd
>
