You could try using an external key-value store (like HBase or Redis) and perform lookups/updates inside your mappers (you'd want to create the connection within a mapPartitions block to avoid per-record connection setup/teardown overhead)?
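Something like this, roughly (an untested sketch; it assumes the Jedis client for Redis, and "redis-host", entry.key, and refine are placeholders standing in for your actual setup):

    import redis.clients.jedis.Jedis

    val refined = entries.mapPartitions { iter =>
      // One connection per partition, instead of one per record
      val jedis = new Jedis("redis-host", 6379)
      // Materialize the partition before closing the connection; the
      // iterator is lazy and would otherwise run against a closed client
      val result = iter.map { entry =>
        val lookupValue = jedis.get(entry.key) // lookup instead of broadcast
        refine(entry, lookupValue)
      }.toList
      jedis.close()
      result.iterator
    }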
I haven't done this myself though, so I'm just throwing the idea out there.

On Fri, Aug 28, 2015 at 3:39 AM Hemminger Jeff <j...@atware.co.jp> wrote:

> Hi,
>
> I am working on a Spark application that uses a large (~3G) broadcast
> variable as a lookup table. The application refines the data in this
> lookup table in an iterative manner, so this large variable is broadcast
> many times during the lifetime of the application process.
>
> From what I have observed, perhaps 60% of the execution time is spent
> waiting for the variable to broadcast in each iteration. My reading of a
> Spark performance article [1] suggests that the time spent broadcasting
> will increase with the number of nodes I add.
>
> My question for the group: what would you suggest as an alternative to
> broadcasting a large variable like this?
>
> One approach I have considered is segmenting my RDD and adding a copy of
> the lookup table for each X number of values to process. For example, if
> I have a list of 1 million entries to process (e.g., RDD[Entry]), I could
> split this into segments of 100K entries, each with a copy of the lookup
> table, and make that an RDD[(Lookup, Array[Entry])].
>
> Another solution I am looking at is making the lookup table an RDD
> instead of a broadcast variable. Perhaps I could use an IndexedRDD [2] to
> improve performance. One issue with this approach is that I would have to
> rewrite my application code to use two RDDs so that I do not reference
> the lookup RDD from within the closure of another RDD.
>
> Any other recommendations?
>
> Jeff
>
> [1] http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf
>
> [2] https://github.com/amplab/spark-indexedrdd
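P.S. If you do go the segmenting route, I imagine it would look roughly
like this (untested; Lookup, Entry, lookup, and entries are placeholders
from your description). One caveat: the table is still captured in the
task closure here, so it gets shipped to the executors once per task in
the job that builds this RDD:

    import org.apache.spark.rdd.RDD

    val segmentSize = 100000
    val segmented: RDD[(Lookup, Array[Entry])] =
      entries
        .zipWithIndex()                           // stable index per entry
        .map { case (entry, idx) => (idx / segmentSize, entry) } // segment id
        .groupByKey()                             // collect each segment
        .map { case (_, group) => (lookup, group.toArray) } // attach table copy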