Mayur, To be a little clearer, for creating the Bloom Filters, I don't think broadcast variables are the way to go, though definitely that would work for using the Bloom Filters to filter data.
The reason why is that the creation needs to happen in a single thread. Otherwise, some type of locking/distributed locking is needed on the individual Bloom Filter itself, with performance impact. Agreed? -Suren On Thu, Mar 20, 2014 at 3:40 PM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > Mayur, > > Thanks. This step is for creating the Bloom Filter, not using it to filter > data, actually. But your answer still stands. > > Partitioning by key, having the bloom filters as a broadcast variable and > then doing mappartition makes sense. > > Are there performance implications for this approach, such as with using > the broadcast variable, versus the approach we used, in which the Bloom > Filter (again, for creating it) is only referenced by the single map > application? > > -Suren > > > > > > On Thu, Mar 20, 2014 at 3:20 PM, Mayur Rustagi <mayur.rust...@gmail.com>wrote: > >> Why are you trying to reducebyKey? Are you looking to work on the data >> sequentially. >> If I understand correctly you are looking to filter your data using the >> bloom filter & each bloom filter is tied to which key is instantiating it. >> Following are some of the options >> *partiition* your data by key & use mappartition operator to run function >> on partition independently. The same function will be applied to each >> partition. >> If your bloomfilter is large then you can bundle all of them in as a >> broadcast variable & use it to apply the transformation on your data using >> a simple map operation, basically you are looking up the right bloom filter >> on each key & applying the filter on it, again here if unserializing bloom >> filter is time consuming then you can partition the data on key & then use >> the broadcast variable to look up the bloom filter for each key & apply >> filter on all data in serial. >> >> Regards >> Mayur >> >> Mayur Rustagi >> Ph: +1 (760) 203 3257 >> http://www.sigmoidanalytics.com >> @mayur_rustagi <https://twitter.com/mayur_rustagi> >> >> >> >> On Thu, Mar 20, 2014 at 1:55 PM, Surendranauth Hiraman < >> suren.hira...@velos.io> wrote: >> >>> We ended up going with: >>> >>> map() - set the group_id as the key in a Tuple >>> reduceByKey() - end up with (K,Seq[V]) >>> map() - create the bloom filter and loop through the Seq and persist the >>> Bloom filter >>> >>> This seems to be fine. >>> >>> I guess Spark cannot optimize the reduceByKey and map steps to occur >>> together since the fact that we are looping through the Seq is out of >>> Spark's control. >>> >>> -Suren >>> >>> >>> >>> >>> On Thu, Mar 20, 2014 at 9:48 AM, Surendranauth Hiraman < >>> suren.hira...@velos.io> wrote: >>> >>>> Hi, >>>> >>>> My team is trying to replicate an existing Map/Reduce process in Spark. >>>> >>>> Basically, we are creating Bloom Filters for quick set membership tests >>>> within our processing pipeline. >>>> >>>> We have a single column (call it group_id) that we use to partition >>>> into sets. >>>> >>>> As you would expect, in the map phase, we emit the group_id as the key >>>> and in the reduce phase, we instantiate the Bloom Filter for a given key in >>>> the setup() method and persist that Bloom Filter in the cleanup() method. >>>> >>>> In Spark, we can do something similar with map() and reduceByKey() but >>>> we have the following questions. >>>> >>>> >>>> 1. Accessing the reduce key >>>> In reduceByKey(), how do we get access to the specific key within the >>>> reduce function? >>>> >>>> >>>> 2. Equivalent of setup/cleanup >>>> Where should we instantiate and persist each Bloom Filter by key? In >>>> the driver and then pass in the references to the reduce function? But if >>>> so, how does the reduce function know which set's Bloom Filter it should be >>>> writing to (question 1 above)? >>>> >>>> It seems if we use groupByKey and then reduceByKey, that gives us >>>> access to all of the values at one go. I assume there, Spark will manage if >>>> those values all don't fit in memory in one go. >>>> >>>> >>>> >>>> SUREN HIRAMAN, VP TECHNOLOGY >>>> Velos >>>> Accelerating Machine Learning >>>> >>>> 440 NINTH AVENUE, 11TH FLOOR >>>> NEW YORK, NY 10001 >>>> O: (917) 525-2466 ext. 105 >>>> F: 646.349.4063 >>>> E: suren.hiraman@v <suren.hira...@sociocast.com>elos.io >>>> W: www.velos.io >>>> >>>> >>>> >>> >>> >>> -- >>> >>> SUREN HIRAMAN, VP TECHNOLOGY >>> Velos >>> Accelerating Machine Learning >>> >>> 440 NINTH AVENUE, 11TH FLOOR >>> NEW YORK, NY 10001 >>> O: (917) 525-2466 ext. 105 >>> F: 646.349.4063 >>> E: suren.hiraman@v <suren.hira...@sociocast.com>elos.io >>> W: www.velos.io >>> >>> >> > > > -- > > SUREN HIRAMAN, VP TECHNOLOGY > Velos > Accelerating Machine Learning > > 440 NINTH AVENUE, 11TH FLOOR > NEW YORK, NY 10001 > O: (917) 525-2466 ext. 105 > F: 646.349.4063 > E: suren.hiraman@v <suren.hira...@sociocast.com>elos.io > W: www.velos.io > > -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v <suren.hira...@sociocast.com>elos.io W: www.velos.io