Hi Sean, This is what I intend to do:
"are you saying that you know a key should be filtered based on its value partway through the merge?" I should use combineByKey... Thanks. Deb On Thu, Feb 19, 2015 at 6:31 AM, Sean Owen <so...@cloudera.com> wrote: > You have the keys before and after reduceByKey. You want to do > something based on the key "within" reduceByKey? it just calls > combineByKey, so you can use that method for lower-level control over > the merging. > > Whether it's possible depends I suppose on what you mean to filter on. > If it's just a property of the key, can't you just filter before > reduceByKey? If it's a property of the key's value, don't you need to > wait for the reduction to finish? or are you saying that you know a > key should be filtered based on its value partway through the merge? > > I suppose you can use combineByKey to create a mergeValue function > that changes an input type A into some other Option[B]; you output > None if your criteria is reached, and your combine function returns > None if either argument is None? it doesn't save 100% of the work but > it may mean you only shuffle (key,None) for some keys if the map-side > combine already worked out that the key would be filtered. > > And then after, run a flatMap or something to make Option[B] into B. > > On Thu, Feb 19, 2015 at 2:21 PM, Debasish Das <debasish.da...@gmail.com> > wrote: > > Hi, > > > > Before I send out the keys for network shuffle, in reduceByKey after map > + > > combine are done, I would like to filter the keys based on some > threshold... > > > > Is there a way to get the key, value after map+combine stages so that I > can > > run a filter on the keys ? > > > > Thanks. > > Deb >