For what it's worth, I got it to work with a Cartesian product. It's very inefficient, but it worked out all right for me. The trick was to flatMap the result (substep 4) after the Cartesian product so that I could do a reduceByKey and find the commonalities. Once a pass is done, I check whether any value set shares a value with any other value set. If so, I run the whole pass again.
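To make the idea concrete, here is a rough sketch of one pass and the repeat-until-stable loop in plain Python rather than Spark. It is not my actual job code: the names merge_pass, values_overlap, merge_until_stable and sample are made up for illustration, and ordinary sets and dicts stand in for the RDD operations (cartesian, filter, map, distinct, flatMap, reduceByKey).

```python
from functools import reduce
from itertools import product


def merge_pass(groups):
    """One pass over (keys, values) pairs, both frozensets (hypothetical helper)."""
    # Substeps 1-3: cartesian product, keep pairs whose value sets overlap,
    # union each pair, and dedupe via the set comprehension.
    # Note: a group with an empty value set never overlaps anything and is dropped.
    merged = {
        (k1 | k2, v1 | v2)
        for (k1, v1), (k2, v2) in product(groups, repeat=2)
        if v1 & v2
    }
    # Substep 4: "flatMap", emit one record per individual key.
    by_key = {}
    for keys, values in merged:
        for key in keys:
            by_key.setdefault(key, []).append((keys, values))
    # Substeps 5-7: "reduceByKey" by unioning, then drop the key and dedupe.
    return {
        reduce(lambda a, b: (a[0] | b[0], a[1] | b[1]), pairs)
        for pairs in by_key.values()
    }


def values_overlap(groups):
    """True if any two distinct groups still share a value."""
    groups = list(groups)
    return any(
        groups[i][1] & groups[j][1]
        for i in range(len(groups))
        for j in range(i + 1, len(groups))
    )


def merge_until_stable(groups):
    """Repeat merge_pass while any two groups share a value."""
    groups = set(groups)
    while values_overlap(groups):
        groups = merge_pass(groups)
    return groups


# Same input as the trace in this post.
sample = {
    (frozenset(["A"]), frozenset([1, 2])),
    (frozenset(["B"]), frozenset([2, 3])),
    (frozenset(["C"]), frozenset([3, 4])),
    (frozenset(["G"]), frozenset([10, 20])),
    (frozenset(["Z"]), frozenset([1000, 20])),
    (frozenset(["S"]), frozenset([1, 2, 100])),
}

if __name__ == "__main__":
    for keys, values in merge_until_stable(sample):
        print(sorted(keys), sorted(values))
```

On this input, a single merge_pass gives the same four groups as substep 7 of the trace. Two of them still share values, so a second pass is needed, which leaves two groups: A, B, C, S with values 1, 2, 3, 4, 100 and G, Z with values 10, 20, 1000. The product over all pairs is quadratic in the number of groups, which is the inefficiency mentioned above.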
The process is something like this:

SUBSTEP 1: CARTESIAN + FILTER( non inclusive set : False )
SET: ((frozenset(['A']), frozenset([1, 2])), (frozenset(['A']), frozenset([1, 2])))
SET: ((frozenset(['A']), frozenset([1, 2])), (frozenset(['B']), frozenset([2, 3])))
SET: ((frozenset(['A']), frozenset([1, 2])), (frozenset(['S']), frozenset([1, 2, 100])))
SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['A']), frozenset([1, 2])))
SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['B']), frozenset([2, 3])))
SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['C']), frozenset([3, 4])))
SET: ((frozenset(['B']), frozenset([2, 3])), (frozenset(['S']), frozenset([1, 2, 100])))
SET: ((frozenset(['C']), frozenset([3, 4])), (frozenset(['B']), frozenset([2, 3])))
SET: ((frozenset(['C']), frozenset([3, 4])), (frozenset(['C']), frozenset([3, 4])))
SET: ((frozenset(['G']), frozenset([10, 20])), (frozenset(['G']), frozenset([10, 20])))
SET: ((frozenset(['G']), frozenset([10, 20])), (frozenset(['Z']), frozenset([1000, 20])))
SET: ((frozenset(['Z']), frozenset([1000, 20])), (frozenset(['G']), frozenset([10, 20])))
SET: ((frozenset(['Z']), frozenset([1000, 20])), (frozenset(['Z']), frozenset([1000, 20])))
SET: ((frozenset(['S']), frozenset([1, 2, 100])), (frozenset(['A']), frozenset([1, 2])))
SET: ((frozenset(['S']), frozenset([1, 2, 100])), (frozenset(['B']), frozenset([2, 3])))
SET: ((frozenset(['S']), frozenset([1, 2, 100])), (frozenset(['S']), frozenset([1, 2, 100])))

SUBSTEP 2: MERGE
SET: (frozenset(['A']), frozenset([1, 2]))
SET: (frozenset(['A', 'B']), frozenset([1, 2, 3]))
SET: (frozenset(['A', 'S']), frozenset([1, 2, 100]))
SET: (frozenset(['A', 'B']), frozenset([1, 2, 3]))
SET: (frozenset(['B']), frozenset([2, 3]))
SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
SET: (frozenset(['S', 'B']), frozenset([1, 2, 3, 100]))
SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
SET: (frozenset(['C']), frozenset([3, 4]))
SET: (frozenset(['G']), frozenset([10, 20]))
SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
SET: (frozenset(['Z']), frozenset([1000, 20]))
SET: (frozenset(['A', 'S']), frozenset([1, 2, 100]))
SET: (frozenset(['S', 'B']), frozenset([1, 2, 3, 100]))
SET: (frozenset(['S']), frozenset([1, 2, 100]))

SUBSTEP 3: DISTINCT
SET: (frozenset(['A']), frozenset([1, 2]))
SET: (frozenset(['C']), frozenset([3, 4]))
SET: (frozenset(['S']), frozenset([1, 2, 100]))
SET: (frozenset(['A', 'S']), frozenset([1, 2, 100]))
SET: (frozenset(['A', 'B']), frozenset([1, 2, 3]))
SET: (frozenset(['B']), frozenset([2, 3]))
SET: (frozenset(['S', 'B']), frozenset([1, 2, 3, 100]))
SET: (frozenset(['G']), frozenset([10, 20]))
SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
SET: (frozenset(['Z']), frozenset([1000, 20]))
SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))

SUBSTEP 4: flatmap
SET: ('A', (frozenset(['A']), frozenset([1, 2])))
SET: ('C', (frozenset(['C']), frozenset([3, 4])))
SET: ('S', (frozenset(['S']), frozenset([1, 2, 100])))
SET: ('A', (frozenset(['A', 'S']), frozenset([1, 2, 100])))
SET: ('S', (frozenset(['A', 'S']), frozenset([1, 2, 100])))
SET: ('A', (frozenset(['A', 'B']), frozenset([1, 2, 3])))
SET: ('B', (frozenset(['A', 'B']), frozenset([1, 2, 3])))
SET: ('B', (frozenset(['B']), frozenset([2, 3])))
SET: ('S', (frozenset(['S', 'B']), frozenset([1, 2, 3, 100])))
SET: ('B', (frozenset(['S', 'B']), frozenset([1, 2, 3, 100])))
SET: ('G', (frozenset(['G']), frozenset([10, 20])))
SET: ('C', (frozenset(['C', 'B']), frozenset([2, 3, 4])))
SET: ('B', (frozenset(['C', 'B']), frozenset([2, 3, 4])))
SET: ('Z', (frozenset(['Z']), frozenset([1000, 20])))
SET: ('Z', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))
SET: ('G', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))

SUBSTEP 5: reduceByKey
SET: ('A', (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100])))
SET: ('C', (frozenset(['C', 'B']), frozenset([2, 3, 4])))
SET: ('B', (frozenset(['A', 'S', 'B', 'C']), frozenset([1, 2, 3, 100, 4])))
SET: ('G', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))
SET: ('S', (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100])))
SET: ('Z', (frozenset(['Z', 'G']), frozenset([1000, 10, 20])))

SUBSTEP 6: map
SET: (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100]))
SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))
SET: (frozenset(['A', 'S', 'B', 'C']), frozenset([1, 2, 3, 100, 4]))
SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
SET: (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100]))
SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))

SUBSTEP 7: distinct
SET: (frozenset(['A', 'S', 'B']), frozenset([1, 2, 3, 100]))
SET: (frozenset(['A', 'S', 'B', 'C']), frozenset([1, 2, 3, 100, 4]))
SET: (frozenset(['Z', 'G']), frozenset([1000, 10, 20]))
SET: (frozenset(['C', 'B']), frozenset([2, 3, 4]))

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-in-merging-a-RDD-agaisnt-itself-using-the-V-of-a-K-V-tp10530p10560.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.