Hello, I'm implementing MinHash for recommendation on Flink. I'm almost done, but I need an efficient way to merge sets of similar keys together (and later join these merged sets of keys with more data).
The actual data structure has the form DataSet[(Int,Set[Int])], where the left element of the tuple is an ID for the right element, which is a set of keys. I want to merge these sets together only if they share at least one element. I'm fairly sure I have studied an efficient solution to this problem for a local (single-machine) setting, but I don't really know how to approach it in a distributed environment. Any suggestions?
Thanks,
Simone
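
For reference, the single-machine version of this problem can be sketched with a union-find (disjoint-set) structure over the keys. This is only a minimal illustration and not Flink code. The names `MergeOverlappingSets` and `merge` are hypothetical, and the sketch assumes that merging is transitive (if A shares a key with B and B shares a key with C, all three end up in one set), which the description above does not state explicitly.

```scala
import scala.collection.mutable

// Hypothetical helper, local collections only (no Flink API involved).
object MergeOverlappingSets {

  // Union-find over keys, with path compression.
  private final class UnionFind {
    private val parent = mutable.Map.empty[Int, Int]

    def find(x: Int): Int = {
      // A key seen for the first time becomes its own root.
      val p = parent.getOrElseUpdate(x, x)
      if (p == x) x
      else {
        val root = find(p)
        parent(x) = root // path compression
        root
      }
    }

    def union(a: Int, b: Int): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) parent(ra) = rb
    }
  }

  /** Merges all sets that share at least one key, directly or transitively
    * (the transitive part is an assumption of this sketch). The IDs on the
    * left of each tuple are ignored here. */
  def merge(sets: Seq[(Int, Set[Int])]): Set[Set[Int]] = {
    val uf = new UnionFind
    for ((_, keys) <- sets if keys.nonEmpty) {
      // Link every key of a set to the set's first key.
      val first = keys.head
      keys.foreach(k => uf.union(first, k))
    }
    // Keys with the same root belong to the same merged set.
    val allKeys = sets.flatMap(_._2).toSet
    allKeys.groupBy(k => uf.find(k)).values.toSet
  }
}
```

With the input `Seq(1 -> Set(1, 2), 2 -> Set(2, 3), 3 -> Set(4))`, the first two sets share key 2 and would be merged, while `Set(4)` stays on its own. The open question is how to get the same grouping when the sets are spread across a distributed DataSet.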