Hello Sparkans, I want to merge the following clusters (sets of IDs) into one whenever they share at least one ID.
For example, these three sets:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
uuid_3_2,uuid_3_5,uuid_3_6
uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9

should be merged into a single set:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9

because they are linked through 'uuid_3_2' and 'uuid_3_5'.

How can I do this in Spark? One solution I can think of is to use GraphX: keep adding edges between pairs of IDs, and GraphX will take care of forming the clusters. But these are UUIDs, and GraphX only supports Long for VertexId. Also, my input data is huge (50 million unique IDs), so maintaining a collision-free UUID <-> Long map will be tough.

Any suggestions?

Thanks!
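P.S. To pin down the expected output, here is a minimal single-machine sketch in plain Python (not Spark) that merges sets sharing an ID using union-find. The name merge_clusters is made up for illustration, and it assumes everything fits in memory, so it is only meant to show the result I want, not how to get it at 50 million IDs.

```python
def merge_clusters(clusters):
    """Merge lists of IDs that share at least one ID (hypothetical helper)."""
    parent = {}  # maps each ID to its parent in the union-find forest

    def find(x):
        # Register unseen IDs as their own root.
        parent.setdefault(x, x)
        # Walk up to the root of x's tree.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            nxt = parent[x]
            parent[x] = root
            x = nxt
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cluster in clusters:
        if not cluster:
            continue
        first = cluster[0]
        find(first)  # make sure single-element clusters are registered
        for other in cluster[1:]:
            union(first, other)

    # Group every ID under its root to get the merged clusters.
    merged = {}
    for node in list(parent):
        merged.setdefault(find(node), set()).add(node)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in merged.values())


clusters = [
    ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"],
    ["uuid_3_2", "uuid_3_5", "uuid_3_6"],
    ["uuid_3_5", "uuid_3_7", "uuid_3_8", "uuid_3_9"],
]
print(merge_clusters(clusters))
```

In other words, the output I am after is the set of connected components of the graph whose edges link IDs that appear in the same input set.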