Hello Sparkans,

I want to merge the following clusters (sets of IDs) into one whenever they
share IDs.

For example:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
uuid_3_2,uuid_3_5,uuid_3_6
uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9

into a single cluster:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9

because they're linked through 'uuid_3_2' and 'uuid_3_5'.
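
To make the intended result concrete, here is a minimal single-machine sketch of the merge in plain Python, using a union-find structure. It is not a Spark solution, and the function name `merge_clusters` is only for illustration.

```python
def merge_clusters(clusters):
    """Merge clusters that share at least one ID, directly or transitively."""
    parent = {}  # maps each ID to its parent in the union-find forest

    def find(x):
        # Register x on first sight, then walk up to its root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cluster in clusters:
        if not cluster:
            continue
        first = cluster[0]
        find(first)  # make sure single-ID clusters are registered too
        for other in cluster[1:]:
            union(first, other)

    # Group every ID under its root, then return sorted groups.
    merged = {}
    for uuid in parent:
        merged.setdefault(find(uuid), set()).add(uuid)
    return sorted(sorted(group) for group in merged.values())


clusters = [
    ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"],
    ["uuid_3_2", "uuid_3_5", "uuid_3_6"],
    ["uuid_3_5", "uuid_3_7", "uuid_3_8", "uuid_3_9"],
]
result = merge_clusters(clusters)  # one merged cluster with all nine IDs
```

This is the same computation as finding connected components in a graph whose vertices are the IDs.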

How can I do this in Spark?

One solution I can think of is to use GraphX: keep adding links between
pairs of IDs and let GraphX take care of forming the clusters. But these are
UUIDs, and GraphX only supports Long for VertexId. Also, my input data is
huge (50 million unique IDs), so maintaining a collision-free UUID <-> Long
map will be tough.
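
For reference, the kind of mapping I mean looks like the sketch below, again in plain Python on a single machine. The helper names `build_id_maps` and `clusters_to_edges` are hypothetical. In Spark, `RDD.zipWithIndex` or `RDD.zipWithUniqueId` might play the role of `enumerate` here; whether that holds up at 50 million IDs is untested.

```python
def build_id_maps(clusters):
    """Assign each distinct UUID a unique integer, plus the reverse map."""
    distinct = sorted({uuid for cluster in clusters for uuid in cluster})
    # enumerate() hands out 0, 1, 2, ... so no two UUIDs can collide.
    to_long = {uuid: index for index, uuid in enumerate(distinct)}
    to_uuid = {index: uuid for uuid, index in to_long.items()}
    return to_long, to_uuid


def clusters_to_edges(clusters, to_long):
    """Link consecutive IDs in each cluster, expressed as integer pairs."""
    edges = []
    for cluster in clusters:
        ids = [to_long[uuid] for uuid in cluster]
        # A chain of n - 1 edges is enough to connect n IDs.
        edges.extend(zip(ids, ids[1:]))
    return edges


clusters = [
    ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"],
    ["uuid_3_2", "uuid_3_5", "uuid_3_6"],
    ["uuid_3_5", "uuid_3_7", "uuid_3_8", "uuid_3_9"],
]
to_long, to_uuid = build_id_maps(clusters)
edges = clusters_to_edges(clusters, to_long)
```

The integer edges are what a graph library would consume, and `to_uuid` translates the resulting component members back to UUIDs.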

Any suggestions?

Thanks!
