How to merge fragmented IDs into one cluster if one/more IDs are shared

Tushar Sudake Thu, 05 Oct 2017 12:41:29 -0700

Hello Sparkans,

I want to merge the following clusters (sets of IDs) into one whenever they
share one or more IDs.


For example:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
uuid_3_2,uuid_3_5,uuid_3_6
uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9

into single:

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9

because they're linked through 'uuid_3_2' and 'uuid_3_5'.
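
Put differently, the desired result is the set of connected components of a graph in which IDs appearing on the same line are linked. The following is only a minimal single-machine sketch in plain Python (not Spark) to pin down the expected behaviour; the function name `merge_clusters` is hypothetical, and a union-find structure is just one assumed way to compute the components.

```python
from collections import defaultdict


def merge_clusters(clusters):
    """Merge ID lists that share at least one ID (union-find sketch)."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root.
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for ids in clusters:
        if not ids:
            continue
        first = ids[0]
        find(first)  # make sure single-ID rows are registered too
        for other in ids[1:]:
            union(first, other)

    # Group every known ID under its root.
    merged = defaultdict(set)
    for x in list(parent):
        merged[find(x)].add(x)
    return sorted(sorted(group) for group in merged.values())


clusters = [
    ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"],
    ["uuid_3_2", "uuid_3_5", "uuid_3_6"],
    ["uuid_3_5", "uuid_3_7", "uuid_3_8", "uuid_3_9"],
]
print(merge_clusters(clusters))
```

For the three example rows this yields a single group containing all nine IDs; rows that share nothing with the others stay as separate groups.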

How can I do this in Spark?

One solution I can think of is to use GraphX: keep adding edges between
pairs of IDs and let GraphX take care of building the clusters. But these
are UUIDs, and GraphX only supports Long for VertexId. Also, my input data
is huge (50 million unique IDs), so maintaining a collision-free
UUID <-> Long map will be tough.
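
One way to think about the collision concern: if each distinct UUID is numbered by enumeration rather than hashed, the mapping cannot collide by construction. The sketch below shows that idea in plain Python on a small in-memory input; the helper names `build_id_maps` and `to_edges` are hypothetical, and this is an illustration of the idea, not a tested Spark job. In Spark, `zipWithUniqueId` on an RDD of distinct IDs is one existing call that assigns unique Longs in a similar spirit.

```python
def build_id_maps(clusters):
    """Assign each distinct UUID a unique integer by enumeration (no hashing)."""
    distinct_ids = sorted({uid for ids in clusters for uid in ids})
    to_long = {uid: i for i, uid in enumerate(distinct_ids)}
    to_uuid = {i: uid for uid, i in to_long.items()}
    return to_long, to_uuid


def to_edges(clusters, to_long):
    """Link the first ID of each row to every other ID in that row."""
    edges = []
    for ids in clusters:
        if not ids:
            continue
        head = to_long[ids[0]]
        edges.extend((head, to_long[other]) for other in ids[1:])
    return edges


rows = [
    ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"],
    ["uuid_3_2", "uuid_3_5", "uuid_3_6"],
    ["uuid_3_5", "uuid_3_7", "uuid_3_8", "uuid_3_9"],
]
to_long, to_uuid = build_id_maps(rows)
edges = to_edges(rows, to_long)
```

A row with n IDs contributes n - 1 edges, which is enough to keep the row connected; the integer pairs could then serve as graph edges, and `to_uuid` translates component members back to the original UUIDs.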

Any suggestions?

Thanks!
