Re: How to merge fragmented IDs into one cluster if one/more IDs are shared

孫澤恩 Thu, 05 Oct 2017 19:32:22 -0700

Hi there,

About GraphX: I think the graph processing parses your data into the form 
(VertexA) - [Edge1] - (VertexB). 
As we can see, the Graph class of GraphX contains edges and vertices.


So the first line of your data would be parsed into 

uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4 as vertices.
(uuid_3_1,uuid_3_2),(uuid_3_2,uuid_3_3),(uuid_3_3,uuid_3_4) as edges.
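
As a rough illustration of that parsing step, here is a minimal plain-Python 
sketch (no Spark involved). The function name row_to_graph is hypothetical, 
and chaining each ID to its successor is just one way to connect a row; any 
set of edges that links all IDs of the row would do.

```python
def row_to_graph(line):
    """Split one comma-separated row into its vertices and chain edges."""
    ids = [tok.strip() for tok in line.split(",") if tok.strip()]
    # Deduplicate while keeping first-seen order.
    vertices = list(dict.fromkeys(ids))
    # Link each ID to its successor; a chain is enough to connect the row.
    edges = list(zip(vertices, vertices[1:]))
    return vertices, edges


vertices, edges = row_to_graph("uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4")
print(vertices)
print(edges)
```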

This can be merged into a single result as you want, but I think there should 
be a better way than GraphX.
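
No specific alternative is named here. Purely as an assumption, one common 
non-GraphX technique for this kind of merging is union-find (disjoint sets). 
The sketch below is single-machine plain Python with a hypothetical function 
name, merge_clusters; it only illustrates the idea on the example rows and 
says nothing about how well it would scale to 50 M IDs.

```python
def merge_clusters(rows):
    """Merge rows of IDs that share at least one ID, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for row in rows:
        if not row:
            continue
        find(row[0])  # register single-ID rows as well
        for other in row[1:]:
            union(row[0], other)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


data = [
    "uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4",
    "uuid_3_2,uuid_3_5,uuid_3_6",
    "uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9",
]
merged = merge_clusters([line.split(",") for line in data])
print(merged)
```

On the three example rows this yields one cluster holding uuid_3_1 through 
uuid_3_9, because the rows are linked through uuid_3_2 and uuid_3_5.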

And if you want to use GraphX as the solution, there is a way that I have 
used to convert a UUID to a Long.
You could use a hash encode and decode function to convert a String to the 
Long type or convert it back.
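
A minimal sketch of such a conversion, in plain Python rather than Spark. The 
choice of SHA-256, the function name uuid_to_long and the reverse table 
long_to_uuid are all assumptions for illustration, not necessarily the 
functions meant above. Note that a hash is one-way, so "decoding" here means 
looking the Long up in a table that was kept on the side.

```python
import hashlib
import struct


def uuid_to_long(uuid_str):
    """Map a string ID to a signed 64-bit integer (the range of a JVM Long)."""
    digest = hashlib.sha256(uuid_str.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a big-endian signed 64-bit integer.
    return struct.unpack(">q", digest[:8])[0]


ids = ["uuid_3_1", "uuid_3_2", "uuid_3_3", "uuid_3_4"]

# Reverse table for converting a Long back to the original string.
long_to_uuid = {uuid_to_long(u): u for u in ids}

vid = uuid_to_long("uuid_3_2")
print(vid, long_to_uuid[vid])
```

A 64-bit hash cannot be guaranteed collision free. By the birthday bound, the 
chance of any collision among 50 M distinct IDs is roughly n^2 / 2^65, which 
is on the order of 10^-4, so it may still be worth checking for duplicates 
after hashing.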

Hope that helps.

Sean Sun


> On 6 Oct 2017, at 3:35 AM, Tushar Sudake <etusha...@gmail.com> wrote:
> 
> Hello Sparkans,
> 
> I want to merge following cluster / set of IDs into one if they have shared 
> IDs.
> 
> For example:
> 
> uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
> uuid_3_2,uuid_3_5,uuid_3_6
> uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9
> into single:
> 
> uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9
> because they're linked through 'uuid_3_2' and 'uuid_3_5'.
> 
> How can I do this in Spark?
> 
> One solution I can think of is to use Graphx. Keep adding links between two 
> IDs and Graphx will take care of creating clusters. But these are UUIDs and 
> Graphx only supports Long for VertexID. Also, my input data is huge (50 M 
> Unique IDs), so maintaining collision free map of UUID <-> Long will be tough.
> 
> Any suggestions?
> 
> Thanks!
