Hi there,

About GraphX: I think the graph approach is to parse your data into (VertexA) - [Edge1] - (VertexB). As we can see, the Graph class in GraphX contains edges and vertices.
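
To make that model concrete, here is a minimal sketch in plain Python (not GraphX code) of how one comma-separated row could be split into vertices and chained edges. Because GraphX's VertexId is a Long, the sketch also derives a signed 64-bit ID from each string by hashing. The helper names `to_long` and `row_to_graph` are hypothetical, and the choice of SHA-256 truncated to 8 bytes is only an assumption for illustration; any stable 64-bit hash would serve the same role.

```python
import hashlib

def to_long(uuid_str):
    """Map a string ID to a signed 64-bit integer (the range of a JVM Long).

    Assumption: the first 8 bytes of a SHA-256 digest are used as the ID.
    A hash is one-way, so the original string must be kept alongside it.
    """
    digest = hashlib.sha256(uuid_str.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

def row_to_graph(row):
    """Split one comma-separated row into vertices and chained edges."""
    ids = [s.strip() for s in row.split(",") if s.strip()]
    # Each vertex keeps its original string so the ID can be mapped back.
    vertices = [(to_long(u), u) for u in ids]
    # Link each ID to the next one in the row: (a,b), (b,c), (c,d), ...
    edges = [(to_long(a), to_long(b)) for a, b in zip(ids, ids[1:])]
    return vertices, edges

vertices, edges = row_to_graph("uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4")
lookup = dict(vertices)  # Long -> original string, used to "decode"
```

Chaining neighbours yields n-1 edges per row, which is enough to place every ID of that row in the same connected component. Collisions in a 64-bit hash are unlikely but still possible, so a real job may want to check for them.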
So the first line of your data would be parsed into uuid_3_1, uuid_3_2, uuid_3_3, uuid_3_4 as vertices and (uuid_3_1,uuid_3_2), (uuid_3_2,uuid_3_3), (uuid_3_3,uuid_3_4) as edges. You could get the single merged result you want this way, but I think there should be a better way than GraphX.

If you do want to use GraphX as the solution, here is the approach I used to convert a UUID to a Long: use a hash-based encode function to convert the string to a Long, and a matching decode step to convert it back (for example a stored Long-to-string mapping, since a hash alone is not reversible).

Hope that helps.

Sean Sun

> On 6 Oct 2017, at 3:35 AM, Tushar Sudake <etusha...@gmail.com> wrote:
>
> Hello Sparkans,
>
> I want to merge following cluster / set of IDs into one if they have shared
> IDs.
>
> For example:
>
> uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4
> uuid_3_2,uuid_3_5,uuid_3_6
> uuid_3_5,uuid_3_7,uuid_3_8,uuid_3_9
>
> into single:
>
> uuid_3_1,uuid_3_2,uuid_3_3,uuid_3_4,uuid_3_5,uuid_3_6,uuid_3_7,uuid_3_8,uuid_3_9
>
> because they're linked through 'uuid_3_2' and 'uuid_3_5'.
>
> How can I do this in Spark?
>
> One solution I can think of is to use Graphx. Keep adding links between two
> IDs and Graphx will take care of creating clusters. But these are UUIDs and
> Graphx only supports Long for VertexID. Also, my input data is huge (50 M
> Unique IDs), so maintaining collision free map of UUID <-> Long will be tough.
>
> Any suggestions?
>
> Thanks!