Hello all,

I'm using GraphX (1.1.0) to process RDF data. I want to build a graph out of the data from the Berlin SPARQL Benchmark (BSBM, <http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/>). The steps I use to load the data into a graph are:

1. Split the RDF triples.
2. Get all nodes (union the subjects and objects, then apply distinct -> NodesRDD).
3. Zip the nodes (NodesRDD) with "zipWithUniqueId" -> ZippedNodesRDD.
4. Join the subjects and objects (together with the predicate) against ZippedNodesRDD to get the corresponding node ids and build the edges.
5. Build the graph nodes out of ZippedNodesRDD and create the Java node attribute.
6. Build the GraphX graph.

My problem is that the nodes in the graph (graph.vertices) have different ids than the nodes in ZippedNodesRDD, which I use to build the edges. I don't know why, because I build the final nodes from the same RDD that I use for the joins, and this RDD is cached.

For example:

graph.vertices says:  ID: 35255, Attribute: bsbm-inst:dataFromVendor33/Offer62164/
ZippedNodesRDD says:  ID: 35254, Attribute: bsbm-inst:dataFromVendor33/Offer62164/

I have no idea why this happens. The joins themselves are correct; only the ids are wrong.

Thanks in advance
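
For reference, the six steps above can be sketched roughly as follows in Scala. This is a simplified sketch, not the actual job: the object and method names (`RdfGraphSketch`, `zipNodes`, `buildGraph`, `mismatches`), the three sample triples, and the use of a plain String as the node attribute (instead of the Java attribute object) are all assumptions made for illustration. The `mismatches` helper is one possible way to list the labels whose id in graph.vertices differs from the id in ZippedNodesRDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (join etc.) on Spark 1.x
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object RdfGraphSketch {
  type Triple = (String, String, String) // (subject, predicate, object)

  // Made-up sample data standing in for the already split BSBM triples (step 1).
  val sampleTriples: Seq[Triple] = Seq(
    ("ex:a", "ex:p", "ex:b"),
    ("ex:a", "ex:q", "ex:c"),
    ("ex:b", "ex:p", "ex:d"))

  // Steps 2 and 3: union subjects and objects, distinct, zipWithUniqueId, cache.
  def zipNodes(triples: RDD[Triple]): RDD[(String, Long)] =
    triples.flatMap { case (s, _, o) => Seq(s, o) }
      .distinct()
      .zipWithUniqueId()
      .cache()

  // Steps 4 to 6: resolve subject and object ids via joins, then build the graph.
  def buildGraph(triples: RDD[Triple],
                 zippedNodes: RDD[(String, Long)]): Graph[String, String] = {
    val edges: RDD[Edge[String]] = triples
      .map { case (s, p, o) => (s, (p, o)) }
      .join(zippedNodes) // (s, ((p, o), subjectId))
      .map { case (_, ((p, o), sId)) => (o, (sId, p)) }
      .join(zippedNodes) // (o, ((subjectId, p), objectId))
      .map { case (_, ((sId, p), oId)) => Edge(sId, oId, p) }

    // Assumption: the node label itself serves as the vertex attribute.
    val vertices: RDD[(VertexId, String)] =
      zippedNodes.map { case (label, id) => (id, label) }

    Graph(vertices, edges)
  }

  // Labels whose id in graph.vertices differs from the id in zippedNodes.
  def mismatches(graph: Graph[String, String],
                 zippedNodes: RDD[(String, Long)]): RDD[(String, (Long, Long))] =
    graph.vertices
      .map { case (id, label) => (label, id) }
      .join(zippedNodes)
      .filter { case (_, (graphId, zippedId)) => graphId != zippedId }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdf-graph-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val triples = sc.parallelize(sampleTriples)
      val zippedNodes = zipNodes(triples)
      val graph = buildGraph(triples, zippedNodes)
      // Print every label for which the two ids disagree.
      mismatches(graph, zippedNodes).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Running `mismatches` against the real ZippedNodesRDD and graph would show whether the offset seen for Offer62164 affects every node or only some of them.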