That’s a good point. Let me check that. Thanks.

Another thing I am struggling with: while vertices are being added to the graph (named inputGraph), I am not able to access it or query it. Currently, when I query the graph during the addition of vertices, the result comes back only after the addition is over. I have also tried creating and querying another variable, tempInputGraph, which stores the state of inputGraph and is updated whenever an addition finishes. But queries against it are also delayed by the background process. I have set the number of executors to 8 to match my 8-core system. Please suggest how I can keep this graph always available to the user even while a background process is running on it. Let me know whether this is possible at all since, as you said, GraphX is not really designed for real-time needs.
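One direction that may be worth testing for the delayed queries described above, offered as a sketch and not as a confirmed fix: within a single SparkContext, jobs are scheduled FIFO by default, so a query submitted while a large update job is running can wait behind it. Spark also has a FAIR scheduling mode that shares task slots between jobs submitted from different threads. The example assumes the update and the query run on separate threads; the pool names `updates` and `queries`, the object name, and the stand-in RDDs are hypothetical and are not taken from the original code.

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.{SparkConf, SparkContext}

object FairSchedulingSketch {
  /** Runs an update job and a query job concurrently; returns the query's result. */
  def run(): Long = {
    val conf = new SparkConf()
      .setMaster("local[8]")               // assumption: 8 local cores, as in the thread
      .setAppName("fair-scheduling-sketch")
      .set("spark.scheduler.mode", "FAIR") // the default mode is FIFO
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the cached graph snapshot that queries read from.
      val snapshot = sc.parallelize(1L to 100000L).cache()
      snapshot.count() // materialise the cache before the threads start

      val queryResult = new AtomicLong(-1L)

      // Hypothetical long-running update job, submitted from its own thread.
      val updater = new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", "updates")
          sc.parallelize(1L to 5000000L).map(_ * 2).count()
        }
      })

      // Hypothetical user query against the cached snapshot, in a separate pool.
      val reader = new Thread(new Runnable {
        def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", "queries")
          queryResult.set(snapshot.count())
        }
      })

      updater.start(); reader.start()
      updater.join(); reader.join()
      queryResult.get()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"vertices visible to the query: ${run()}")
}
```

Pools that are not declared in an allocation file get Spark's default pool settings, so this needs no extra configuration to try out. Whether it shortens query latency in practice depends on how much of the machine the update job needs, which the WebUI should show.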
If not GraphX, which other tool should I consider for real-time needs? To elaborate, I want a real-time system that stores data as it arrives and that I can query in real time. At present I am using GraphX. My data enters the system via Kafka and Spark Streaming and then updates a graph of, let’s say, orders. One copy is sent to HBase, where the data is persisted for later use. Now I want to query this graph to get various insights into the orders data. I was using GraphX because graphs are really helpful for analysing related/connected information, e.g. friends of friends.

I really appreciate your valuable help, Robin. Thank you in advance.

Udbhav.

From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: Thursday, February 25, 2016 7:42 PM
To: Udbhav Agarwal <udbhav.agar...@syncoms.com>
Cc: user@spark.apache.org
Subject: Re: Reindexing in graphx

So first up, GraphX is not really designed for real-time graph mutation situations. That’s not to say it can’t be done, but you may be butting up against some of the design limitations in that area. As a first point of investigation you should look at the WebUI to see which particular tasks/stages are taking a long time, and which resource (CPU, IO, network, shuffles) they seem to be bottlenecking on.

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action
Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

On 24 Feb 2016, at 12:05, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:

Sounds useful, Robin. Thanks. I will try that. But FYI, in another case I tested adding only one vertex to the graph. In that case too the latency of subsequent additions kept increasing: the first addition of a vertex takes 3 seconds, the second 7 seconds, and so on.
This is the case where I want to add vertices to the graph as and when they arrive in our system; since it is a real-time system that I am trying to build, vertices will keep coming. Thanks.

From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: Wednesday, February 24, 2016 3:54 PM
To: Udbhav Agarwal <udbhav.agar...@syncoms.com>
Cc: user@spark.apache.org
Subject: Re: Reindexing in graphx

It looks like you are adding vertices one by one; you definitely don’t want to do that. What happens when you batch together 400 vertices into an RDD and then add all 400 in one go?

On 24 Feb 2016, at 05:49, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:

Thank you, Robin, for your reply. Actually, I am adding a bunch of vertices to a graph in GraphX using the following method, and I am facing a latency problem. The first addition of, say, 400 vertices to a graph with 100,000 nodes takes around 7 seconds. The next one takes 15 seconds. So every subsequent addition takes more time than the previous one. Hence I tried reindex(), so that subsequent operations would also be fast. FYI, my cluster presently consists of one machine with 8 cores and 8 GB RAM. I am running in local mode.
    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // gVertices, gEdges and inputGraph are vars defined outside this method.
    def addVertex(rdd: RDD[String], sc: SparkContext, session: String): Long = {
      val defaultUser = (0, 0)
      // Each incoming id becomes its own single-element RDD and its own union.
      rdd.collect().foreach { x =>
        val aVertex: RDD[(VertexId, (Int, Int))] = sc.parallelize(Array((x.toLong, (100, 100))))
        gVertices = gVertices.union(aVertex)
      }
      // Rebuild the graph from the enlarged vertex set and cache it.
      inputGraph = Graph(gVertices, gEdges, defaultUser)
      inputGraph.cache()
      gVertices = inputGraph.vertices
      gVertices.cache()
      val count = gVertices.count
      println(count)
      return 1
    }

From: Robin East [mailto:robin.e...@xense.co.uk]
Sent: Tuesday, February 23, 2016 8:15 PM
To: Udbhav Agarwal <udbhav.agar...@syncoms.com>
Subject: Re: Reindexing in graphx

Hi,

Well, this is the line that is failing in VertexRDDImpl:

    require(partitionsRDD.partitioner.isDefined)

But really you shouldn’t need to call the reindex() function, as it deals with some internals of the GraphX implementation - it looks to me like it ought to be a private method. Perhaps you could explain what you are trying to achieve.

On 23 Feb 2016, at 12:18, Udbhav Agarwal <udbhav.agar...@syncoms.com> wrote:

Hi,

I am trying to add vertices to a graph in GraphX and I want to reindex the graph. I can see there is a vertices.reindex() option in GraphX. But when I call graph.vertices.reindex() I get:

    java.lang.IllegalArgumentException: requirement failed

Please help me understand what I am missing in the syntax, as I have seen the API documentation, where only vertices.reindex() is mentioned.

Thanks,
Udbhav Agarwal
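
Robin’s suggestion earlier in the thread, batching the new vertices into one RDD and adding them in one go, could look roughly like the sketch below. It keeps the (100, 100) attribute and the (0, 0) default from the addVertex method above, but it is a rewrite under assumptions and not the poster’s actual fix: the object name, the Attr alias, the Int edge type and the sample data in main are hypothetical, and the graph is passed in and returned where the original mutates shared vars.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object BatchedVertexAdd {
  type Attr = (Int, Int)            // vertex attribute type used in the thread
  val defaultUser: Attr = (0, 0)

  /** Converts the whole batch of ids with one map and adds it with one union. */
  def addVertices(graph: Graph[Attr, Int], ids: RDD[String]): Graph[Attr, Int] = {
    // The ids stay distributed: no collect() and no per-vertex RDD.
    val newVertices: RDD[(VertexId, Attr)] = ids.map(id => (id.toLong, (100, 100)))
    val vertices = graph.vertices.union(newVertices)
    Graph(vertices, graph.edges, defaultUser).cache()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("batched-vertex-add"))
    try {
      // Tiny hypothetical graph: two vertices joined by one edge.
      val vertices: RDD[(VertexId, Attr)] = sc.parallelize(Seq((1L, (1, 1)), (2L, (2, 2))))
      val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 0)))
      val graph = Graph(vertices, edges, defaultUser)

      val updated = addVertices(graph, sc.parallelize(Seq("3", "4", "5")))
      println(s"vertex count after batch add: ${updated.vertices.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

This replaces one union per vertex with one union per batch. It does not by itself show whether the growing latency across batches goes away; that still needs to be measured, for example through the WebUI as suggested in the thread.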