While creating a graph with 6B nodes and 12B edges, I noticed that the
*'numVertices' API returns an incorrect result*; 'numEdges' reports the correct
number. A few times (with different datasets of > 2.5B nodes) I have also
noticed that numVertices is returned as a negative number, so I suspect that
there is some overflow (maybe we are using Int for some field?).

Environment: Standalone mode running on EC2. Using the latest code from the
master branch, up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.

Here are some details of the experiments I have done so far:
1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626
2. Input: numNodes=*2157586441* ; noEdges=2747322705
   Graph returns: numVertices=*-2137380855* ; numEdges=2747322705
3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213
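
The numVertices values above are consistent with the overflow theory: each one
is what you would get if the real node count were truncated to a signed 32-bit
integer. Below is a minimal Python sketch of that arithmetic. Note that
to_int32 is a hypothetical helper written for this check, not a GraphX API; it
only emulates what a long-to-int narrowing on the JVM would do, and it does not
show where in GraphX such a narrowing might happen.

```python
def to_int32(n: int) -> int:
    """Truncate n to a signed 32-bit value (emulates a JVM long-to-int cast)."""
    n &= 0xFFFFFFFF                      # keep only the low 32 bits
    return n - (1 << 32) if n >= (1 << 31) else n  # reinterpret as signed


# (numNodes given as input, numVertices reported by the graph)
cases = [
    (6101995593, 1807028297),
    (2157586441, -2137380855),
    (1725060105, 1725060105),
]

for num_nodes, reported in cases:
    wrapped = to_int32(num_nodes)
    print(f"numNodes={num_nodes} wrapped={wrapped} matches={wrapped == reported}")
```

All three cases match, including case 3, where the count is below 2^31 and is
therefore unchanged by the truncation.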


You can find the code to reproduce this bug here:
https://gist.github.com/npanj/92e949d86d08715bf4bf

(I have also filed a JIRA ticket for this:
https://issues.apache.org/jira/browse/SPARK-3190)





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
