Hi Terence,
which implementation are you using? I tested it and the results look very
good
id --- result value --------- percentage ----------- percentage
(wikipedia)
2: 3.5658816369034536 (38.43986817970977 %), 38.4%
3: 3.1809909923039688 (34.29078328331496 %), 34.3%
5: 0.7503491964913347 (8.088693663686792 %), 8.1%
6: 0.36259893900587814 (3.9087824097247115 %), 3.9%
4: 0.36259893900587814 (3.9087824097247115 %), 3.9%
1: 0.30409919649133466 (3.2781606954384332 %), 3.3%
10: 0.15 (1.6169858716801204 %), 1.6%
8: 0.15 (1.6169858716801204 %), 1.6%
9: 0.15 (1.6169858716801204 %), 1.6%
7: 0.15 (1.6169858716801204 %), 1.6%
11: 0.15 (1.6169858716801204 %), 1.6%
This is the coding I used:
val edges = Seq((2, 3), (3, 2), (4, 1), (4, 2), (5, 2), (5, 4), (5,
6), (6, 2), (6, 5), (7, 2), (7, 5), (8, 2), (8, 5), (9, 2), (9, 5),
(10, 5), (11, 5)).map((x) => new Edge[Int](x._1, x._2, 0))
val vertices = (1L to 11L).map(x => (x, x))
val graph = Graph[Long, Int](sc.parallelize(vertices), sc.parallelize(edges))
val res = graph.pageRank(0.00001)
val resV = res.vertices.collect()
val sum = resV.map(_._2).sum
println(resV.sortBy(_._2).reverse.map(x => s"${x._1}: ${x._2} (${x._2
/ sum * 100} %)").reduce((a, b) => s"$a\n$b"))
I used the latest master branch. I guess they only added the
personalized page rank since 1.2.
Hopefully this helps
Tarek
On Wed, Jun 24, 2015 at 9:39 PM Kelly, Terence P (HP Labs Researcher) <
[email protected]> wrote:
> Hi,
>
> Colleagues and I have found that the PageRank implementation bundled
> with Spark is incorrect in several ways. The code in question is in
> Apache Spark 1.2 distribution's "examples" directory, called
> "SparkPageRank.scala".
>
> Consider the example graph presented in the colorful figure on the
> Wikipedia page for "PageRank"; below is an edge list representation,
> where vertex "A" is "1", "B" is "2", etc.:
>
> - - - - - begin
> 2 3
> 3 2
> 4 1
> 4 2
> 5 2
> 5 4
> 5 6
> 6 2
> 6 5
> 7 2
> 7 5
> 8 2
> 8 5
> 9 2
> 9 5
> 10 5
> 11 5
> - - - - - end
>
> Here's the output we get from Spark's PageRank after 100 iterations:
>
> B has rank: 1.9184837009011475.
> C has rank: 1.7807113697064196.
> E has rank: 0.24301279014684984.
> A has rank: 0.24301279014684984.
> D has rank: 0.21885362387494078.
> F has rank: 0.21885362387494078.
>
> There are three problems with the output:
>
> 1. Only six of the eleven vertices are represented in the output;
> by definition, PageRank assigns a value to each vertex.
>
> 2. The values do not sum to 1.0; by definition, PageRank is a
> probability vector with one element per vertex and the sum of
> the elements of the vector must be 1.0.
>
> 3. Vertices E and A receive the same PageRank, whereas other means of
> computing PageRank, e.g., our own homebrew code and the method
> used by Wikipedia, assign different values to these vertices. Our
> own code has been compared against the PageRank implementation in
> the "NetworkX" package and it agrees.
>
> It looks like bug #1 is due to the Spark implementation of PageRank
> not emitting output for vertices with no incoming edges and bug #3 is
> due to the code not correctly handling vertices with no outgoing
> edges. Once #1 and #3 are fixed, normalization might be all that's
> required to fix #2 (maybe).
>
> We currently rely on the Spark PageRank for tests we're
> conducting; when do you think a fix might be ready?
>
> Thanks.
>
> -- Terence Kelly, HP Labs
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>