Hi,

I have a key-value RDD (ligneReceipt_cleTable in the code below). After a
groupByKey, counting the resulting RDD several times gives inconsistent,
apparently non-deterministic results.

Here is the test code:

  val step1 = ligneReceipt_cleTable.persist   // cache the input key-value RDD
  val step2 = step1.groupByKey                // one row per distinct key

  val s1size = step1.count                    // rows before grouping
  val s2size = step2.count                    // rows after grouping

  val t = step2                               // RDD after groupByKey

  val t1 = t.count                            // count the same grouped RDD 8 times
  val t2 = t.count
  val t3 = t.count
  val t4 = t.count
  val t5 = t.count
  val t6 = t.count
  val t7 = t.count
  val t8 = t.count

  println("s1size = " + s1size)
  println("s2size = " + s2size)
  println("1 => " + t1)
  println("2 => " + t2)
  println("3 => " + t3)
  println("4 => " + t4)
  println("5 => " + t5)
  println("6 => " + t6)
  println("7 => " + t7)
  println("8 => " + t8)

Here are the results:

s1size = 5338864
s2size = 5268001
1 => 5268002
2 => 5268001
3 => 5268001
4 => 5268002
5 => 5268001
6 => 5268002
7 => 5268002
8 => 5268001

Even though the difference is only one row, it is troubling that repeated
counts of the same RDD do not agree.
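In case it helps, here is a cross-check I plan to try (just a sketch, not run
yet): since groupByKey should produce exactly one row per distinct key,
step2.count ought to match step1.keys.distinct.count every time.

  // Sketch (not run yet): both counts should equal the number of distinct keys,
  // so any disagreement between them, or across repeated runs, shows the same
  // instability as above.
  val distinctKeys = step1.keys.distinct.count
  val groupedRows  = step2.count
  println("distinct keys   = " + distinctKeys)
  println("groupByKey rows = " + groupedRows)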

Any ideas?

Thank you.



