I've found a problem in spark-shell, but I can't tell whether it corresponds to any open issue on Spark's JIRA. Could anyone help me determine whether this is a new issue, or one that's already being addressed?
Test (in spark-shell):

    case class Person(name: String, age: Int)
    val peopleList = List(Person("Alice", 35), Person("Bob", 47), Person("Alice", 35), Person("Bob", 15))
    val peopleRDD = sc.parallelize(peopleList)
    assert(peopleList.distinct.size == peopleRDD.distinct.count)

The assertion fails: the local List deduplicates the repeated Person values, but the RDD's .distinct does not.

At first I thought it was related to SPARK-2620 (https://issues.apache.org/jira/browse/SPARK-2620), which says case classes can't be used as keys in spark-shell because of how the REPL compiles case classes. It lists .reduceByKey, .groupByKey, and .distinct as affected. But the associated pull request adding tests to cover this (https://github.com/apache/spark/pull/1588) was closed.

Is this something I just have to live with when using the REPL? Or is it covered by something bigger that's being addressed?

Thanks in advance,
-Jay
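For what it's worth, a workaround I've been experimenting with, assuming the root cause is the REPL-generated case class's equals/hashCode behaving badly across the cluster, is to compare by the raw fields instead of the case class itself. This is only a sketch against the same peopleRDD as above:

```scala
// Sketch of a possible workaround (assumes the problem lies in how the REPL
// compiles the Person case class): deduplicate on plain tuples of the fields,
// since tuples are ordinary library classes and are unaffected.
val distinctCount = peopleRDD
  .map(p => (p.name, p.age))  // project each Person onto a (String, Int) tuple
  .distinct()                 // tuple equality/hashing works as expected
  .count()
```

This gives me the count I expect, though it obviously doesn't fix .distinct on the case class itself.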