I've found a problem in spark-shell, but I can't tell whether it corresponds to any open issue on Spark's JIRA. Could anyone help me determine whether this is a new issue, or one that's already being addressed?
Test (in spark-shell):

    case class Person(name: String, age: Int)
    val peopleList = List(Person("Alice", 35), Person("Bob", 47), Person("Alice", 35), Person("Bob", 15))
    val peopleRDD = sc.parallelize(peopleList)
    assert(peopleList.distinct.size == peopleRDD.distinct.count)

The assertion fails: the local List deduplicates the repeated Person values, but the RDD's .distinct does not.

At first I thought it was related to SPARK-2620 (https://issues.apache.org/jira/browse/SPARK-2620), which says case classes can't be used as keys in spark-shell because of how the REPL compiles case classes. It lists .reduceByKey, .groupByKey, and .distinct as affected. But the associated pull request adding tests to cover this (https://github.com/apache/spark/pull/1588) was closed.

Is this something I just have to live with when using the REPL? Or is it covered by something bigger that's being addressed?

Thanks in advance,
-Jay
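For what it's worth, a workaround I've been experimenting with, assuming the root cause is the REPL-generated case class's equals/hashCode behaving badly across the cluster, is to compare by the raw fields instead of the case class itself. This is only a sketch against the same peopleRDD as above:

```scala
// Sketch of a possible workaround (assumes the problem lies in how the REPL
// compiles the Person case class): deduplicate on plain tuples of the fields,
// since tuples are ordinary library classes and are unaffected.
val distinctCount = peopleRDD
  .map(p => (p.name, p.age))  // project each Person onto a (String, Int) tuple
  .distinct()                 // tuple equality/hashing works as expected
  .count()
```

This gives me the count I expect, though it obviously doesn't fix .distinct on the case class itself.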