Dear Spark Users, I googled the web for several hours now but I don't find a solution for my problem. So maybe someone from this list can help.
I have an RDD of case classes, generated from CSV files with Spark. When I used the distinct operator, there were still duplicates. So I investigated and found out that the equals returns false although the two objects were equal (so were their individual fields as well as toStrings). After googling it I found that the case class equals might break in case the two objects are created by different class loaders. So I implemented my own equals method using mattern matching (code example below). It still didn't work. Some debugging revealed that the problem lies in the pattern matching. Depending on the objects I compare (and maybe the split / classloader they are generated in?) the patternmatching works /doesn't: case class Customer(id: String, age: Option[Int], entryDate: Option[java.util.Date]) { def equals(that: Any): Boolean = that match { case Customer(id, age, entryDate) => { println("Pattern matching worked!") this.id == id && this.age == age && this.entryDate == entryDate } case _ => false } } //val x: Array[Customer] // ... some spark code to filter original data and collect x scala> x(0) Customer("a", Some(5), Some(Fri Sep 23 00:00:00 CEST 1994)) scala> x(1) Customer("a", None, None) scala> x(2) Customer("a", None, None) scala> x(3) Customer("a", None, None) scala> x(0) == x(0) // should be true and works Pattern matching works! res0: Boolean = true scala> x(0) == x(1) // should be false and works Pattern matching works! res1: Boolean = false scala> x(1) == x(2) // should be true, does not work res2: Boolean = false scala> x(2) == x(3) // should be true, does not work Pattern matching works! res3: Boolean = true scala> x(0) == x(3) // should be false, does not work res4: Boolean = false Why is the pattern matching not working? It seems that there are two kinds of Customers: 0,1 and 2,3 which don't match somehow. Is this related to some classloaders? Is there a way around this other than using instanceof and defining a custom equals operation for every case class I write? Thanks for the help! Frank