Something wrong with sortBy

2016-04-21 Thread tuan3w
I'm working on implementing LSH on Spark. I started with an implementation provided by SoundCloud: https://github.com/soundcloud/cosine-lsh-join-spark/blob/master/src/main/scala/com/soundcloud/lsh/Lsh.scala. When I check the web UI, I see that after calling sortBy, the number of partitions of the RDD decreases f

Element appear in both 2 splits of RDD after using randomSplit

2016-02-20 Thread tuan3w
I'm training a model using MLlib. When I tried to split the data into training and test sets, I ran into a weird problem, and I can't figure out what is causing it. Here is the code from my experiment: val logData = rdd.map(x => (x._1, x._2)).distinct() val ratings: RDD[Rating] = logData.map(x => Rating(x