I suspect this is another instance of case classes not working as expected between the driver and executor when used with spark-shell. Search JIRA for some back story.
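If that is indeed the cause, one workaround is to key by a plain tuple rather than a case class defined in the shell, since tuple equality and hashing are unaffected by the REPL's class wrapping. Below is a minimal sketch along those lines; it assumes the same input layout as your example, and the tuple-based key is my substitution, not part of your original code:

val myrdd = sc.textFile("/user/al733a/mydata.txt").map { line =>
  val spl = line.split("\\|", -1)   // split key fields from value fields on "|"
  val k = spl(0).split(",")
  val v = spl(1).split(",")
  // key by a tuple instead of the REPL-defined Mykey case class;
  // values kept as a tuple as well
  ((k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7)),
   (v(0).toLong, v(1).toLong, v(2).toDouble))
}

myrdd.groupByKey().mapValues(_ => 1).collect().foreach(println)

Another option is to compile the case classes into a jar and pass it to spark-shell with --jars, so the driver and executors share one class definition.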
On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Spark 1.5.0
>
> data:
>
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>
> spark-shell:
>
> spark-shell \
>   --num-executors 2 \
>   --driver-memory 1g \
>   --executor-memory 10g \
>   --executor-cores 8 \
>   --master yarn-client
>
> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
>   f4:Char, f5:Char, f6:String)
> case class Myvalue(count1:Long, count2:Long, num:Double)
>
> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
>   val spl = line.split("\\|", -1)
>   val k = spl(0).split(",")
>   val v = spl(1).split(",")
>   (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
>     k(5)(0).toChar, k(6)(0).toChar, k(7)),
>    Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
>   )
> }}
>
> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
> }.collect().foreach(println)
>
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>
> You can see that each key is repeated 2 times but each key should only
> appear once.
>
> Arun
>
> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> Can you give a bit more information ?
>>
>> Release of Spark you're using
>> Minimal dataset that shows the problem
>>
>> Cheers
>>
>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>>>
>>> I tried groupByKey and noticed that it did not group all values into the
>>> same group.
>>>
>>> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
>>> distinct keys, so I expected there to be 4 records in the groupByKey object,
>>> but instead there were 8. Each of the 4 distinct keys appear 2 times.
>>>
>>> Is this the expected behavior? I need to be able to get ALL values
>>> associated with each key grouped into a SINGLE record. Is it possible?
>>>
>>> Arun
>>>
>>> p.s. reducebykey will not be sufficient for me