I suspect this is another instance of case classes not working as expected between the driver and executor when used with spark-shell. Search JIRA for some back story.
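If that is indeed the cause, one workaround is to key by a plain tuple rather than a case class defined in the shell, since tuple equality and hashing are unaffected by the REPL's class wrapping. Below is a minimal sketch along those lines; it assumes the same input layout as your example, and the tuple-based key is my substitution, not part of your original code:

val myrdd = sc.textFile("/user/al733a/mydata.txt").map { line =>
  val spl = line.split("\\|", -1)   // split key fields from value fields on "|"
  val k = spl(0).split(",")
  val v = spl(1).split(",")
  // key by a tuple instead of the REPL-defined Mykey case class;
  // values kept as a tuple as well
  ((k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7)),
   (v(0).toLong, v(1).toLong, v(2).toDouble))
}

myrdd.groupByKey().mapValues(_ => 1).collect().foreach(println)

Another option is to compile the case classes into a jar and pass it to spark-shell with --jars, so the driver and executors share one class definition.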
On Tue, Jan 5, 2016 at 12:42 AM, Arun Luthra <arun.lut...@gmail.com> wrote:
> Spark 1.5.0
>
> data:
>
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>
> spark-shell:
>
> spark-shell \
>   --num-executors 2 \
>   --driver-memory 1g \
>   --executor-memory 10g \
>   --executor-cores 8 \
>   --master yarn-client
>
> case class Mykey(uname:String, lo:String, f1:Char, f2:Char, f3:Char,
>   f4:Char, f5:Char, f6:String)
> case class Myvalue(count1:Long, count2:Long, num:Double)
>
> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { case line => {
>   val spl = line.split("\\|", -1)
>   val k = spl(0).split(",")
>   val v = spl(1).split(",")
>   (Mykey(k(0), k(1), k(2)(0).toChar, k(3)(0).toChar, k(4)(0).toChar,
>     k(5)(0).toChar, k(6)(0).toChar, k(7)),
>    Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble)
>   )
> }}
>
> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1)
> }.collect().foreach(println)
>
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>
> You can see that each key is repeated 2 times but each key should only
> appear once.
>
> Arun
>
> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> Can you give a bit more information ?
>>
>> Release of Spark you're using
>> Minimal dataset that shows the problem
>>
>> Cheers
>>
>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra <arun.lut...@gmail.com> wrote:
>>>
>>> I tried groupByKey and noticed that it did not group all values into the
>>> same group.
>>>
>>> In my test dataset (a Pair rdd) I have 16 records, where there are only 4
>>> distinct keys, so I expected there to be 4 records in the groupByKey object,
>>> but instead there were 8. Each of the 4 distinct keys appear 2 times.
>>>
>>> Is this the expected behavior? I need to be able to get ALL values
>>> associated with each key grouped into a SINGLE record. Is it possible?
>>>
>>> Arun
>>>
>>> p.s. reducebykey will not be sufficient for me