If I simplify the key to a plain String column with the values lo1, lo2, lo3, lo4, it works correctly.
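For reference, here is a minimal sketch of the simplified run that works (same input file and split logic as in the code quoted below; the simpleRdd name is just for illustration):

val simpleRdd = sc.textFile("/user/al733a/mydata.txt").map { line =>
  // same parsing as before, but the key is only the lo string (lo1..lo4)
  val k = line.split("\\|", -1)(0).split(",")
  (k(1), 1L)
}
simpleRdd.groupByKey().map { case (lo, vals) => (lo, vals.size) }
  .collect().foreach(println)

With the plain String key this prints exactly 4 records, one per distinct lo value.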
On Mon, Jan 4, 2016 at 4:49 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:

> Could you try simplifying the key and seeing if that makes any difference?
> Make it just a string or an int so we can rule out any issues in object
> equality.
>
> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra <arun.lut...@gmail.com> wrote:
>
>> Spark 1.5.0
>>
>> data:
>>
>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo1,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo2,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo3,8,0,4,0,5,20150901|50000,10000,1.0
>> p1,lo4,8,0,4,0,5,20150901|50000,10000,1.0
>>
>> spark-shell:
>>
>> spark-shell \
>>   --num-executors 2 \
>>   --driver-memory 1g \
>>   --executor-memory 10g \
>>   --executor-cores 8 \
>>   --master yarn-client
>>
>> case class Mykey(uname: String, lo: String, f1: Char, f2: Char, f3: Char,
>>                  f4: Char, f5: Char, f6: String)
>> case class Myvalue(count1: Long, count2: Long, num: Double)
>>
>> val myrdd = sc.textFile("/user/al733a/mydata.txt").map { line =>
>>   val spl = line.split("\\|", -1)
>>   val k = spl(0).split(",")
>>   val v = spl(1).split(",")
>>   (Mykey(k(0), k(1), k(2)(0), k(3)(0), k(4)(0), k(5)(0), k(6)(0), k(7)),
>>    Myvalue(v(0).toLong, v(1).toLong, v(2).toDouble))
>> }
>>
>> myrdd.groupByKey().map { case (mykey, val_iterable) => (mykey, 1) }
>>   .collect().foreach(println)
>>
>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo1,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo3,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo4,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>> (Mykey(p1,lo2,8,0,4,0,5,20150901),1)
>>
>> You can see that each key appears 2 times, but each key should only
>> appear once.
>>
>> Arun
>>
>> On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Can you give a bit more information?
>>>
>>> Release of Spark you're using
>>> Minimal dataset that shows the problem
>>>
>>> Cheers
>>>
>>> On Mon, Jan 4, 2016 at 3:55 PM, Arun Luthra <arun.lut...@gmail.com>
>>> wrote:
>>>
>>>> I tried groupByKey and noticed that it did not group all values into
>>>> the same group.
>>>>
>>>> In my test dataset (a pair RDD) I have 16 records with only 4 distinct
>>>> keys, so I expected 4 records in the groupByKey result, but instead
>>>> there were 8. Each of the 4 distinct keys appears 2 times.
>>>>
>>>> Is this the expected behavior? I need to be able to get ALL values
>>>> associated with each key grouped into a SINGLE record. Is it possible?
>>>>
>>>> Arun
>>>>
>>>> p.s. reduceByKey will not be sufficient for me
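One untested workaround sketch, in case the case class key itself turns out to be the problem (case classes defined inside spark-shell have been a known pitfall as grouping keys in some Spark versions): group on a plain tuple of Mykey's fields, which hashes via the standard library rather than the REPL-generated class, and rebuild Mykey afterwards. This reuses myrdd and Mykey from the shell session above.

val grouped = myrdd
  .map { case (mk, mv) =>
    // tuple key built from the same fields as Mykey
    ((mk.uname, mk.lo, mk.f1, mk.f2, mk.f3, mk.f4, mk.f5, mk.f6), mv)
  }
  .groupByKey()
  .map { case ((u, lo, f1, f2, f3, f4, f5, f6), vals) =>
    (Mykey(u, lo, f1, f2, f3, f4, f5, f6), vals)
  }

// If tuple keys group correctly, this prints 4 records, one per distinct key.
grouped.map { case (mykey, vals) => (mykey, vals.size) }.collect().foreach(println)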