Hi Rishitesh,

I did it with combineByKey, but your solution is clearer and more readable;
at least it doesn't require three lambda functions to get confused by. I
will definitely try it out tomorrow, thanks. 😁
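
For reference, the combineByKey version I used looked roughly like this (a
sketch from memory, assuming the same df, with a string category column and
an integer id column):

val pairs = df.rdd.map(r => (r.getString(0), r.getInt(1)))
val grouped = pairs.combineByKey(
  (id: Int) => Seq(id),                   // createCombiner: start a list for a new key
  (acc: Seq[Int], id: Int) => acc :+ id,  // mergeValue: append an id within a partition
  (a: Seq[Int], b: Seq[Int]) => a ++ b)   // mergeCombiners: concatenate across partitions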

Also, an OutOfMemoryError keeps bothering me as I read a massive number of
JSON files, even though the RDD yielded by combineByKey is rather small.
I'll send another mail to describe this.

BR,
Todd Leo

Rishitesh Mishra <rmis...@snappydata.io> wrote on Tue, Oct 13, 2015 at 19:05:

> Hi Liu,
> I could not find any operator on DataFrame that will give the desired
> result. The DataFrame API, as expected, works on the Row format and offers
> a fixed set of operators on it.
> However, you can achieve the desired result by accessing the underlying RDD
> as below:
>
> case class Test(category: String, id: Int)
>
> val s = Seq(Test("A", 1), Test("A", 2), Test("B", 1), Test("B", 2))
> val rdd = testSparkContext.parallelize(s)
> val df = snc.createDataFrame(rdd)
>
> // key by category, wrapping each id in a Seq so the values can be concatenated
> val rdd1 = df.rdd.map(p => (p.getString(0), Seq(p.getInt(1))))
>
> // concatenate the id lists; this also works when a key has more than two rows
> val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q
>
> val rdd3 = rdd1.reduceByKey(reduceF)
> rdd3.foreach(r => println(r))
>
>
>
> You can always convert the resulting RDD back to a DataFrame after the
> transformation and reduce.
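>
> For example, something along these lines (a rough sketch, assuming a small
> case class to hold the grouped result):
>
> case class Grouped(category: String, idList: Seq[Int])
> val groupedDF = snc.createDataFrame(
>   rdd3.map { case (cat, ids) => Grouped(cat, ids) })
> groupedDF.show()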
>
>
> Regards,
> Rishitesh Mishra,
> SnappyData (http://www.snappydata.io/)
>
>
> https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile
>
> On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu <sliznmail...@gmail.com>
> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group a DataFrame by a column, collecting the occurring ids
>> into a list instead of counting them.
>>
>> Let's say we have a DataFrame as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> Ideally, after some magic group-by (a reverse explode?):
>>
>> | category | id_list  |
>> | -------- | -------- |
>> | A        | 1,2      |
>> | B        | 3,4      |
>> | C        | 5        |
>>
>> Any tricks to achieve that? The Scala Spark API is preferred. =D
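>>
>> For reference, the input above can be built with something like this (a
>> sketch, assuming a SQLContext named sqlContext):
>>
>> import sqlContext.implicits._
>> val df = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))
>>   .toDF("category", "id")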
>>
>> BR,
>> Todd Leo
>>