Hi Rishitesh,

I did it with combineByKey, but your solution is clearer and more readable; at the least, it doesn't require three lambda functions to get confused by. I will definitely try it out tomorrow, thanks. 😁
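For reference, my combineByKey attempt looked roughly like the sketch below (untested, reconstructed from memory; it assumes rows of (category, id) as in your example, with df being the DataFrame, and these are the three lambdas I meant):

val pairs = df.rdd.map(r => (r.getString(0), r.getInt(1)))

val grouped = pairs.combineByKey(
  (id: Int) => Seq(id),                  // createCombiner: start a list for a new key
  (acc: Seq[Int], id: Int) => acc :+ id, // mergeValue: append an id within a partition
  (a: Seq[Int], b: Seq[Int]) => a ++ b   // mergeCombiners: concatenate partial lists
)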
Plus, an OutOfMemoryError keeps bothering me as I read a massive amount of JSON files, whereas the RDD yielded by combineByKey is rather small. Anyway, I'll file another mail to describe this.

BR,
Todd Leo

Rishitesh Mishra <rmis...@snappydata.io> wrote on Tue, Oct 13, 2015 at 19:05:

> Hi Liu,
> I could not see any operator on DataFrame that gives the desired result.
> The DataFrame API, as expected, works on the Row format with a fixed set
> of operators on it. However, you can achieve the desired result by
> accessing the underlying RDD, as below:
>
> case class Test(category: String, id: Int)
>
> val s = Seq(Test("A", 1), Test("A", 2), Test("B", 1), Test("B", 2))
> val rdd = testSparkContext.parallelize(s)
> val df = snc.createDataFrame(rdd)
> val rdd1 = df.rdd.map(p => (p.getString(0), Seq(p.getInt(1))))
>
> // Concatenate the partial lists. (Taking only the heads would drop
> // values for any key that occurs more than twice.)
> val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q
>
> val rdd3 = rdd1.reduceByKey(reduceF)
> rdd3.foreach(r => println(r))
>
> You can always convert the RDD obtained after the transformation and
> reduce back to a DataFrame.
>
> Regards,
> Rishitesh Mishra,
> SnappyData (http://www.snappydata.io/)
>
> https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile
>
> On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:
>
>> Hey Spark users,
>>
>> I'm trying to group a DataFrame by a column, appending the occurrences
>> into a list instead of counting them.
>>
>> Say we have a DataFrame as shown below:
>>
>> | category | id |
>> | -------- |:--:|
>> | A        | 1  |
>> | A        | 2  |
>> | B        | 3  |
>> | B        | 4  |
>> | C        | 5  |
>>
>> Ideally, after some magic group-by (a reverse explode?):
>>
>> | category | id_list |
>> | -------- | ------- |
>> | A        | 1,2     |
>> | B        | 3,4     |
>> | C        | 5       |
>>
>> Any tricks to achieve that? The Scala Spark API is preferred. =D
>>
>> BR,
>> Todd Leo
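PS: for converting the reduced RDD back to a DataFrame, as Rishitesh suggests at the end, something along these lines should work. A minimal sketch, untested; snc is the context from his example, and the column names here are just for illustration:

import snc.implicits._  // brings in .toDF for RDDs of tuples

// rdd3 is an RDD[(String, Seq[Int])] after the reduceByKey above
val result = rdd3.toDF("category", "id_list")
result.show()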
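Also, for the record: if a HiveContext is available, Hive's collect_list UDAF does this grouping directly in SQL, without dropping to the RDD level. A minimal sketch, assuming a HiveContext named hc and the df from above ("events" is an arbitrary table name for this sketch):

df.registerTempTable("events")
val grouped = hc.sql(
  "SELECT category, collect_list(id) AS id_list FROM events GROUP BY category")
grouped.show()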