Hi Liu,

I could not find any operator on DataFrame that gives the desired result directly. The DataFrame API, as expected, works on rows and offers only a fixed set of operators over them. However, you can achieve the desired result by dropping down to the underlying RDD, as below:
// sample data; the case class definition is assumed, it was not shown in the original mail
case class Test(category: String, id: Int)

val s = Seq(Test("A", 1), Test("A", 2), Test("B", 1), Test("B", 2))
val rdd = testSparkContext.parallelize(s)
val df = snc.createDataFrame(rdd)

// key each row by its category, wrapping each id in a one-element Seq
val rdd1 = df.rdd.map(p => (p.getString(0), Seq(p.getInt(1))))

// concatenate the per-key Seqs; ++ keeps every id even when a key has
// more than two rows
val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q

val rdd3 = rdd1.reduceByKey(reduceF)
rdd3.foreach(r => println(r))   // e.g. (A,List(1, 2)), (B,List(1, 2))

You can always reconvert the obtained RDD back to a DataFrame after the
transformation and reduce; I have put a small sketch of that step at the
end of this mail.

Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile

On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:

> Hey Spark users,
>
> I'm trying to group by a dataframe, by appending occurrences into a list
> instead of count.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> | -------- |:--:|
> | A        | 1  |
> | A        | 2  |
> | B        | 3  |
> | B        | 4  |
> | C        | 5  |
>
> ideally, after some magic group by (reverse explode?):
>
> | category | id_list |
> | -------- | ------- |
> | A        | 1,2     |
> | B        | 3,4     |
> | C        | 5       |
>
> any tricks to achieve that? Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
> --
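
P.S. Here is the reconversion sketch mentioned above (assuming snc is the
same SQLContext/SnappyContext used earlier; the column names are only
illustrative):

// rdd3 is an RDD[(String, Seq[Int])], so createDataFrame can derive the
// schema from the tuple type; toDF just renames the generated _1/_2 columns
val resultDF = snc.createDataFrame(rdd3).toDF("category", "id_list")
resultDF.show()   // schema: category: string, id_list: array<int>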