Hi Liu,
I could not find any DataFrame operator that gives the desired result.
As expected, the DataFrame API works on the Row format and offers a fixed
set of operators on it.
However, you can achieve the desired result by accessing the underlying
RDD, as below:
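// Test is assumed to be a simple case class along these lines
case class Test(category: String, id: Int)
val s = Seq(Test("A", 1), Test("A", 2), Test("B", 1), Test("B", 2))
val rdd = testSparkContext.parallelize(s)
val df = snc.createDataFrame(rdd)
// Key each row by category and wrap the id in a single-element Seq
val rdd1 = df.rdd.map(p => (p.getString(0), Seq(p.getInt(1))))
// Concatenate the lists; Seq(p.head, q.head) would drop ids once a key has more than two rows
val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q
val rdd3 = rdd1.reduceByKey(reduceF)
// Collect to the driver so the output is printed locally rather than on the executors
rdd3.collect().foreach(println)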
You can always convert the resulting RDD back to a DataFrame after the
transformation and reduce.
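
A minimal self-contained sketch of the full round trip (DataFrame to RDD,
reduce by key, back to a DataFrame). It assumes a local SparkContext and a
plain SQLContext in place of testSparkContext and snc; the Test and Grouped
case classes, their field names, and the GroupToList object are hypothetical
and only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical row types for illustration; the field names are assumptions.
case class Test(category: String, id: Int)
case class Grouped(category: String, id_list: Seq[Int])

object GroupToList {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, standing in for testSparkContext / snc.
    val conf = new SparkConf().setAppName("group-to-list").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.createDataFrame(
      sc.parallelize(Seq(Test("A", 1), Test("A", 2), Test("B", 3), Test("B", 4), Test("C", 5))))

    // Drop to the RDD, key by category, and concatenate the per-key lists.
    val grouped = df.rdd
      .map(r => (r.getString(0), Seq(r.getInt(1))))
      .reduceByKey(_ ++ _)
      .map { case (category, ids) => Grouped(category, ids) }

    // Convert the reduced RDD back into a DataFrame with columns category and id_list.
    val result = sqlContext.createDataFrame(grouped)
    result.show()

    sc.stop()
  }
}
```

Note that the order of ids inside each list is not guaranteed, since
reduceByKey may merge partitions in any order; sort the list in the final
map step if a stable order matters.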
Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile
On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu <[email protected]> wrote:
> Hey Spark users,
>
> I'm trying to group by a dataframe, by appending occurrences into a list
> instead of count.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> | -------- |:--:|
> | A | 1 |
> | A | 2 |
> | B | 3 |
> | B | 4 |
> | C | 5 |
>
> ideally, after some magic group by (reverse explode?):
>
> | category | id_list |
> | -------- | -------- |
> | A | 1,2 |
> | B | 3,4 |
> | C | 5 |
>
> any tricks to achieve that? Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>
>
>
>