Hi Liu,
I could not find any DataFrame operator that gives the desired result.
The DataFrame API, as expected, works on the Row format with a fixed set
of operators. However, you can achieve the desired result by accessing
the underlying RDD, as below:

// assuming: case class Test(category: String, id: Int)
val s = Seq(Test("A", 1), Test("A", 2), Test("B", 1), Test("B", 2))
val rdd = testSparkContext.parallelize(s)
val df = snc.createDataFrame(rdd)
// key by category, wrapping each id in a single-element Seq
val rdd1 = df.rdd.map(p => (p.getString(0), Seq(p.getInt(1))))

// concatenate the per-key Seqs; this stays correct for any number of
// rows per key (Seq(p.head, q.head) would drop values beyond two)
val reduceF = (p: Seq[Int], q: Seq[Int]) => p ++ q

val rdd3 = rdd1.reduceByKey(reduceF)
rdd3.foreach(r => println(r))



You can always convert the resulting RDD back to a DataFrame after the
transformation and reduce.
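For completeness, a sketch of that round trip, assuming the same `snc`
context and the `rdd3` from above (the column names here are just
illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Describe the shape of the reduced data: a string key and a list of ints.
val schema = StructType(Seq(
  StructField("category", StringType, nullable = false),
  StructField("id_list", ArrayType(IntegerType), nullable = false)))

// Map each (category, ids) pair into a Row, then rebuild the DataFrame.
val rowRDD = rdd3.map { case (category, ids) => Row(category, ids) }
val df2 = snc.createDataFrame(rowRDD, schema)
df2.show()
```

From there you can continue with the usual DataFrame operators on `df2`.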


Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://www.linkedin.com/profile/view?id=AAIAAAIFdkMB_v-nolCrFH6_pKf9oH6tZD8Qlgo&trk=nav_responsive_tab_profile

On Tue, Oct 13, 2015 at 11:38 AM, SLiZn Liu <sliznmail...@gmail.com> wrote:

> Hey Spark users,
>
> I'm trying to group a DataFrame by a column, collecting the occurrences
> into a list instead of counting them.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> | -------- |:--:|
> | A        | 1  |
> | A        | 2  |
> | B        | 3  |
> | B        | 4  |
> | C        | 5  |
>
> ideally, after some magic group by (reverse explode?):
>
> | category | id_list  |
> | -------- | -------- |
> | A        | 1,2      |
> | B        | 3,4      |
> | C        | 5        |
>
> Any tricks to achieve that? The Scala Spark API is preferred. =D
>
> BR,
> Todd Leo
>

