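import org.apache.spark.sql.functions._

// Assuming Spark 1.5: collect_set is a Hive UDAF there, so df must come from a HiveContext.
// collect_set drops duplicate ids, and the order of the collected ids is not guaranteed.
df.groupBy("category")
  .agg(callUDF("collect_set", df("id")).as("id_list"))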
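
For reference, here is a minimal sketch of the same grouping on newer Spark, where `collect_list` and `concat_ws` are available directly in `org.apache.spark.sql.functions`. It assumes Spark 2.x or later (for `SparkSession`), a local master, and the sample rows from the question. The names `GroupToList`, `groupIds` and `id_csv` are made up for illustration. `collect_list` keeps duplicates, which may be closer to "appending occurrences" than `collect_set`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

object GroupToList {
  // Hypothetical helper: one row per category, with the ids gathered into an
  // array column (id_list) and a comma-joined string column (id_csv).
  def groupIds(df: DataFrame): DataFrame =
    df.groupBy("category")
      // collect_list keeps duplicates; element order is not guaranteed after a shuffle
      .agg(collect_list(col("id")).as("id_list"))
      // concat_ws works on string arrays, so cast the collected ids explicitly
      .withColumn("id_csv", concat_ws(",", col("id_list").cast("array<string>")))

  def main(args: Array[String]): Unit = {
    // Local session for trying this out; cluster settings would differ
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-to-list")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question
    val df = Seq(("A", 1), ("A", 2), ("B", 3), ("B", 4), ("C", 5))
      .toDF("category", "id")

    groupIds(df).orderBy("category").show()
    spark.stop()
  }
}
```

If the array column alone is enough, the `withColumn` step can be dropped. Swapping `collect_list` for `collect_set` gives de-duplicated ids instead.
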
On Mon, Oct 12, 2015 at 11:08 PM, SLiZn Liu wrote:
> Hey Spark users,
>
> I'm trying to group a DataFrame and collect the values in each group into a
> list instead of counting them.
>
> Let's say we have a dataframe as shown below:
>
> | category | id |
> | -------- |:--:|
> | A | 1 |
> | A | 2 |
> | B | 3 |
> | B | 4 |
> | C | 5 |
>
> Ideally, after some magic group by (a reverse explode?), I'd get:
>
> | category | id_list |
> | -------- | -------- |
> | A | 1,2 |
> | B | 3,4 |
> | C | 5 |
>
> Any tricks to achieve that? The Scala Spark API is preferred. =D
>
> BR,
> Todd Leo