Sorry if this functionality already exists or has been asked about before, but
I'm looking for an aggregate function in Spark SQL that collects a column
into an array per group on a grouped DataFrame.

For example, if I have the following table
user, score
--------
user1, 1
user2, 2
user1, 3
user2, 4
user1, 5

I want to produce a DataFrame like this:

user1, [1, 3, 5]
user2, [2, 4]

(possibly via something like: select user, collect(score) from table group by user)
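To make the desired semantics concrete, here is a minimal plain-Python sketch of the "collect a column into an array per group" operation on the example table above. This is independent of Spark; `collect_by_key` is an illustrative helper name, not a Spark API:

```python
# Rows from the example table as (key, value) pairs.
rows = [("user1", 1), ("user2", 2), ("user1", 3), ("user2", 4), ("user1", 5)]

def collect_by_key(rows):
    """Group (key, value) rows and collect the values into a list per key."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return groups

print(collect_by_key(rows))
# {'user1': [1, 3, 5], 'user2': [2, 4]}
```

In Spark terms, the keys would be the grouping column and each list would become an array-typed column in the result.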


I realize I could probably implement this as a UDAF, but I just want to
double-check whether such a thing already exists. If not, would there be
interest in having this in Spark SQL?

To give some context: I am trying to cogroup two datasets and then persist
the result as Parquet, while taking advantage of Tungsten. The plan is to
(1) collapse each row into a (key, value-struct) pair, (2) group by key and
collect the values as an array[struct], and (3) outer join the resulting
DataFrames.
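The three-step plan above can be mocked in plain Python, with a dict of lists standing in for the grouped array[struct] column and dicts standing in for structs (all names here are illustrative, not Spark APIs):

```python
def collect_by_key(rows):
    """Step 2: group by key and collect values into a list per key."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return groups

# Step 1: two "datasets" already collapsed to (key, struct-like value) pairs.
left  = [("user1", {"score": 1}), ("user2", {"score": 2}), ("user1", {"score": 3})]
right = [("user2", {"bonus": 9})]

lg, rg = collect_by_key(left), collect_by_key(right)

# Step 3: full outer join on the grouped keys; a missing side becomes None.
joined = {k: (lg.get(k), rg.get(k)) for k in set(lg) | set(rg)}
print(joined["user1"])
# ([{'score': 1}, {'score': 3}], None)
```

The point of collecting before joining is that each key appears exactly once on each side, so the join is one-to-one instead of many-to-many.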

P.S. Here are some resources I have found so far, but all of them concern
top-K values per key instead:
http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-get-top-K-values-per-key-in-key-value-RDD-td20370.html
and
https://issues.apache.org/jira/browse/SPARK-5954 (where Reynold mentioned
this could be an API in dataframe)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Collect-Column-as-Array-in-Grouped-DataFrame-tp25223.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
