Sorry if this functionality already exists or has been asked before, but I'm looking for an aggregate function in SparkSQL that allows me to collect a column into an array per group of a grouped dataframe.
For example, given the following table:

user,  score
------------
user1, 1
user2, 2
user1, 3
user2, 4
user1, 5

I want to produce a dataframe like:

user1, [1, 3, 5]
user2, [2, 4]

(possibly via something like "select collect(score) from table group by user").

I realize I can probably implement this as a UDAF, but I just want to double-check whether such a thing already exists. If not, would there be interest in having this in SparkSQL?

To give some context, I am trying to do a cogroup on datasets and then persist the result as Parquet, while taking advantage of Tungsten. So the plan is to:

1. collapse each row into a (key, value) pair where the value is a struct,
2. group by key and collect the values as array[struct],
3. outer join the resulting dataframes.

p.s. Here are some resources I have found so far, but all of them concern top-K per key instead:

http://apache-spark-user-list.1001560.n3.nabble.com/Efficient-way-to-get-top-K-values-per-key-in-key-value-RDD-td20370.html
https://issues.apache.org/jira/browse/SPARK-5954 (where Reynold mentioned this could be an API in dataframes)
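To pin down the semantics I'm after, here is a minimal plain-Python sketch (not Spark code) of the collect-per-group aggregation on the example rows above; the function name collect_per_group is just an illustrative placeholder:

```python
from collections import defaultdict

# Sample rows matching the example table above.
rows = [
    ("user1", 1),
    ("user2", 2),
    ("user1", 3),
    ("user2", 4),
    ("user1", 5),
]

def collect_per_group(pairs):
    """Group (key, value) pairs and collect the values into a list per key,
    preserving the order in which the values appear in the input."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

result = collect_per_group(rows)
# result == {"user1": [1, 3, 5], "user2": [2, 4]}
```

(For what it's worth, Hive exposes a collect_list UDAF with these semantics, which should be reachable from a HiveContext; I'm asking whether a native SparkSQL equivalent exists or is planned.)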