Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
I would guess that using the built-in min/max/struct functions will be much faster than a UDAF. They should have native internal implementations that take advantage of code generation.

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
Thanks Michael,

That seems like the analog to sorting tuples. I am curious, is there a significant performance penalty to the UDAF versus that? It's certainly nicer and more compact code at least.

—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder

Re: DataFrame Min By Column

2016-07-09 Thread Michael Armbrust
You can do what's called an *argmax/argmin*, where you take the min/max of a couple of columns that have been grouped together as a struct. We sort in column order, so you can put the timestamp first. Here is an example:
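[The example in the archived message is truncated; below is a minimal sketch of the struct-based argmin it describes, under assumed column names user, time, and value:]

    import org.apache.spark.sql.functions._

    // min over a struct compares its fields in order, so putting the
    // timestamp first yields the (time, value) pair with the earliest time
    val earliest = df
      .groupBy(col("user"))
      .agg(min(struct(col("time"), col("value"))).as("mn"))
      .select(col("user"), col("mn.time").as("time"), col("mn.value").as("value"))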

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
I implemented a more generic version which I posted here: https://gist.github.com/EntilZha/3951769a011389fef25e930258c20a2a. I think I could generalize this by pattern matching on DataType to use different getLong/getDouble/etc. functions (not trying to use getAs[] because getting T from Array[T…
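[For reference, a hypothetical sketch of the kind of min-by UDAF under discussion — this is not the gist's code; the key/value types and the DataType match arms are assumptions:]

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // keeps the value whose key is smallest; the comparison is chosen by
    // pattern matching on the key's DataType instead of relying on getAs[T]
    class MinByUDAF(keyType: DataType, valType: DataType) extends UserDefinedAggregateFunction {
      def inputSchema: StructType =
        StructType(StructField("key", keyType) :: StructField("value", valType) :: Nil)
      def bufferSchema: StructType = inputSchema
      def dataType: DataType = valType
      def deterministic: Boolean = true

      def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = null
        buffer(1) = null
      }

      private def lt(a: Any, b: Any): Boolean = keyType match {
        case LongType   => a.asInstanceOf[Long] < b.asInstanceOf[Long]
        case DoubleType => a.asInstanceOf[Double] < b.asInstanceOf[Double]
        case other      => sys.error(s"unsupported key type: $other")
      }

      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        if (!input.isNullAt(0) && (buffer.isNullAt(0) || lt(input.get(0), buffer.get(0)))) {
          buffer(0) = input.get(0)
          buffer(1) = input.get(1)
        }

      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        if (!buffer2.isNullAt(0) && (buffer1.isNullAt(0) || lt(buffer2.get(0), buffer1.get(0)))) {
          buffer1(0) = buffer2.get(0)
          buffer1(1) = buffer2.get(1)
        }

      def evaluate(buffer: Row): Any = buffer.get(1)
    }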

Re: DataFrame Min By Column

2016-07-09 Thread Pedro Rodriguez
Hi Xinh,

A co-worker also found that solution but I thought it was possibly overkill/brittle, so I looked into UDAFs (user-defined aggregate functions). I don't have code, but Databricks has a post that has an example: https://databricks.com/blog/2015/09/16/apache-spark-1-5-dataframe-api-highlights…
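[Once such a UDAF is defined, it is applied like any other aggregate; a hypothetical invocation of the MinByUDAF sketched earlier in this thread, with column names as assumptions:]

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DoubleType, LongType}

    // earliest "value" per user, keyed by the minimum "time"
    val minBy = new MinByUDAF(LongType, DoubleType)
    df.groupBy(col("user")).agg(minBy(col("time"), col("value")).as("value"))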

Re: DataFrame Min By Column

2016-07-08 Thread Xinh Huynh
Hi Pedro,

I could not think of a way using an aggregate. It's possible with a window function, partitioned on user and ordered by time:

    // Assuming "df" holds your dataframe ...
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window

    val wSpec = Window.partitionBy("user").orderBy("time")
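[The archived message cuts off here; a hedged sketch of how the window spec would then be used to keep each user's earliest row — row_number is one common choice:]

    // rank rows within each user's window and keep only the earliest one
    val earliest = df
      .withColumn("rn", row_number().over(wSpec))
      .where(col("rn") === 1)
      .drop("rn")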