Re: Compute Median in Spark Dataframe

2015-06-22 Thread Deenar Toraskar
Many thanks, will look into this. I don't particularly want to reuse the custom Hive UDAF I have; I would prefer writing a new one if that is cleaner. I am just using the JVM.

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Holden Karau
My current example doesn't use a Hive UDAF, but you would do something pretty similar (it calls a new user-defined UDAF, and there are wrappers to make Spark SQL UDAFs from Hive UDAFs, but they are private). So this is doable, but since it pokes at internals it will likely break between versions of Spark.

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Deenar Toraskar
Hi Holden, Olivier. >> So for Column you need to pass in a Java function; I have some sample code which does this but it does terrible things to access Spark internals. << I also need to call a Hive UDAF in a DataFrame agg function. Are there any examples of what Column expects? Deenar

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
So for Column you need to pass in a Java function; I have some sample code which does this, but it does terrible things to access Spark internals.

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Nice to hear from you Holden! I ended up trying exactly that (Column), but I may have done it wrong:

In [5]: g.agg(Column("percentile(value, 0.5)"))
Py4JError: An error occurred while calling o97.agg. Trace:
py4j.Py4JException: Method agg([class java.lang.String, class scala.collection.immuta

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
Not super easily: the GroupedData class uses a strToExpr function which supports only a limited set of functions, so we can't pass in the name of an arbitrary Hive UDAF (unless I'm missing something). We can instead construct a Column with the expression you want and then pass it in to agg() that way.

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
I've finally come to the same conclusion, but isn't there any way to call these Hive UDAFs from agg("percentile(key,0.5)")?

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this... sqlContext should be a HiveContext instance:

case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key, 0.5) from table").show()
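[Editor's note, not part of the original thread: Hive's percentile UDAF computes the exact percentile with linear interpolation over the sorted values, so percentile(key, 0.5) over the keys 1..50 in this example should return 25.5. A minimal plain-Python sketch of that arithmetic; the percentile helper below is hypothetical, for illustration only, and not a Spark or Hive API.]

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 1) with linear interpolation,
    mirroring the semantics of Hive's percentile UDAF."""
    s = sorted(values)
    idx = p * (len(s) - 1)   # fractional rank into the sorted list
    lo = int(idx)            # lower neighbour
    frac = idx - lo          # interpolation weight
    if frac == 0:
        return float(s[lo])
    return s[lo] + frac * (s[lo + 1] - s[lo])

# Median of the keys 1..50 from the example above:
print(percentile(range(1, 51), 0.5))  # -> 25.5
```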

Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Hi everyone, Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame. Hive seems to imply that using ntile one can compute percentiles, quartiles, and therefore a median. Does anyone have experience with this?
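[Editor's note, not part of the original thread: as context for the ntile remark, ntile(n) assigns the sorted rows to n roughly equal buckets, with earlier buckets taking any remainder, so the boundary between tiles 2 and 3 of ntile(4) brackets the median. A plain-Python sketch of those bucket semantics; the ntile helper is hypothetical, illustrating the idea rather than the Hive window function itself.]

```python
def ntile(values, n):
    """Map each value to one of n roughly equal buckets (1-based) over the
    sorted order; earlier buckets absorb the remainder, as in SQL ntile."""
    s = sorted(values)
    size, extra = divmod(len(s), n)
    out, start = {}, 0
    for bucket in range(1, n + 1):
        end = start + size + (1 if bucket <= extra else 0)
        for v in s[start:end]:
            out[v] = bucket
        start = end
    return out

buckets = ntile(range(1, 9), 4)
# values 1-2 land in tile 1, 3-4 in tile 2, 5-6 in tile 3, 7-8 in tile 4;
# the median of 1..8 (4.5) sits exactly on the tile-2/tile-3 boundary
```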