Re: Is there a reduceByKey functionality in DataFrame API?

Marius Soutier Mon, 29 Aug 2016 06:32:29 -0700

Not really. A grouped DataFrame only provides SQL-like aggregate functions 
such as sum and avg (at least in Spark 1.5).
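
For anyone reading this on Spark 2.0 or later, the approach Holden describes in the quoted message below (groupByKey followed by reduceGroups) might look roughly like this. It is only a sketch: the Record case class, the sample rows, and the `smaller` helper are made up for illustration, and it assumes Spark 2.x is on the classpath and runs in local mode.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.min

// Hypothetical record type; the field names mirror col1/col2 from the question.
case class Record(col1: String, col2: Int)

object ReduceGroupsSketch {
  // Keeps the record with the smaller col2. The function passed to
  // reduceGroups should be associative and commutative, as this one is.
  def smaller(x: Record, y: Record): Record = if (x.col2 <= y.col2) x else y

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("reduce-groups-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val ds = Seq(Record("a", 3), Record("a", 1), Record("b", 2)).toDS()

    // DataFrame API: groupBy followed by a built-in aggregate such as min.
    ds.toDF().groupBy("col1").agg(min("col2")).show()

    // Dataset API (Spark 2.0+): an arbitrary reduce function applied per key.
    ds.groupByKey(_.col1).reduceGroups(smaller _).show()

    spark.stop()
  }
}
```

The first query returns one minimum per key; the second returns the whole record that holds the minimum, which is the kind of per-key reduce that the built-in DataFrame aggregates do not express directly.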


> On 29.08.2016, at 14:56, ayan guha <guha.a...@gmail.com> wrote:
> 
> Perhaps the confusion comes from the names of the two APIs. I think the 
> DataFrame API's groupBy takes its name from SQL, but it works similarly to 
> reduceByKey.
> 
> On 29 Aug 2016 20:57, "Marius Soutier" <mps....@gmail.com> wrote:
> In DataFrames (and thus in 1.5 in general) this is not possible, correct?
> 
>> On 11.08.2016, at 05:42, Holden Karau <hol...@pigscanfly.ca> wrote:
>> 
>> Hi Luis,
>> 
>> You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you 
>> can do groupBy followed by a reduce on the GroupedDataset ( 
>> http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset
>>  ) - this works on a per-key basis despite the different name. In Spark 2.0 
>> you would use groupByKey on the Dataset followed by reduceGroups ( 
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.KeyValueGroupedDataset
>>  ).
>> 
>> Cheers,
>> 
>> Holden :)
>> 
>> On Wed, Aug 10, 2016 at 5:15 PM, luismattor <luismat...@gmail.com> wrote:
>> Hi everyone,
>> 
>> Consider the following code:
>> 
>> val result = df.groupBy("col1").agg(min("col2"))
>> 
>> I know that rdd.reduceByKey(func) produces the same RDD as
>> rdd.groupByKey().mapValues(value => value.reduce(func)). However, reduceByKey
>> is more efficient, as it avoids shipping each value to the reducer doing the
>> aggregation (it ships partial aggregations instead).
>> 
>> I wonder whether the DataFrame API optimizes this code by doing something
>> similar to what RDD.reduceByKey does.
>> 
>> I am using Spark 1.6.2.
>> 
>> Regards,
>> Luis
>> 
>> -- 
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau

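The equivalence raised in the original question (reduceByKey versus groupByKey followed by a reduce) can be illustrated without Spark at all. The sketch below is plain Scala over an in-memory list of "partitions"; the object, its method names, and the sample data are hypothetical stand-ins and not Spark's implementation. It assumes the reduce function is associative and commutative, as reduceByKey itself requires.

```scala
object PartialAggregationSketch {
  // Hypothetical stand-in for an RDD: a list of partitions of key/value pairs.
  type Partitions[K, V] = Seq[Seq[(K, V)]]

  // Reduces all values that share a key within one collection of pairs.
  private def reducePerKey[K, V](pairs: Iterable[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // groupByKey-style: every pair leaves its partition, then values are reduced.
  def groupThenReduce[K, V](parts: Partitions[K, V])(f: (V, V) => V): Map[K, V] =
    reducePerKey(parts.flatten)(f)

  // reduceByKey-style: reduce inside each partition first, then merge partials.
  def reducePartialsFirst[K, V](parts: Partitions[K, V])(f: (V, V) => V): Map[K, V] = {
    val partials: Seq[Map[K, V]] = parts.map(p => reducePerKey(p)(f))
    reducePerKey(partials.flatten)(f)
  }

  // Number of pairs that cross partition boundaries under each strategy.
  def shippedWhenGrouping[K, V](parts: Partitions[K, V]): Int =
    parts.map(_.size).sum

  def shippedWithPartials[K, V](parts: Partitions[K, V]): Int =
    parts.map(_.map(_._1).distinct.size).sum

  // Made-up sample data: two partitions, keys "a" and "b".
  val sample: Partitions[String, Int] = Seq(
    Seq("a" -> 3, "a" -> 1, "b" -> 5),
    Seq("a" -> 2, "b" -> 4, "b" -> 7)
  )

  def main(args: Array[String]): Unit = {
    val f = (x: Int, y: Int) => math.min(x, y)
    println(groupThenReduce(sample)(f))     // same per-key result either way
    println(reducePartialsFirst(sample)(f))
    println(s"shipped: ${shippedWhenGrouping(sample)} vs ${shippedWithPartials(sample)}")
  }
}
```

Both strategies give the same per-key result, but the second moves at most one partial value per key per partition, which is the saving the question refers to.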