[ 
https://issues.apache.org/jira/browse/KAFKA-3511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guozhang Wang updated KAFKA-3511:
---------------------------------
    Description: 
Currently we have the following aggregation APIs in the Streams DSL:

{code}
KStream.aggregateByKey(..)
KStream.reduceByKey(..)
KStream.countByKey(..)
KTable.groupBy(...).aggregate(..)
KTable.groupBy(...).reduce(..)
KTable.groupBy(...).count(..)
{code}

And it is better to add common aggregation functions like Sum and Avg as 
built-in into the Streams DSL. A few questions to ask though:

1. Should we add those built-in functions as, for example 
{{KTable.groupBy(...).sum(...)} or {{KTable.groupBy(...).aggregate(SUM, ...)}}. 
Please see the comments below for detailed pros and cons.

2. If we go with the second option above, should we replace the countByKey / 
count operators with aggregate(COUNT) as well? Personally I (Guozhang) feel it 
is not necessary, as COUNT is a special aggregate function since we do not need 
to map on any value fields; this is the same approach as in Spark as well, 
where Count is built-in as first-citizen in the DSL, and others are built-in as 
{{aggregate(SUM)}}, etc.

  was:Currently we only have one built-in aggregate function count() in the 
Kafka Streams DSL, but we want to add more aggregation functions like sum() and 
avg().


> Add common aggregation functions like Sum and Avg as build-ins in Kafka 
> Streams DSL
> -----------------------------------------------------------------------------------
>
>                 Key: KAFKA-3511
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3511
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Guozhang Wang
>            Assignee: Ishita Mandhan
>              Labels: api
>             Fix For: 0.10.1.0
>
>
> Currently we have the following aggregation APIs in the Streams DSL:
> {code}
> KStream.aggregateByKey(..)
> KStream.reduceByKey(..)
> KStream.countByKey(..)
> KTable.groupBy(...).aggregate(..)
> KTable.groupBy(...).reduce(..)
> KTable.groupBy(...).count(..)
> {code}
> And it is better to add common aggregation functions like Sum and Avg as 
> built-in into the Streams DSL. A few questions to ask though:
> 1. Should we add those built-in functions as, for example 
> {{KTable.groupBy(...).sum(...)} or {{KTable.groupBy(...).aggregate(SUM, 
> ...)}}. Please see the comments below for detailed pros and cons.
> 2. If we go with the second option above, should we replace the countByKey / 
> count operators with aggregate(COUNT) as well? Personally I (Guozhang) feel 
> it is not necessary, as COUNT is a special aggregate function since we do not 
> need to map on any value fields; this is the same approach as in Spark as 
> well, where Count is built-in as first-citizen in the DSL, and others are 
> built-in as {{aggregate(SUM)}}, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to