It should be:

groupBy -> always trigger repartitioning
groupByKey -> maybe trigger repartitioning

And there will not be two repartitioning topics. The repartitioning will
be done by the groupBy/groupByKey operation, and thus, in the aggregation
step we know that data is correctly partitioned and there will be no
second repartitioning topic.

-Matthias

On 3/1/17 11:25 AM, Michael Noll wrote:
> FYI: The difference between `groupBy` (may trigger re-partitioning) vs.
> `groupByKey` (does not trigger re-partitioning) also applies to:
>
> - `map` vs. `mapValues`
> - `flatMap` vs. `flatMapValues`
>
>
>
> On Wed, Mar 1, 2017 at 8:15 PM, Damian Guy <damian....@gmail.com> wrote:
>
>> If you use stream.groupByKey() then there will be no repartitioning as long
>> as there have been no key changing operations preceding it, i.e., map,
>> selectKey, flatMap, transform. If you use stream.groupBy(...) then we see
>> it as a key changing operation, hence we need to repartition the data.
>>
>> On Wed, 1 Mar 2017 at 18:59 Tianji Li <skyah...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I wonder if it makes sense to give the option to disable auto
>>> repartitioning while doing groupBy.
>>>
>>> I understand with https://issues.apache.org/jira/browse/KAFKA-3561,
>>> an internal topic for repartition will be automatically created and synced
>>> to brokers, which is useful when aggregation keys are not the ones used
>>> when ingesting raw data.
>>>
>>> However, in my case, I have carefully partitioned the data when ingesting
>>> my raw topics. If I do groupBy followed by aggregation, there will be TWO
>>> change logs topics, one for groupBy another for aggregation.
>>>
>>> Does it make sense to make the groupBy one configurable?
>>>
>>> Thanks
>>> Tianji
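
For readers of the archive, the rule discussed in this thread can be sketched as a small toy model. This is plain Java and deliberately does not use the Kafka Streams API: the class `RepartitionModel`, its methods, and the topic names it generates are all hypothetical and only mirror the DSL operation names mentioned above. It assumes the behaviour as described by Matthias and Damian: key-changing operations mark the stream, groupBy always repartitions, groupByKey repartitions only if the stream is marked, and the aggregation step never adds a second repartition topic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (NOT the Kafka Streams API) of when a repartition topic is needed. */
public class RepartitionModel {
    // Set once an operation that may change the key has been applied.
    private boolean keyMayHaveChanged = false;
    // Hypothetical names of the repartition topics this pipeline would need.
    private final List<String> repartitionTopics = new ArrayList<>();

    // Key-changing operations: mark the stream as possibly mis-partitioned.
    public RepartitionModel map()       { keyMayHaveChanged = true; return this; }
    public RepartitionModel selectKey() { keyMayHaveChanged = true; return this; }

    // Value-only operation: the key, and thus the partitioning, is untouched.
    public RepartitionModel mapValues() { return this; }

    // groupBy is itself treated as key-changing, so it always repartitions.
    public RepartitionModel groupBy() {
        addRepartitionTopic("groupBy");
        return this;
    }

    // groupByKey repartitions only if a key-changing operation preceded it.
    public RepartitionModel groupByKey() {
        if (keyMayHaveChanged) {
            addRepartitionTopic("groupByKey");
        }
        return this;
    }

    // After grouping, data is correctly partitioned: no second repartition topic.
    public RepartitionModel aggregate() { return this; }

    public int repartitionTopicCount() { return repartitionTopics.size(); }

    private void addRepartitionTopic(String op) {
        repartitionTopics.add(op + "-repartition-" + repartitionTopics.size());
        keyMayHaveChanged = false; // data is now partitioned by the current key
    }

    public static void main(String[] args) {
        // Carefully pre-partitioned input, grouped by its existing key: prints 0
        System.out.println(new RepartitionModel().groupByKey().aggregate().repartitionTopicCount());
        // Key changed before grouping: prints 1
        System.out.println(new RepartitionModel().map().groupByKey().aggregate().repartitionTopicCount());
        // groupBy always repartitions, but only once: prints 1
        System.out.println(new RepartitionModel().groupBy().aggregate().repartitionTopicCount());
    }
}
```

Under this model, the case from the original question (input already partitioned by the aggregation key) corresponds to the first call in `main`: using groupByKey instead of groupBy yields no repartition topic.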