Thanks, that makes sense ! On Mon, Nov 26, 2018 at 1:06 PM Fabian Hueske <fhue...@gmail.com> wrote:
> Hi, > > DataStream x = ... > x.rebalance().keyBy() > > is not a good idea. > > It will first distribute the records round-robin (over the network) and > subsequently partition them by hash. > The first shuffle is unnecessary. It does not have any effect because it > is undone by the second partitioning. > > Btw. any methods on DataStream do not have any effect on Kafka topcis or > partitions. > In the initially quoted example, we assume that the events of the original > DataStream are not evenly distributed among the parallel tasks. The > rebalance() call generates an even distribution which is especially > important if the map() operation is heavy-weight / compute intensive. > > Best, Fabian > > > > > > Am Mo., 26. Nov. 2018 um 10:59 Uhr schrieb Taher Koitawala < > taher.koitaw...@gslab.com>: > >> You can use rebalance before keyBy because rebalance returns DataStream. >> The API does not allow rebalance on keyedStreamed which is returned after >> keyBy so you are safe. >> >> On Mon 26 Nov, 2018, 2:25 PM Avi Levi <avi.l...@bluevoyant.com wrote: >> >>> Ok, thanks for the clarification. but if I use it with keyed state so >>> the partition is by the key. rebalancing will not shuffle this partitioning >>> ? e.g >>> .addSource(source) >>> .rebalance >>> .keyBy(_.id) >>> .mapWithState(...) >>> >>> >>> On Mon, Nov 26, 2018 at 8:32 AM Taher Koitawala < >>> taher.koitaw...@gslab.com> wrote: >>> >>>> Hi Avi, >>>> No, rebalance is not changing the number of kafka partitions. >>>> Lets say you have 6 kafka partitions and your flink parallelism is 8, in >>>> this case using rebalance will send records to all downstream operators in >>>> a round robin fashion. >>>> >>>> Regards, >>>> Taher Koitawala >>>> GS Lab Pune >>>> +91 8407979163 >>>> >>>> >>>> On Mon, Nov 26, 2018 at 11:33 AM Avi Levi <avi.l...@bluevoyant.com> >>>> wrote: >>>> >>>>> Hi >>>>> Looking at this example >>>>> <https://github.com/dataArtisans/kafka-example/blob/master/src/main/java/com/dataartisans/ReadFromKafka.java>, >>>>> doing the "rebalance" (e.g messageStream.rebalance().map(...) ) >>>>> operation on heavy load stream wouldn't slow the stream ? is the >>>>> rebalancing action occurs only when there is a partition change ? >>>>> it says that "the rebelance call is causing a repartitioning of the >>>>> data so that all machines" is it actually changing the num of >>>>> partitions of the topic to match the num of flink operators ? >>>>> >>>>> Avi >>>>> >>>>