Re: kafka partition assignment

2016-06-21 Thread Michal Hariš
Hi Tai, OK, thanks for confirming. I understand that streaming shuffle is cheaper than batch-spill shuffle, but nevertheless may be unacceptable in large volume applications - it is still a network shuffle and that's the biggest part of the cost. Now on the point of the trade off between load-bala

Re: kafka partition assignment

2016-06-20 Thread Tai Gordon
Hi Michal, I see, thanks for the description. I think you’ve definitely raised an interesting point. Yes, there may be unnecessary shuffle in this case (the partitionCustom on consumed DataStream doesn’t override the assignPartitions in the Kafka connector; custom partitioners are applied after th

Re: kafka partition assignment

2016-06-20 Thread Michal Hariš
Hi Tai, I was referring to co-partitioning, not co-location of leaders, i.e. multiple topics that share the same partitioning scheme. By example, say I have 2 topics which share the same keyspace and which are produced by something other than Flink using identical partitioner. The data in these 2 t

Re: kafka partition assignment

2016-06-20 Thread Tai Gordon
Hi Michal, Whether or not the external system's partitioning scheme is referenced when assigning partitions to the consumer parallel subtasks depends on the implementation of each connector / source. First, clarification on “co-partitioning": from your context I’m assuming you’re referring to co-

kafka partition assignment

2016-06-16 Thread Michal Hariš
Hi, I was recently looking into a kafka connector issue (FLINK-4023 / FLINK-4069), when it was pointed out that partition assignment will not be deterministic if the partition discovery is imply moved to the open() method. In the assignPartitions of FlinkKafkaConsumerBase a modulo on the __index__