Hi,

We are experiencing issues scaling our Flink application, and we have observed that the cause may be that Kafka message consumption is not balanced across partitions. The attached image (lag per partition) shows that at first only one partition is consumed (the blue one in the back), and it wasn't until it finished that the other ones started to be consumed at a good rate (in fact, the total throughput multiplied by 4 when they started). Also, once those started to be consumed, one partition simply stopped and accumulated messages again until they finished.
We don't see any resource (CPU, network, disk, ...) struggling in our cluster, so we are not sure what could be causing this behavior. I can only assume that Flink or the Kafka consumer is somehow artificially slowing down the other partitions. Maybe this is due to how back pressure is handled?

<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1007/consumer_max_lag.png>

Gerard