[ https://issues.apache.org/jira/browse/FLINK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616853#comment-16616853 ]
Jiayi Liao commented on FLINK-10348:
------------------------------------

[~twalthr] [~Zentol] What do you think of this feature?

> Solve data skew when consuming data from kafka
> ----------------------------------------------
>
>                 Key: FLINK-10348
>                 URL: https://issues.apache.org/jira/browse/FLINK-10348
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kafka Connector
>    Affects Versions: 1.6.0
>            Reporter: Jiayi Liao
>            Assignee: Jiayi Liao
>            Priority: Major
>
> With KafkaConsumer, our strategy is to send fetch requests to brokers with a
> fixed fetch size. Assume topic x has n partitions and there is data skew
> between the partitions. When we consume topic x from the earliest offset,
> each fetch request returns at most the configured max fetch size of data.
> The problem is that when a task consumes data from both "big" partitions and
> "small" partitions, records from the "big" partitions may become late
> elements, because the "small" partitions are consumed faster.
>
> *Solution:*
> I think we can leverage two parameters to control this:
> 1. data.skew.check // whether to check for data skew
> 2. data.skew.check.interval // the interval between checks
> Every data.skew.check.interval we check the latest offset of each partition,
> calculate the lag (latest offset - current offset), identify the partitions
> that need to slow down, and redefine their fetch size.
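A minimal sketch of how the proposed check could work, assuming it is invoked once per data.skew.check.interval from the thread that owns the consumer (KafkaConsumer is not thread-safe). The class name DataSkewChecker, the checkAndAdjust method, and the per-partition fetchSizes map are hypothetical names for illustration; the vanilla KafkaConsumer only exposes a global max.partition.fetch.bytes, so applying the computed per-partition sizes would need a hook in the connector's fetcher.

{code:java}
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/**
 * Hypothetical sketch of the proposed skew check: compare per-partition lag
 * (latest offset - current offset) and shrink the fetch size of partitions
 * whose lag is far below the maximum, so the high-lag ("big") partitions are
 * not outpaced by the "small" ones.
 */
public class DataSkewChecker {

    private final KafkaConsumer<?, ?> consumer;
    private final int defaultFetchSize; // e.g. the global max.partition.fetch.bytes
    private final Map<TopicPartition, Integer> fetchSizes = new HashMap<>();

    public DataSkewChecker(KafkaConsumer<?, ?> consumer, int defaultFetchSize) {
        this.consumer = consumer;
        this.defaultFetchSize = defaultFetchSize;
    }

    /** Called once per data.skew.check.interval from the consumer thread. */
    public Map<TopicPartition, Integer> checkAndAdjust() {
        // Latest offsets of all assigned partitions, fetched from the brokers.
        Map<TopicPartition, Long> latest = consumer.endOffsets(consumer.assignment());

        // lag = latest offset - current offset, per partition.
        Map<TopicPartition, Long> lags = new HashMap<>();
        long maxLag = 0L;
        for (TopicPartition tp : consumer.assignment()) {
            long lag = latest.get(tp) - consumer.position(tp);
            lags.put(tp, lag);
            maxLag = Math.max(maxLag, lag);
        }

        // Partitions with comparatively small lag are being consumed too fast:
        // scale their fetch size down in proportion to their lag so the skewed
        // partitions can catch up. A minimum of 1 keeps every partition moving.
        for (Map.Entry<TopicPartition, Long> e : lags.entrySet()) {
            double ratio = maxLag == 0 ? 1.0 : (double) e.getValue() / maxLag;
            int size = (int) Math.max(1, Math.round(defaultFetchSize * ratio));
            fetchSizes.put(e.getKey(), size);
        }
        return fetchSizes;
    }
}
{code}

The sketch only computes the sizes; how they are applied (and whether pausing low-lag partitions would be a simpler alternative) is left to the connector implementation.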