Jiayi Liao created FLINK-10348:
----------------------------------

             Summary: Solve data skew when consuming data from kafka
                 Key: FLINK-10348
                 URL: https://issues.apache.org/jira/browse/FLINK-10348
             Project: Flink
          Issue Type: New Feature
          Components: Kafka Connector
    Affects Versions: 1.6.0
            Reporter: Jiayi Liao
            Assignee: Jiayi Liao


By using KafkaConsumer, our strategy is to send fetch request to brokers with a 
fixed fetch size. Assume x topic has n partition and there exists data skew 
between partitions, now we need to consume data from x topic with earliest 
offset, and we can get max fetch size data in every fetch request. The problem 
is that when an task consumes data from both "big" partitions and "small" 
partitions, the data in "big" partitions may be late elements because "small" 
partitions are consumed faster.

*Solution: *
I think we can leverage two parameters to control this.
1. data.skew.check // whether to check data skew
2. data.skew.check.interval // the interval between checks
Every data.skew.check.interval, we will check the latest offset of every 
specific partition, and calculate (latest offset - current offset), then get 
partitions which need to slow down and redefine their fetch size.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to