[ https://issues.apache.org/jira/browse/KAFKA-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089449#comment-15089449 ]
Michael Graff commented on KAFKA-3064:
--------------------------------------

Example log messages indicating this issue is occurring:

Jan 8 16:22:25 broker2 kafka [ReplicaFetcherThread-3-1] WARN [ReplicaFetcherThread-3-1], Replica 2 for partition [nom-dns-base-text,3] reset its fetch offset from 372718324 to current leader 1's start offset 372718324 (kafka.server.ReplicaFetcherThread)
Jan 8 16:22:25 broker2 kafka [ReplicaFetcherThread-3-1] ERROR [ReplicaFetcherThread-3-1], Current offset 372712344 for partition [nom-dns-base-text,3] out of range; reset offset to 372718324 (kafka.server.ReplicaFetcherThread)


> Improve resync method to waste less time and data transfer
> ----------------------------------------------------------
>
>                 Key: KAFKA-3064
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3064
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller, network
>            Reporter: Michael Graff
>            Assignee: Neha Narkhede
>
> We have several topics which are large (65 GB per partition) with 12 partitions. Data rates into each topic vary, but in general each one has its own rate.
>
> After a RAID rebuild, we are pulling all the data over to the newly rebuilt RAID. This takes forever, and has yet to complete after nearly 8 hours. Here are my observations:
>
> (1) The Kafka broker seems to pull from all topics on all partitions at the same time, starting at the oldest message.
>
> (2) When you divide the total available disk bandwidth across all partitions (really, only 48 of which have significant amounts of data, about 65 GB * 12 = 780 GB per topic), the ingest rate of 36 of those 48 is higher than the bandwidth available to each of them.
>
> (3) The effect of (2) is that one topic SLOWLY catches up, while the other 4 topics continue to retrieve data at 75% of the bandwidth, only to throw it away because the source broker has already discarded it.
>
> (4) Eventually that one topic catches up, and the remaining bandwidth is then divided among the remaining 36 partitions, one group of which starts to catch up again.
>
> What I want to see is a way to say “don’t transfer more than X partitions at the same time”, and ideally a priority rule that says, “Transfer partitions you are responsible for first, then transfer ones you are not. Also, transfer these first, then those, but no more than 1 topic at a time.”
>
> What I REALLY want is for Kafka to track the new data (track the head of the log) and then ask for the tail in chunks. Ideally this would ask the source, “what is the next logical older starting point?”, and then start there. This way, the transfer basically becomes a file transfer of the log stored on the source’s disk. Once that block is retrieved, it moves on to the next oldest. This way, there is almost zero waste as both the head and tail grow; only the final (oldest) chunk of the tail risks being lost, so bandwidth is not significantly wasted.
>
> All this changes the ISR check to “am I caught up on both the head AND the tail?”, whereas today the tail part is only implied.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
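For illustration only, here is a minimal sketch in Scala of the kind of throttled, prioritized resync policy proposed above: cap the number of partitions being transferred at once, prefer partitions this broker is responsible for, and work through one topic at a time. This is not Kafka code; PartitionState, ResyncScheduler, selectNext and the ordering by total lag are hypothetical names and choices made for the example.

{code:scala}
// Hypothetical sketch (not Kafka code): decide which partitions a recovering
// broker should fetch next, honoring a cap on concurrent transfers and a
// simple priority order.
case class PartitionState(topic: String, partition: Int,
                          isPreferredLeader: Boolean, lagBytes: Long)

object ResyncScheduler {
  /** Choose up to `maxConcurrent` partitions to resync next. */
  def selectNext(partitions: Seq[PartitionState], maxConcurrent: Int): Seq[PartitionState] = {
    // Partitions this broker is "responsible for" (preferred leader) come first.
    val (owned, others) = partitions.partition(_.isPreferredLeader)

    // Within each group, work on one topic at a time (here ordered by smallest
    // total lag, one arbitrary choice), so a single topic actually catches up
    // instead of every topic crawling along together.
    def oneTopicAtATime(group: Seq[PartitionState]): Seq[PartitionState] =
      group.groupBy(_.topic).toSeq
        .sortBy { case (_, ps) => ps.map(_.lagBytes).sum }
        .flatMap { case (_, ps) => ps.sortBy(_.lagBytes) }

    (oneTopicAtATime(owned) ++ oneTopicAtATime(others)).take(maxConcurrent)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      PartitionState("nom-dns-base-text", 3, isPreferredLeader = false, lagBytes = 60L << 30),
      PartitionState("nom-dns-base-text", 7, isPreferredLeader = true,  lagBytes = 65L << 30),
      PartitionState("another-topic",     0, isPreferredLeader = true,  lagBytes = 10L << 30)
    )
    // With maxConcurrent = 2, only the two "owned" partitions are selected first.
    selectNext(parts, maxConcurrent = 2).foreach(println)
  }
}
{code}

Under such a policy the recovering broker would concentrate its disk bandwidth on a few partitions until they catch up, rather than fetching all partitions at once and discarding data the source has already expired.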
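Likewise, a rough model of the “track the head, back-fill the tail in chunks” idea, assuming a hypothetical leader view that exposes segment base offsets as the “next logical older starting point”. LeaderView, nextOlderStartingPoint and planChunks are invented names; this only plans offset ranges and is not the actual replication protocol.

{code:scala}
// Hypothetical sketch (not Kafka's protocol): back-fill the tail of a
// partition newest-chunk-first, asking the leader for the next older chunk
// boundary (e.g. a segment base offset) before each chunk.
object ChunkedTailFetch {
  // Stand-in for what the leader would expose: segment base offsets and log start offset.
  final case class LeaderView(segmentBaseOffsets: Seq[Long], logStartOffset: Long)

  /** "What is the next logical older starting point?" -- the newest boundary below `upperOffset`. */
  def nextOlderStartingPoint(leader: LeaderView, upperOffset: Long): Option[Long] =
    leader.segmentBaseOffsets
      .filter(base => base < upperOffset && base >= leader.logStartOffset)
      .reduceOption(_ max _)

  /** Plan the tail catch-up as (startOffset, endOffset) chunks, newest first. */
  def planChunks(leader: LeaderView, headOffset: Long): Seq[(Long, Long)] = {
    val boundaries = Iterator
      .iterate(Option(headOffset)) {
        case Some(upper) => nextOlderStartingPoint(leader, upper)
        case None        => None
      }
      .takeWhile(_.isDefined)
      .flatten
      .toSeq
    // Pair consecutive boundaries into chunks: [300, 350), [200, 300), ...
    boundaries.sliding(2).collect { case Seq(end, start) => (start, end) }.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Leader has segments starting at offsets 100, 200 and 300; everything before 100 was deleted.
    val leader = LeaderView(segmentBaseOffsets = Seq(100L, 200L, 300L), logStartOffset = 100L)
    // Follower tracks the head from offset 350 onward and back-fills the tail newest-first.
    planChunks(leader, headOffset = 350L).foreach { case (start, end) =>
      println(s"fetch chunk [$start, $end)")
    }
  }
}
{code}

Because the chunks are planned newest-first, only the oldest remaining chunk can be lost to retention on the leader, which is the property the proposal is after.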