[ https://issues.apache.org/jira/browse/KAFKA-3064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121886#comment-15121886 ]
Michael Graff commented on KAFKA-3064:
--------------------------------------

Any thoughts on this issue? Even if all we had were two coarse-grained setting knobs, our system would behave much better if/when we take a broker down. We've been doing that a lot because of what appeared to be chronic hardware issues, though we believe those are now resolved.

Knob one: re-sync order. This would be a topic-level setting that specifies a sync priority. Multiple topics could share the same level, and a higher-priority topic would be fully sync'd before any lower-priority topic. I would suggest a priority between 0 and <max-int>, where lower numbers mean lower priority. If there are other settings where the reverse is true, that would work too, but unset or "0" might then need special handling.

Knob two: a mode that says "use an alternate sync method", where only new data is tracked and no historical data is pulled. This would be ignored for compacted topics. For our high-data-rate topics, this would let a re-sync complete in about 8 hours, where it currently takes 24+.

Knob three: a wish. Later, once Knob two is in place and the definition of "in sync" means "I have the first and last offset from the leader", a saner historical pull method could be added: the head is tracked while older data is progressively loaded in blocks, in reverse. For instance, the follower could ask "what is the next most logical sync offset prior to X?" The first answer would be the start of the currently open log file; from that point onward, the next most logical sync offset is the start of the previous file, one by one, until both head and tail are caught up.

This would significantly reduce the time taken to sync Kafka topics, and give the operator some control over which topics are truly more important than others.
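A minimal sketch of Knob one, assuming a hypothetical per-topic setting named resync.priority (no such config exists in Kafka today): the follower would simply sort the partitions it still has to catch up on by that value, highest first, before fetching.

{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of Knob one: order partitions for re-sync by a
// per-topic "resync.priority" value (higher number = higher priority).
// This config key does not exist in Kafka; it is illustration only.
public class PriorityResyncOrder {

    // Minimal stand-in for a topic-partition so the sketch is self-contained.
    public record TopicPartition(String topic, int partition) {}

    /**
     * @param partitions      partitions this replica still has to catch up on
     * @param priorityByTopic hypothetical per-topic priority, 0 .. Integer.MAX_VALUE,
     *                        where lower numbers mean lower priority
     * @return the same partitions, ordered so the highest-priority topics are fetched first
     */
    public static List<TopicPartition> order(List<TopicPartition> partitions,
                                             Map<String, Integer> priorityByTopic) {
        Comparator<TopicPartition> byPriority =
                Comparator.comparingInt(tp -> priorityByTopic.getOrDefault(tp.topic(), 0));
        return partitions.stream()
                .sorted(byPriority.reversed())
                .collect(Collectors.toList());
    }
}
{code}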
> Improve resync method to waste less time and data transfer
> -----------------------------------------------------------
>
>                 Key: KAFKA-3064
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3064
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller, network
>    Affects Versions: 0.8.2.1, 0.9.0.0
>            Reporter: Michael Graff
>            Assignee: Neha Narkhede
>
> We have several topics which are large (65 GB per partition) with 12 partitions. Data rates into each topic vary, but in general each one has its own rate.
> After a RAID rebuild, we are pulling all the data over to the newly rebuilt array. This takes forever and has yet to complete after nearly 8 hours. Here are my observations:
> (1) The Kafka broker seems to pull from all topics on all partitions at the same time, starting at the oldest message.
> (2) When you divide the total available disk bandwidth across all partitions (really, only 48 of which have significant amounts of data, about 65 GB * 12 = 780 GB per topic), the ingest rate of 36 out of those 48 is higher than the available bandwidth.
> (3) The effect of (2) is that one topic SLOWLY catches up, while the other 4 topics keep retrieving data at 75% of the bandwidth, only to throw it away because the source broker has already discarded it.
> (4) Eventually that one topic catches up, and the remaining bandwidth is then divided among the remaining 36 partitions, one group of which starts to catch up again.
> What I want to see is a way to say “don’t transfer more than X partitions at the same time”, and ideally a priority rule that says, “Transfer partitions you are responsible for first, then transfer ones you are not. Also, transfer these first, then those, but no more than 1 topic at a time.”
> What I REALLY want is for Kafka to track the new data (track the head of the log) and then ask for the tail in chunks. Ideally, the follower would ask the source, “what is the next logical older starting point?”, and start there. That way the transfer basically becomes a file transfer of the log stored on the source’s disk. Once a block is retrieved, it moves on to the next oldest. There is almost zero waste as both the head and tail grow; only the final chunk of the tail is at risk of being lost, so bandwidth is not significantly wasted.
> All of this changes the ISR check to “am I caught up on head AND tail?”, where right now the tail part is only implied.
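A minimal sketch of the "track the head, pull the tail in chunks" idea described above (Knob three in the comment), assuming a hypothetical leader-side request previousSegmentStartOffset that answers "what is the next logical older starting point?". Nothing like this exists in Kafka's replication protocol today; the interface and its methods are invented purely for illustration.

{code:java}
// Hypothetical sketch of chunked, newest-to-oldest catch-up. The normal
// fetcher keeps tracking the head; this loop back-fills whole log files
// (segments) one at a time until the leader's log start offset is reached.
public class ReverseChunkCatchUp {

    /** Leader-side operations the proposal would need (not a real Kafka API). */
    public interface LeaderClient {
        long logStartOffset();                        // oldest offset the leader still retains
        long previousSegmentStartOffset(long before); // "next logical older starting point" prior to 'before'
        void copySegment(long segmentStartOffset);    // bulk-transfer one whole log file (segment)
    }

    /**
     * Back-fill history newest-to-oldest. Only the oldest not-yet-copied
     * segment risks being discarded by the leader before it is reached,
     * so almost no transferred bandwidth is wasted.
     *
     * @param headStartOffset start of the currently open segment, which the
     *                        head-tracking fetcher already covers
     */
    public static void catchUpTail(LeaderClient leader, long headStartOffset) {
        long nextBoundary = headStartOffset;
        while (nextBoundary > leader.logStartOffset()) {
            long segmentStart = leader.previousSegmentStartOffset(nextBoundary);
            leader.copySegment(segmentStart);
            nextBoundary = segmentStart;
        }
    }
}
{code}

Under a scheme like this, the ISR check discussed at the end of the report would become explicit: a replica is in sync only when it holds both the leader's first and last offsets, i.e. it is caught up on head and tail.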