[ https://issues.apache.org/jira/browse/KAFKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635283#comment-14635283 ]
Flavio Junqueira commented on KAFKA-2188:
-----------------------------------------

Hey Tim, I had a look at the proposal, and I have some feedback, mostly questions at this point. I like this improvement, and in general, I've found that we can improve exception handling in Kafka quite a bit. This is clearly one such effort. Here are some more concrete points:

# In the exception handler section, I'd say that the best approach is to be conservative and remove the drive in the case of an error. Let's not optimize too much by trying to identify the exact partitions that are affected by an error. If there is an error, then let an operator check it out and reinsert the drive once it is fixed. As part of this comment, I'd say that it would be a good feature to allow drives to be inserted manually.
# In the notifying controller discussion, could you be more specific about the race you're concerned about? I can tell that you're pointing to a potential race, but I'm not sure what it is.
# Open question 1: disk availability. It's hard to detect exactly what happened with a faulty disk. It could be that the disk is full, that the drive is bad, or even just some annoying data corruption. I don't think it is worth spending a lot of time and effort trying to build a thorough check. If we spot an error, then remove the drive and log it. I don't know if there is any typical mechanism for notifying operators in Kafka.
# Open question 2: log read. I think I know the problem you're referring to, and I'll have a look to see if I can suggest a decent alternative, but we might need to make it a bit less efficient to be able to handle IO errors properly.
# Open question 3: restart partition. This is about the race I asked about above.
# Open question 4: operation retries. What would be a situation in which it is worth retrying?

I was actually wondering if some users would be interested in leaving a fraction of the drives unused so that they can replace faulty drives over time.
The advantage is that the broker can maintain its capacity despite faulty drives, though you would certainly have some unused IO capacity in the broker.

> JBOD Support
> ------------
>
>          Key: KAFKA-2188
>          URL: https://issues.apache.org/jira/browse/KAFKA-2188
>      Project: Kafka
>   Issue Type: Bug
>     Reporter: Andrii Biletskyi
>     Assignee: Andrii Biletskyi
>  Attachments: KAFKA-2188.patch, KAFKA-2188.patch, KAFKA-2188.patch
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support

--
This message was sent by Atlassian JIRA (v6.3.4#6332)