[ 
https://issues.apache.org/jira/browse/KAFKA-2188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635283#comment-14635283
 ] 

Flavio Junqueira commented on KAFKA-2188:
-----------------------------------------

Hey Tim, I had a look at the proposal, and I have some feedback, mostly 
questions at this point. I like this improvement, and in general, I've found 
that we can improve exception handling in Kafka quite a bit. This is clearly 
one such great effort. Here are some more concrete points:

# In the exception handler section, I'd say that the best approach is to be 
conservative and remove the drive in the case of an error. Let's not optimize 
too much trying to get the exact partitions that are affected by an error and 
such. If there is an error, then let an operator check it out and reinsert the 
drive when fixed. As part of this comment, I'd say that it'd be a good feature 
to allow drives to be inserted (manually).
# In the notifying controller discussion, could you be more specific about the 
race you're concerned about? I can tell that you're pointing to a potential 
race, but I'm not sure what it is.
# Open question 1: disk availability. It's kind of hard to detect exactly what 
happened with a faulty disk. The disk could be full, the drive could be bad, or 
it could even be just some annoying data corruption. I don't think it is worth 
spending tons of time and effort trying to make a great check. If we spot an 
error, then remove the drive and log it. I don't know whether Kafka has any 
typical mechanism for notifying operators.
# Open question 2: log read. I think I know the problem you're referring to, 
and I'll have a look to see if I can suggest some decent alternative, but we 
might need to make it a bit less efficient to be able to handle IO errors 
properly.
# Open question 3: restart partition. This is about the race I asked about above. 
# Open question 4: operation retries. What would be a situation in which it is 
worth retrying? 
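
To make the conservative policy from the first and third points concrete, here 
is a minimal sketch in Python. It is not Kafka code: DriveRegistry, guarded_io, 
and the drive paths are hypothetical names used only for illustration. Any IO 
error takes the whole drive out of service and logs it, with no attempt to work 
out which partitions were affected, and an operator reinserts the drive 
manually once it is fixed.

```python
import logging

log = logging.getLogger("jbod-sketch")


class DriveRegistry:
    """Hypothetical tracker of which drives (log directories) are in service."""

    def __init__(self, drives):
        self.online = set(drives)
        self.offline = {}  # drive -> description of the error that removed it

    def on_io_error(self, drive, error):
        # Conservative policy: any IO error takes the whole drive offline.
        # No attempt is made to classify the error or find affected partitions.
        if drive in self.online:
            self.online.remove(drive)
            self.offline[drive] = repr(error)
            log.error("Removing drive %s after IO error: %r", drive, error)

    def reinsert(self, drive):
        # Manual step, meant to be triggered by an operator after the fix.
        if drive not in self.offline:
            raise ValueError(f"{drive} is not offline")
        del self.offline[drive]
        self.online.add(drive)


def guarded_io(registry, drive, operation):
    """Run an IO operation; on OSError, remove the drive and re-raise."""
    if drive not in registry.online:
        raise RuntimeError(f"{drive} is offline")
    try:
        return operation()
    except OSError as e:
        registry.on_io_error(drive, e)
        raise
```

The sketch deliberately does not retry: whether a retry is ever worthwhile is 
the open question in the last point above.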

I was actually wondering if some users would be interested in leaving a 
fraction of the drives unused, as spares that replace faulty drives over time. 
The advantage is being able to maintain the capacity of a broker despite 
faulty drives; the cost is that some IO capacity in the broker sits unused. 
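
As a rough illustration of the spare-drive idea, the following Python sketch 
keeps a pool of idle drives and swaps one in when an active drive fails. 
SparePool and the drive names are hypothetical, and the sketch ignores how data 
would be re-replicated onto the replacement.

```python
class SparePool:
    """Hypothetical pool of active drives plus idle spares."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)
        self.failed = []

    def fail(self, drive):
        # Take the faulty drive out of service.
        self.active.remove(drive)
        self.failed.append(drive)
        # Promote a spare if one is left, so the active count stays constant.
        if self.spares:
            replacement = self.spares.pop(0)
            self.active.append(replacement)
            return replacement
        return None  # no spares left: the broker now runs with fewer drives
```

With three active drives and one spare, the first failure leaves three drives 
active and the second leaves two.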

> JBOD Support
> ------------
>
>                 Key: KAFKA-2188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2188
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Andrii Biletskyi
>            Assignee: Andrii Biletskyi
>         Attachments: KAFKA-2188.patch, KAFKA-2188.patch, KAFKA-2188.patch
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-18+-+JBOD+Support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
