[ 
https://issues.apache.org/jira/browse/KAFKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216486#comment-14216486
 ] 

Joel Koshy commented on KAFKA-1755:
-----------------------------------

There are a couple of issues I had in mind as being in scope for this jira:
* Log cleaner threads quitting on errors (which may be a non-issue as discussed 
further below).
* Dealing with cleaner failures due to unkeyed messages.
* Other cleaner failures are possible as well (e.g., compressed message sets, 
until KAFKA-1374 is reviewed and checked in)

This jira was filed because the log cleaner handles all compacted topics, so 
an error in one topic should (ideally) not affect compaction of the others. 
Any practical deployment would need to set up alerts on the cleaner thread 
dying. 
Right now, I think the most reliable way to alert (with the currently available 
metrics) would be to monitor the max-dirty-ratio. If we set up this alert, then 
allowing the cleaner to continue would in practice only delay an alert. So one 
can argue that it is better to fail fast - i.e., let the log cleaner die 
because a problematic topic is something that needs to be looked into 
immediately. However, I think there are further improvements with alternatives 
that can be made. It would be helpful if others can share their 
thoughts/preferences on these:
* Introduce a new LogCleaningState: LogCleaningPausedDueToError
* Introduce a metric for the number of live cleaner threads
* If the log cleaner encounters any uncaught error, there are a couple of 
options:
** Don't let the thread die, but move the partition to 
LogCleaningPausedDueToError. Other topic-partitions can still be compacted. 
Alerts can be set up on the number of partitions in state 
LogCleaningPausedDueToError.
** Let the cleaner die and decrement live cleaner count. Alerts can be set up 
on the number of live cleaner threads.
* If the cleaner encounters unkeyed messages:
** Delete those messages and otherwise do nothing, i.e., silently ignore them 
(or just log the count in the log cleaner stats)
** Keep the messages and move the partition to LogCleaningPausedDueToError. 
The motivation here is handling accidental misconfiguration, i.e., it may be 
important not to lose those messages. The error log cleaning state can be 
cleared only by deleting and then recreating the topic.
* Additionally, I think we should reject producer requests containing unkeyed 
messages to compacted topics.
* With all of the above, a backup alert can also be set up on the 
max-dirty-ratio.
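
To make the first alternative concrete, here is a rough sketch of the 
pause-on-error cleaning loop. All names here (CleanerSketch, cleanPartition, 
pausedDueToErrorCount, the PAUSED_DUE_TO_ERROR state) are hypothetical 
illustrations, not actual Kafka code: a failure in one partition parks only 
that partition, the loop moves on to the others, and the count of parked 
partitions is exposed as a metric to alert on.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the pause-on-error proposal; not actual Kafka code.
public class CleanerSketch {
    enum LogCleaningState { NONE, IN_PROGRESS, PAUSED_DUE_TO_ERROR }

    private final Map<String, LogCleaningState> partitionStates =
        new ConcurrentHashMap<>();

    // Simulated per-partition clean; throws for a "bad" partition
    // (standing in for, e.g., an unkeyed message in a compacted topic).
    void cleanPartition(String topicPartition) {
        if (topicPartition.contains("bad"))
            throw new IllegalStateException("unkeyed message in " + topicPartition);
    }

    void cleanAll(List<String> partitions) {
        for (String tp : partitions) {
            // Skip partitions already parked by a prior error.
            if (partitionStates.get(tp) == LogCleaningState.PAUSED_DUE_TO_ERROR)
                continue;
            try {
                partitionStates.put(tp, LogCleaningState.IN_PROGRESS);
                cleanPartition(tp);
                partitionStates.put(tp, LogCleaningState.NONE);
            } catch (Throwable t) {
                // Instead of letting the cleaner thread die, park only this
                // partition; the loop continues with the remaining partitions.
                partitionStates.put(tp, LogCleaningState.PAUSED_DUE_TO_ERROR);
            }
        }
    }

    // The metric to alert on: number of partitions paused due to error.
    long pausedDueToErrorCount() {
        return partitionStates.values().stream()
            .filter(s -> s == LogCleaningState.PAUSED_DUE_TO_ERROR)
            .count();
    }
}
```

Note that a second pass over the same partitions leaves the parked partition 
untouched, so the metric is stable until an operator intervenes (e.g., by 
deleting and recreating the topic, as proposed above).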

> Log cleaner thread should not exit on errors
> --------------------------------------------
>
>                 Key: KAFKA-1755
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1755
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Joel Koshy
>              Labels: newbie++
>             Fix For: 0.8.3
>
>
> The log cleaner is a critical process when using compacted topics.
> However, if there is any error in any topic (notably if a key is missing) 
> then the cleaner exits and all other compacted topics will also be adversely 
> affected - i.e., compaction stops across the board.
> This can be improved by just aborting compaction for the affected topic on 
> any error and keeping the thread from exiting.
> Another improvement would be to reject messages without keys that are sent to 
> compacted topics, although this is not sufficient by itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
