[ https://issues.apache.org/jira/browse/KAFKA-6933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493317#comment-16493317 ]
Franco Bonazza commented on KAFKA-6933:
---------------------------------------

I think this is just a tuning issue. I've recovered the state of the sandbox with ulimits and the upgrade, but I'm still concerned that this happens on any form of shutdown.

I'm looking at 2 large topics we intend to keep with very long retention, each with 18 partitions and 1 GB segments; one has about 83 segments per partition and the other about 20, and they seem to be the most affected. I would have thought the old segments / indexes would remain untouched / immutable, but that doesn't seem to be the case?

I'm sorry I can't find a lot of info about this subject, but where can I find information about what causes indexes to be open / become corrupt, and what kind of tuning we can do to ameliorate the situation? The retention we've got now is about 20% of what we want to store in those topics, so I would like to understand whether that's a problem. It was my impression that this was not a crazy use case: https://www.confluent.io/blog/okay-store-data-apache-kafka/
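For what it's worth, here is a quick back-of-the-envelope on the open-file budget implied by those numbers (18 partitions each, roughly 83 and 20 one-gigabyte segments per partition). The assumption that every segment keeps about four files open on the broker (.log, .index, .timeindex, .txnindex) is my own rough guess rather than something taken from the broker code, so treat it as an order-of-magnitude sketch (Python):

{noformat}
# Rough file-descriptor estimate for the two large topics described above.
# Assumption (mine, not from the Kafka codebase): every segment of every
# partition keeps roughly 4 files open (.log, .index, .timeindex, .txnindex).

PARTITIONS = 18
SEGMENTS_PER_PARTITION = [83, 20]   # the two topics mentioned above
FILES_PER_SEGMENT = 4               # assumed per-segment open-file count

segments = PARTITIONS * sum(SEGMENTS_PER_PARTITION)   # 18 * 103 = 1854
handles_now = segments * FILES_PER_SEGMENT            # ~7400
handles_at_target = handles_now * 5                   # retention today is ~20% of target

print(f"segments across both topics:  {segments}")
print(f"estimated handles today:      {handles_now}")
print(f"estimated handles at target:  {handles_at_target}")
{noformat}

Even as a rough figure that is about 7,400 handles today and about 37,000 at the target retention for just these two topics, before sockets, other topics and everything else, which is far beyond the common 1024 / 4096 nofile defaults. So it's probably worth checking the limit the broker process actually runs with (e.g. /proc/<pid>/limits on Linux), since that can differ from what the shell's ulimit reports.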

> Broker reports Corrupted index warnings apparently infinitely
> --------------------------------------------------------------
>
> Key: KAFKA-6933
> URL: https://issues.apache.org/jira/browse/KAFKA-6933
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 1.0.1
> Reporter: Franco Bonazza
> Priority: Major
>
> I'm running into a situation where the server logs continuously show the following snippet:
> {noformat}
> [2018-05-23 10:58:56,590] INFO Loading producer state from offset 20601420 for partition transaction_r10_updates-6 with message format version 2 (kafka.log.Log)
> [2018-05-23 10:58:56,592] INFO Loading producer state from snapshot file '/data/0/kafka-logs/transaction_r10_updates-6/00000000000020601420.snapshot' for partition transaction_r10_updates-6 (kafka.log.ProducerStateManager)
> [2018-05-23 10:58:56,593] INFO Completed load of log transaction_r10_updates-6 with 74 log segments, log start offset 0 and log end offset 20601420 in 5823 ms (kafka.log.Log)
> [2018-05-23 10:58:58,761] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/data/0/kafka-logs/transaction_r10_updates-15/00000000000020544956.index) has non-zero size but the last offset is 20544956 which is no larger than the base offset 20544956.}. deleting /data/0/kafka-logs/transaction_r10_updates-15/00000000000020544956.timeindex, /data/0/kafka-logs/transaction_r10_updates-15/00000000000020544956.index, and /data/0/kafka-logs/transaction_r10_updates-15/00000000000020544956.txnindex and rebuilding index... (kafka.log.Log)
> [2018-05-23 10:58:58,763] INFO Loading producer state from snapshot file '/data/0/kafka-logs/transaction_r10_updates-15/00000000000020544956.snapshot' for partition transaction_r10_updates-15 (kafka.log.ProducerStateManager)
> [2018-05-23 10:59:02,202] INFO Recovering unflushed segment 20544956 in log transaction_r10_updates-15. (kafka.log.Log)
> {noformat}
> The setup is the following:
> Broker is 1.0.1.
> There are mirrors from another cluster using the 0.10.2.1 client.
> There are Kafka Streams applications and other custom consumers / producers using the 1.0.0 client.
>
> While it is doing this, the JVM of the broker is up but it doesn't respond, so it's impossible to produce, consume or run any commands.
>
> If I delete all the index files the WARN turns into an ERROR, which takes a long time (1 day last time I tried) but eventually it reaches a healthy state. Then I start the producers and things are still healthy, but when I start the consumers it quickly goes back into the original WARN loop, which seems infinite.
>
> I couldn't find any references to the problem. It seems that, at the least, the issue is being misreported, and perhaps it isn't actually infinite? I let it loop over the WARN for over a day and it never moved past that, and if there is something really wrong with the state, maybe it should be reported as such.
> The log cleaner log showed a few "too many files open" errors when it originally happened, but ulimit has always been set to unlimited, so I'm not sure what that error means.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)