We've been running Kafka 0.7.0 in production for several months and have been quite happy. Our use case to date has been to pull from the head of our topics, so we're normally consuming within seconds of message production using the high-level consumer, which has been working great as far as I can tell.
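For context, that head-of-topic path is basically the stock high-level consumer setup, something like this (typing from memory and trimmed down, so the ZK address, group id, and topic are placeholders and details may be slightly off):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaMessageStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.message.Message;

public class HeadConsumer {
    public static void main(String[] args) {
        // 0.7-style config: connect via ZooKeeper with a consumer group id
        Properties props = new Properties();
        props.put("zk.connect", "zk1:2181");
        props.put("groupid", "my-group");

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // one stream for the topic; we just block on the iterator and handle
        // each message within seconds of it being produced
        Map<String, List<KafkaMessageStream>> streams =
            connector.createMessageStreams(Collections.singletonMap("my-topic", 1));
        for (Message message : streams.get("my-topic").get(0)) {
            // process message.payload() here
        }
    }
}

Nothing fancy - each stream just keeps up with the head of the topic, so until now we've never had to seek backwards.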
Recently I've started pulling older data (usually a few hours old) using the low-level consumer, and I'm running into what appears to be corruption in the data files. The consumer pauses for several seconds and then throws "java.io.EOFException: Received -1 when reading from channel, socket has likely been closed". The server log shows "ERROR Closing socket for /xx.xx.xx.xx because of error (kafka.network.Processor) java.io.IOException: Input/output error". DumpLogSegments reads up to the problematic offset, then stops and reports that the tail of the log is at offset: <bad offset>, even though there is more data in the file (the next segment file's starting offset is much higher).

I learned here on the mailing list last month that I can skip the rest of the corrupted segment, but I'd rather not do that, because then I'm losing messages. This has happened 5-6 times in the past month, on different brokers, different topics, different partitions, and different segments.

So finally, my questions are:

- Is anyone else pulling older data without issues, or is everyone pretty much always consuming as fast as messages arrive?
- Is there a known bug that would be fixed by upgrading to a newer Kafka version? I don't know if it's the same problem, but I see JIRAs 309 and 310 are marked as fixed; I just don't know in which version.
- Is there any way to examine a corrupt segment file to see what went wrong, or to diagnose why this keeps happening?

Thanks!
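P.S. In case it helps to see what I'm doing, the older-data path is essentially the stock low-level consumer loop (again from memory and simplified; broker, topic, partition, and sizes are placeholders): I ask the broker for segment offsets with getOffsetsBefore, pick a starting point, and fetch forward from there. The multi-second pause and the EOFException come out of the fetch that covers the bad region of the segment.

import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;

public class OldDataConsumer {
    public static void main(String[] args) {
        // connect straight to one broker: host, port, socket timeout (ms), receive buffer (bytes)
        SimpleConsumer consumer = new SimpleConsumer("broker1", 9092, 30000, 1024 * 1024);

        // ask the broker for segment boundary offsets before a given time;
        // -2 = earliest available, -1 = latest (a timestamp in ms also works)
        long[] offsets = consumer.getOffsetsBefore("my-topic", 0, -2L, 10);
        long offset = offsets[0];  // in practice I pick the offset closest to the time range I want

        while (true) {
            // fetch up to 1MB starting at the current offset; this is the call that
            // pauses and then throws the EOFException once it hits the bad region
            ByteBufferMessageSet messages =
                consumer.fetch(new FetchRequest("my-topic", 0, offset, 1024 * 1024));
            for (MessageAndOffset mo : messages) {
                // process mo.message().payload() here ...
                offset = mo.offset();  // offset() is the offset to use for the next fetch
            }
        }
    }
}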