Oops. I originally sent this to the dev list but meant to send it here.
Hi, > > When using Samza 0.9.0 which uses the new Java producer client and snappy > enabled, I see messages getting corrupted on the client side. It never > happens with the old producer and it never happens with lz4, gzip, or no > compression. It only happens when a broker gets restarted (or maybe just > shutdown). > > The error is not always the same. I've noticed at least three types of > errors on the Kafka brokers. > > 1) java.io.IOException: failed to read chunk > at > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:356) > http://pastebin.com/NZrrEHxU > 2) java.lang.OutOfMemoryError: Java heap space > at > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:346) > http://pastebin.com/yuxk1BjY > 3) java.io.IOException: PARSING_ERROR(2) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) > http://pastebin.com/yq98Hx49 > > I've noticed a couple different behaviors from the Samza producer/job > A) It goes into a long retry loop where this message is logged. I saw > this with error #1 above. > > 2015-04-29 18:17:31 Sender [WARN] task[Partition 7] > ssp[kafka,svc.call.w_deploy.c7tH4YaiTQyBEwAAhQzRXw,7] offset[9999253] Got > error produce response with correlation id 4878 on topic-partition > svc.call.w_deploy.T2UDe2PWRYWcVAAAhMOAwA-1, retrying (2147483646 attempts > left). Error: CORRUPT_MESSAGE > > B) The job exits with > org.apache.kafka.common.errors.UnknownServerException (at least when run as > ThreadJob). I saw this with error #3 above. > > org.apache.samza.SamzaException: Unable to send message from > TaskName-Partition 6 to system kafka. > org.apache.kafka.common.errors.UnknownServerException: The server > experienced an unexpected error when processing the request > > There seem to be two issues here: > > 1) When leadership for a topic is transferred to another broker, the Java > client (I think) has to move the data it was buffering for the original > leader broker to the buffer for the new leader. My guess is that the > corruption is happening at this point. > > 2) When a producer has corrupt message, it retries 2.1 billions times in a > hot loop even though it's not a retriable error. It probably shouldn't > retry on such errors. For retriable errors, it would be much safer to have > a backoff scheme for retries. > > Thanks, > > Roger >