Hi,

When using Samza 0.9.0, which uses the new Java producer client, with snappy
compression enabled, I see messages getting corrupted on the client side.  It
never happens with the old producer, and it never happens with lz4, gzip, or
no compression.  It only happens when a broker gets restarted (or maybe just
shut down).

The error is not always the same.  I've noticed at least three types of
errors on the Kafka brokers:

1) java.io.IOException: failed to read chunk
   at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:356)
   http://pastebin.com/NZrrEHxU

2) java.lang.OutOfMemoryError: Java heap space
   at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:346)
   http://pastebin.com/yuxk1BjY

3) java.io.IOException: PARSING_ERROR(2)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
   http://pastebin.com/yq98Hx49

I've noticed a couple of different behaviors from the Samza producer/job:

A) It goes into a long retry loop where the message below is logged.  I saw
this with error #1 above.

2015-04-29 18:17:31 Sender [WARN] task[Partition 7]
ssp[kafka,svc.call.w_deploy.c7tH4YaiTQyBEwAAhQzRXw,7] offset[9999253] Got
error produce response with correlation id 4878 on topic-partition
svc.call.w_deploy.T2UDe2PWRYWcVAAAhMOAwA-1, retrying (2147483646 attempts
left). Error: CORRUPT_MESSAGE

B) The job exits with
org.apache.kafka.common.errors.UnknownServerException (at least when run as
a ThreadJob).  I saw this with error #3 above.

org.apache.samza.SamzaException: Unable to send message from
TaskName-Partition 6 to system kafka.
org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request

There seem to be two issues here:

1) When leadership for a topic partition is transferred to another broker,
the Java client (I think) has to move the data it was buffering for the
original leader broker into the buffer for the new leader.  My guess is that
the corruption happens at this point.

2) When a producer hits a corrupt message, it retries roughly 2.1 billion
times (note the "2147483646 attempts left" in the log above) in a hot loop,
even though CORRUPT_MESSAGE is not a retriable error.  It probably shouldn't
retry on such errors at all.  For genuinely retriable errors, it would be
much safer to have a backoff scheme between retries.
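To illustrate what I mean by a backoff scheme, here's a minimal sketch of
capped exponential backoff.  This is hypothetical code, not the producer's
actual retry logic; the class and parameter names are my own invention:

```java
/**
 * Hypothetical sketch: capped exponential backoff for producer retries.
 * The delay doubles on each attempt, up to a fixed ceiling, instead of
 * retrying immediately in a hot loop.
 */
public class RetryBackoff {
    private final long baseMs;
    private final long maxMs;

    public RetryBackoff(long baseMs, long maxMs) {
        this.baseMs = baseMs;
        this.maxMs = maxMs;
    }

    /** Delay before the given attempt (0-based): baseMs * 2^attempt, capped at maxMs. */
    public long delayMs(int attempt) {
        // Cap the shift so the multiplication can't overflow a long.
        long exp = baseMs << Math.min(attempt, 20);
        return Math.min(exp, maxMs);
    }

    public static void main(String[] args) {
        RetryBackoff backoff = new RetryBackoff(100, 30_000);
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt + " -> " + backoff.delayMs(attempt) + " ms");
        }
    }
}
```

Even with backoff, non-retriable errors like CORRUPT_MESSAGE should fail
fast rather than enter this loop at all.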

Thanks,

Roger
