Re: Message corruption with new Java client + snappy + broker restart

Jun Rao Wed, 13 May 2015 14:26:32 -0700

Roger,

For 1), the producer actually doesn't move buffered data around during
leadership change. The producer buffer is per partition, independent of
what the current leader of that partition is. Does the issue happen w/o
retry? Does this happen in a particular version of snappy?


For 2), UnknownServerException is not a retriable exception. Could you post
what's in the producer log?

Thanks,

Jun


On Tue, May 12, 2015 at 9:41 PM, Roger Hoover <[email protected]>
wrote:

> Hi,
>
> When using Samza 0.9.0 which uses the new Java producer client and snappy
> enabled, I see messages getting corrupted on the client side.  It never
> happens with the old producer and it never happens with lz4, gzip, or no
> compression.  It only happens when a broker gets restarted (or maybe just
> shutdown).
>
> The error is not always the same.  I've noticed at least three types of
> errors on the Kafka brokers.
>
> 1) java.io.IOException: failed to read chunk
> at
>
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:356)
> http://pastebin.com/NZrrEHxU
> 2) java.lang.OutOfMemoryError: Java heap space
>    at
>
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:346)
> http://pastebin.com/yuxk1BjY
> 3) java.io.IOException: PARSING_ERROR(2)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84)
> http://pastebin.com/yq98Hx49
>
> I've noticed a couple different behaviors from the Samza producer/job
> A) It goes into a long retry loop where this message is logged.  I saw this
> with error #1 above.
>
> 2015-04-29 18:17:31 Sender [WARN] task[Partition 7]
> ssp[kafka,svc.call.w_deploy.c7tH4YaiTQyBEwAAhQzRXw,7] offset[9999253] Got
> error produce response with correlation id 4878 on topic-partition
> svc.call.w_deploy.T2UDe2PWRYWcVAAAhMOAwA-1, retrying (2147483646 attempts
> left). Error: CORRUPT_MESSAGE
>
> B) The job exits with
> org.apache.kafka.common.errors.UnknownServerException (at least when run as
> ThreadJob).  I saw this with error #3 above.
>
> org.apache.samza.SamzaException: Unable to send message from
> TaskName-Partition 6 to system kafka.
> org.apache.kafka.common.errors.UnknownServerException: The server
> experienced an unexpected error when processing the request
>
> There seem to be two issues here:
>
> 1) When leadership for a topic is transferred to another broker, the Java
> client (I think) has to move the data it was buffering for the original
> leader broker to the buffer for the new leader.  My guess is that the
> corruption is happening at this point.
>
> 2) When a producer has corrupt message, it retries 2.1 billions times in a
> hot loop even though it's not a retriable error.  It probably shouldn't
> retry on such errors.  For retriable errors, it would be much safer to have
> a backoff scheme for retries.
>
> Thanks,
>
> Roger
>

Re: Message corruption with new Java client + snappy + broker restart

Reply via email to