Great analysis Sijie.

Enrico, are these high traffic machines? Would it be feasible to put
tcpdump running? You could even truncate each message to 100 bytes or
so, to avoid storing payloads. It'd be very useful to see what the
corrupt traffic actually looks like.

-Ivan

On Fri, Mar 9, 2018 at 10:43 AM, Enrico Olivelli <eolive...@gmail.com> wrote:
> Reverted to Netty 4.1.12. System is "more" stable but after "some" restart
> we still have errors on client side on tailing readers, rebooting the JMV
> "resolved" temporary the problem.
>
> I have no more errors on the Bookie side
>
> My  idea:
> - client is reading from 2 bookies, there is some bug in this area
> - once the reader client messes up Netty stops working properly and
> corrupts ByteBufs
> - once the reader JMV is promoted to writer (becomes leader and starts
> writing) its internal BookKeeper client is still messed up and writes are
> not able to complete
>
>
> Enrico
>
>
>
>
> 2018-03-09 9:55 GMT+01:00 Enrico Olivelli <eolive...@gmail.com>:
>
>>
>>
>> 2018-03-09 8:59 GMT+01:00 Sijie Guo <guosi...@gmail.com>:
>>
>>> Sent out a PR for the issues that I observed:
>>>
>>> https://github.com/apache/bookkeeper/pull/1240
>>>
>>
>>
>> Other findings:
>> - my problem is not related to jdk9, it happens with jdk8 too
>> - the "tailing reader" is able to make progress and follow the WAL, so not
>> all the reads fail
>> - the "writer" is able to make progress and write to the WAL, so not all
>> the write fail
>>
>> I have run BK 4.6.1 tests with Netty 4.1.21Final but there is no issue
>> (quite the OS as the machines with the error, I am on Fedora, machines are
>> on CentOS)
>>
>> Enrico
>>
>>
>>>
>>>
>>> On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosi...@gmail.com> wrote:
>>>
>>>> So the problem here is:
>>>>
>>>> - a corrupted request failed the V3 request decoder, so bookie switched
>>>> to use v2 request decoder. Once the switch happen, the bookie will always
>>>> use v2 request decoder decoding v3 request. then all your v3 requests will
>>>> be failing with unknown op and trigger the bytebuf double releasing issue.
>>>>
>>>> - a corrupted response failed the V3 response decoder, so client
>>>> switched to use v3 response decoder. Once the switch happen, the client
>>>> will always use v2 request decoder decoding v3 response. so all the
>>>> responses will be failing with " Received unknown response : op code"
>>>>
>>>> Although I don't know how the first request/response happened (it can be
>>>> any issue, even network corruption), the problem is once this happen,
>>>> either client/bookie will be forcing to use v2 request/response decoder and
>>>> never change. so the problem will remain until you restarted. And this is
>>>> the behavior that Enrico is seeing.
>>>>
>>>> There are a couple of issues to address here:
>>>>
>>>> 1) we need to add a flag to disable falling back to use v2
>>>> request/response coder and make sure it always use v3 protocol. In this
>>>> way, we will guarantee the problem not persist in the channel level.
>>>> 2) we need to throw exception on unknown op code at request decode :
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeepe
>>>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro
>>>> toEncoding.java#L195 . As what we did at response decoder :
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeepe
>>>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro
>>>> toEncoding.java#L304 in https://github.com/apache/bookkeeper/issues/198
>>>>
>>>>
>>>> Details are listed as below:
>>>>
>>>> --
>>>>
>>>> Details:
>>>>
>>>> - The client side stacktrace clearly showed that it is using v2 decoder
>>>> on decoding responses. That means client failed to parse v3 response and it
>>>> falls back to use v2 decoder on decoding responses. Because it is a
>>>> "corrupted" v3 response, so v2 decoder can't
>>>> find a right op code.  Then it throws illegal state exception.
>>>>
>>>>
>>>> *Caused by: java.lang.IllegalStateException: Received unknown response :
>>>> op code = 6        at
>>>> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)*
>>>>
>>>> - For the stacktrace at bookie side:
>>>>
>>>> 1. It is clear that BookieRequestHandler#L77 is called. That means the
>>>> message is neither v2 request nor v3 request. It is a ByteBuf.
>>>>
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>>> server/src/main/java/org/apache/bookkeeper/proto/BookieReque
>>>> stHandler.java#L77
>>>>
>>>> 2. V3 request decoder is using protobuf to decode bytes array. If it is
>>>> not a valid v3 request, it will always throw exceptions. so the code
>>>> mentioned above will never be triggered
>>>>
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>>> Encoding.java#L344
>>>>
>>>> 3. The only case that BookieRequestHandler#L77 can happen is v2 request
>>>> decoder doesn't parse the buffer. This leads to the bug in
>>>>
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>>> Encoding.java#L194
>>>>
>>>>
>>>> - How this can happen?
>>>>
>>>> If the client is using v3 protocol, but both client and bookie side
>>>> stacktraces show it is using v2 protocol, that means there was a corruption
>>>> causing client and bookie switching from v3 protocol to v2 protocol.
>>>>
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>>> Encoding.java#L502
>>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>>> Encoding.java#L432
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolive...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>> I am seeing this bad errors in some test environments with 4.6.1
>>>>> The errors appear during rolling restart of the application, the test
>>>>> env is made of 6 machines:
>>>>> - 3 bookies
>>>>> - 3 client machines (with several BK clients, of different sub-systems)
>>>>> - running with 4.6.1 both client and servers
>>>>>
>>>>> *I do not have* reports of this errors from production, already running
>>>>> 4.6.1 for the last month
>>>>>
>>>>> But the problem is quite scary
>>>>>
>>>>> This is a sample of relevant errors on clients (in this case Majordodo
>>>>> brokers, with log level = INFO)
>>>>>
>>>>> logs/org.apache.bookkeeper.client.PendingAddOp:
>>>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>>>> E2557
>>>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>>>> E2558
>>>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>>>> E2559
>>>>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>>>> operation timeout
>>>>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>>>> operation timeout
>>>>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>>>> operation timeout
>>>>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>>>> operation timeout
>>>>> 18-03-07-16-00-05       Failed to write entry (366680, 1865): Bookie
>>>>> operation timeout
>>>>>
>>>>>
>>>>> org.apache.bookkeeper.proto.PerChannelBookieClient:
>>>>>
>>>>> 18-03-07-10-06-44       Unexpected exception caught by bookie client
>>>>> channel handler
>>>>> 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>>>>> java.lang.IllegalStateException: Received unknown response : op code =
>>>>> 6
>>>>> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException:
>>>>> Received unknown response : op code = 6
>>>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>>>> essageToMessageDecoder.java:98)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:310)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:297)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>>>> oMessageDecoder.java:413)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>>>> ToMessageDecoder.java:265)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>>>> ead(DefaultChannelPipeline.java:1414)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>>>> ultChannelPipeline.java:945)
>>>>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>>>> tLoop.java:404)
>>>>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav
>>>>> a:304)
>>>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>>>> gleThreadEventExecutor.java:886)
>>>>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Caused by: java.lang.IllegalStateException: Received unknown response
>>>>> : op code = 6
>>>>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe
>>>>> CoderPreV3.decode(BookieProtoEncoding.java:294)
>>>>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco
>>>>> der.decode(BookieProtoEncoding.java:478)
>>>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>>>> essageToMessageDecoder.java:88)
>>>>>
>>>>>
>>>>>         ... 19 more
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Errors on Bookie side, with log level = INFO)
>>>>>
>>>>>
>>>>> io.netty.channel.DefaultChannelPipeline:
>>>>> 8-03-07-09-51-38       An exceptionCaught() event was fired, and it
>>>>> reached at the tail of the pipeline. It usually means the last handler in
>>>>> the pipeline did not handle the exception.
>>>>> 18-03-07-09-51-38       io.netty.util.IllegalReferenceCountException:
>>>>> refCnt: 0, increment: 1
>>>>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
>>>>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs
>>>>> tractReferenceCountedByteBuf.java:100)
>>>>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst
>>>>> ractReferenceCountedByteBuf.java:84)
>>>>>         at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.
>>>>> java:88)
>>>>>         at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe
>>>>> ssage(DefaultChannelPipeline.java:1202)
>>>>>         at io.netty.channel.DefaultChannelPipeline$TailContext.channelR
>>>>> ead(DefaultChannelPipeline.java:1299)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>>>> annelInboundHandlerAdapter.java:86)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead
>>>>> (BookieRequestHandler.java:78)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>>>> annelInboundHandlerAdapter.java:86)
>>>>>         at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch
>>>>> annelRead(AuthHandler.java:88)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>>>> essageToMessageDecoder.java:102)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:310)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>>> ByteToMessageDecoder.java:297)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>>>> oMessageDecoder.java:413)
>>>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>>>> ToMessageDecoder.java:265)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>>>> ead(DefaultChannelPipeline.java:1414)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>>>> ultChannelPipeline.java:945)
>>>>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>>>> tLoop.java:404)
>>>>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav
>>>>> a:304)
>>>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>>>> gleThreadEventExecutor.java:886)
>>>>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>

Reply via email to