Reverted to Netty 4.1.12. System is "more" stable but after "some" restart
we still have errors on client side on tailing readers, rebooting the JMV
"resolved" temporary the problem.

I have no more errors on the Bookie side

My  idea:
- client is reading from 2 bookies, there is some bug in this area
- once the reader client messes up Netty stops working properly and
corrupts ByteBufs
- once the reader JMV is promoted to writer (becomes leader and starts
writing) its internal BookKeeper client is still messed up and writes are
not able to complete


Enrico




2018-03-09 9:55 GMT+01:00 Enrico Olivelli <eolive...@gmail.com>:

>
>
> 2018-03-09 8:59 GMT+01:00 Sijie Guo <guosi...@gmail.com>:
>
>> Sent out a PR for the issues that I observed:
>>
>> https://github.com/apache/bookkeeper/pull/1240
>>
>
>
> Other findings:
> - my problem is not related to jdk9, it happens with jdk8 too
> - the "tailing reader" is able to make progress and follow the WAL, so not
> all the reads fail
> - the "writer" is able to make progress and write to the WAL, so not all
> the write fail
>
> I have run BK 4.6.1 tests with Netty 4.1.21Final but there is no issue
> (quite the OS as the machines with the error, I am on Fedora, machines are
> on CentOS)
>
> Enrico
>
>
>>
>>
>> On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosi...@gmail.com> wrote:
>>
>>> So the problem here is:
>>>
>>> - a corrupted request failed the V3 request decoder, so bookie switched
>>> to use v2 request decoder. Once the switch happen, the bookie will always
>>> use v2 request decoder decoding v3 request. then all your v3 requests will
>>> be failing with unknown op and trigger the bytebuf double releasing issue.
>>>
>>> - a corrupted response failed the V3 response decoder, so client
>>> switched to use v3 response decoder. Once the switch happen, the client
>>> will always use v2 request decoder decoding v3 response. so all the
>>> responses will be failing with " Received unknown response : op code"
>>>
>>> Although I don't know how the first request/response happened (it can be
>>> any issue, even network corruption), the problem is once this happen,
>>> either client/bookie will be forcing to use v2 request/response decoder and
>>> never change. so the problem will remain until you restarted. And this is
>>> the behavior that Enrico is seeing.
>>>
>>> There are a couple of issues to address here:
>>>
>>> 1) we need to add a flag to disable falling back to use v2
>>> request/response coder and make sure it always use v3 protocol. In this
>>> way, we will guarantee the problem not persist in the channel level.
>>> 2) we need to throw exception on unknown op code at request decode :
>>> https://github.com/apache/bookkeeper/blob/master/bookkeepe
>>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro
>>> toEncoding.java#L195 . As what we did at response decoder :
>>> https://github.com/apache/bookkeeper/blob/master/bookkeepe
>>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro
>>> toEncoding.java#L304 in https://github.com/apache/bookkeeper/issues/198
>>>
>>>
>>> Details are listed as below:
>>>
>>> --
>>>
>>> Details:
>>>
>>> - The client side stacktrace clearly showed that it is using v2 decoder
>>> on decoding responses. That means client failed to parse v3 response and it
>>> falls back to use v2 decoder on decoding responses. Because it is a
>>> "corrupted" v3 response, so v2 decoder can't
>>> find a right op code.  Then it throws illegal state exception.
>>>
>>>
>>> *Caused by: java.lang.IllegalStateException: Received unknown response :
>>> op code = 6        at
>>> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)*
>>>
>>> - For the stacktrace at bookie side:
>>>
>>> 1. It is clear that BookieRequestHandler#L77 is called. That means the
>>> message is neither v2 request nor v3 request. It is a ByteBuf.
>>>
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>> server/src/main/java/org/apache/bookkeeper/proto/BookieReque
>>> stHandler.java#L77
>>>
>>> 2. V3 request decoder is using protobuf to decode bytes array. If it is
>>> not a valid v3 request, it will always throw exceptions. so the code
>>> mentioned above will never be triggered
>>>
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>> Encoding.java#L344
>>>
>>> 3. The only case that BookieRequestHandler#L77 can happen is v2 request
>>> decoder doesn't parse the buffer. This leads to the bug in
>>>
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>> Encoding.java#L194
>>>
>>>
>>> - How this can happen?
>>>
>>> If the client is using v3 protocol, but both client and bookie side
>>> stacktraces show it is using v2 protocol, that means there was a corruption
>>> causing client and bookie switching from v3 protocol to v2 protocol.
>>>
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>> Encoding.java#L502
>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper-
>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto
>>> Encoding.java#L432
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolive...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>> I am seeing this bad errors in some test environments with 4.6.1
>>>> The errors appear during rolling restart of the application, the test
>>>> env is made of 6 machines:
>>>> - 3 bookies
>>>> - 3 client machines (with several BK clients, of different sub-systems)
>>>> - running with 4.6.1 both client and servers
>>>>
>>>> *I do not have* reports of this errors from production, already running
>>>> 4.6.1 for the last month
>>>>
>>>> But the problem is quite scary
>>>>
>>>> This is a sample of relevant errors on clients (in this case Majordodo
>>>> brokers, with log level = INFO)
>>>>
>>>> logs/org.apache.bookkeeper.client.PendingAddOp:
>>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>>> E2557
>>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>>> E2558
>>>> 18-03-07-09-51-43       Write of ledger entry to quorum failed: L366634
>>>> E2559
>>>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>>> operation timeout
>>>> 18-03-07-15-59-55       Failed to write entry (366680, 1865): Bookie
>>>> operation timeout
>>>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>>> operation timeout
>>>> 18-03-07-16-00-00       Failed to write entry (366680, 1865): Bookie
>>>> operation timeout
>>>> 18-03-07-16-00-05       Failed to write entry (366680, 1865): Bookie
>>>> operation timeout
>>>>
>>>>
>>>> org.apache.bookkeeper.proto.PerChannelBookieClient:
>>>>
>>>> 18-03-07-10-06-44       Unexpected exception caught by bookie client
>>>> channel handler
>>>> 18-03-07-10-06-44       io.netty.handler.codec.DecoderException:
>>>> java.lang.IllegalStateException: Received unknown response : op code =
>>>> 6
>>>> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException:
>>>> Received unknown response : op code = 6
>>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>>> essageToMessageDecoder.java:98)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>> ByteToMessageDecoder.java:310)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>> ByteToMessageDecoder.java:297)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>>> oMessageDecoder.java:413)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>>> ToMessageDecoder.java:265)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>>> ead(DefaultChannelPipeline.java:1414)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>>> ultChannelPipeline.java:945)
>>>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>>> tLoop.java:404)
>>>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav
>>>> a:304)
>>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>>> gleThreadEventExecutor.java:886)
>>>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>
>>>
>>>
>>>
>>>
>>>
>>>> Caused by: java.lang.IllegalStateException: Received unknown response
>>>> : op code = 6
>>>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe
>>>> CoderPreV3.decode(BookieProtoEncoding.java:294)
>>>>         at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco
>>>> der.decode(BookieProtoEncoding.java:478)
>>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>>> essageToMessageDecoder.java:88)
>>>>
>>>>
>>>>         ... 19 more
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> Errors on Bookie side, with log level = INFO)
>>>>
>>>>
>>>> io.netty.channel.DefaultChannelPipeline:
>>>> 8-03-07-09-51-38       An exceptionCaught() event was fired, and it
>>>> reached at the tail of the pipeline. It usually means the last handler in
>>>> the pipeline did not handle the exception.
>>>> 18-03-07-09-51-38       io.netty.util.IllegalReferenceCountException:
>>>> refCnt: 0, increment: 1
>>>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
>>>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs
>>>> tractReferenceCountedByteBuf.java:100)
>>>>         at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst
>>>> ractReferenceCountedByteBuf.java:84)
>>>>         at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil.
>>>> java:88)
>>>>         at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe
>>>> ssage(DefaultChannelPipeline.java:1202)
>>>>         at io.netty.channel.DefaultChannelPipeline$TailContext.channelR
>>>> ead(DefaultChannelPipeline.java:1299)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>>> annelInboundHandlerAdapter.java:86)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead
>>>> (BookieRequestHandler.java:78)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch
>>>> annelInboundHandlerAdapter.java:86)
>>>>         at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch
>>>> annelRead(AuthHandler.java:88)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M
>>>> essageToMessageDecoder.java:102)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>> ByteToMessageDecoder.java:310)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(
>>>> ByteToMessageDecoder.java:297)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT
>>>> oMessageDecoder.java:413)
>>>>         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte
>>>> ToMessageDecoder.java:265)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe
>>>> ad(AbstractChannelHandlerContext.java:340)
>>>>         at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR
>>>> ead(DefaultChannelPipeline.java:1414)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:362)
>>>>         at io.netty.channel.AbstractChannelHandlerContext.invokeChannel
>>>> Read(AbstractChannelHandlerContext.java:348)
>>>>         at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa
>>>> ultChannelPipeline.java:945)
>>>>         at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea
>>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
>>>>         at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven
>>>> tLoop.java:404)
>>>>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav
>>>> a:304)
>>>>         at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin
>>>> gleThreadEventExecutor.java:886)
>>>>         at java.base/java.lang.Thread.run(Thread.java:844)
>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to