Reverted to Netty 4.1.12. System is "more" stable but after "some" restart we still have errors on client side on tailing readers, rebooting the JMV "resolved" temporary the problem.
I have no more errors on the Bookie side My idea: - client is reading from 2 bookies, there is some bug in this area - once the reader client messes up Netty stops working properly and corrupts ByteBufs - once the reader JMV is promoted to writer (becomes leader and starts writing) its internal BookKeeper client is still messed up and writes are not able to complete Enrico 2018-03-09 9:55 GMT+01:00 Enrico Olivelli <eolive...@gmail.com>: > > > 2018-03-09 8:59 GMT+01:00 Sijie Guo <guosi...@gmail.com>: > >> Sent out a PR for the issues that I observed: >> >> https://github.com/apache/bookkeeper/pull/1240 >> > > > Other findings: > - my problem is not related to jdk9, it happens with jdk8 too > - the "tailing reader" is able to make progress and follow the WAL, so not > all the reads fail > - the "writer" is able to make progress and write to the WAL, so not all > the write fail > > I have run BK 4.6.1 tests with Netty 4.1.21Final but there is no issue > (quite the OS as the machines with the error, I am on Fedora, machines are > on CentOS) > > Enrico > > >> >> >> On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosi...@gmail.com> wrote: >> >>> So the problem here is: >>> >>> - a corrupted request failed the V3 request decoder, so bookie switched >>> to use v2 request decoder. Once the switch happen, the bookie will always >>> use v2 request decoder decoding v3 request. then all your v3 requests will >>> be failing with unknown op and trigger the bytebuf double releasing issue. >>> >>> - a corrupted response failed the V3 response decoder, so client >>> switched to use v3 response decoder. Once the switch happen, the client >>> will always use v2 request decoder decoding v3 response. so all the >>> responses will be failing with " Received unknown response : op code" >>> >>> Although I don't know how the first request/response happened (it can be >>> any issue, even network corruption), the problem is once this happen, >>> either client/bookie will be forcing to use v2 request/response decoder and >>> never change. so the problem will remain until you restarted. And this is >>> the behavior that Enrico is seeing. >>> >>> There are a couple of issues to address here: >>> >>> 1) we need to add a flag to disable falling back to use v2 >>> request/response coder and make sure it always use v3 protocol. In this >>> way, we will guarantee the problem not persist in the channel level. >>> 2) we need to throw exception on unknown op code at request decode : >>> https://github.com/apache/bookkeeper/blob/master/bookkeepe >>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro >>> toEncoding.java#L195 . As what we did at response decoder : >>> https://github.com/apache/bookkeeper/blob/master/bookkeepe >>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro >>> toEncoding.java#L304 in https://github.com/apache/bookkeeper/issues/198 >>> >>> >>> Details are listed as below: >>> >>> -- >>> >>> Details: >>> >>> - The client side stacktrace clearly showed that it is using v2 decoder >>> on decoding responses. That means client failed to parse v3 response and it >>> falls back to use v2 decoder on decoding responses. Because it is a >>> "corrupted" v3 response, so v2 decoder can't >>> find a right op code. Then it throws illegal state exception. >>> >>> >>> *Caused by: java.lang.IllegalStateException: Received unknown response : >>> op code = 6 at >>> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)* >>> >>> - For the stacktrace at bookie side: >>> >>> 1. It is clear that BookieRequestHandler#L77 is called. That means the >>> message is neither v2 request nor v3 request. It is a ByteBuf. >>> >>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >>> server/src/main/java/org/apache/bookkeeper/proto/BookieReque >>> stHandler.java#L77 >>> >>> 2. V3 request decoder is using protobuf to decode bytes array. If it is >>> not a valid v3 request, it will always throw exceptions. so the code >>> mentioned above will never be triggered >>> >>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto >>> Encoding.java#L344 >>> >>> 3. The only case that BookieRequestHandler#L77 can happen is v2 request >>> decoder doesn't parse the buffer. This leads to the bug in >>> >>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto >>> Encoding.java#L194 >>> >>> >>> - How this can happen? >>> >>> If the client is using v3 protocol, but both client and bookie side >>> stacktraces show it is using v2 protocol, that means there was a corruption >>> causing client and bookie switching from v3 protocol to v2 protocol. >>> >>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto >>> Encoding.java#L502 >>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto >>> Encoding.java#L432 >>> >>> >>> >>> >>> >>> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolive...@gmail.com> >>> wrote: >>> >>>> Hi all, >>>> I am seeing this bad errors in some test environments with 4.6.1 >>>> The errors appear during rolling restart of the application, the test >>>> env is made of 6 machines: >>>> - 3 bookies >>>> - 3 client machines (with several BK clients, of different sub-systems) >>>> - running with 4.6.1 both client and servers >>>> >>>> *I do not have* reports of this errors from production, already running >>>> 4.6.1 for the last month >>>> >>>> But the problem is quite scary >>>> >>>> This is a sample of relevant errors on clients (in this case Majordodo >>>> brokers, with log level = INFO) >>>> >>>> logs/org.apache.bookkeeper.client.PendingAddOp: >>>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >>>> E2557 >>>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >>>> E2558 >>>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >>>> E2559 >>>> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >>>> operation timeout >>>> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >>>> operation timeout >>>> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >>>> operation timeout >>>> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >>>> operation timeout >>>> 18-03-07-16-00-05 Failed to write entry (366680, 1865): Bookie >>>> operation timeout >>>> >>>> >>>> org.apache.bookkeeper.proto.PerChannelBookieClient: >>>> >>>> 18-03-07-10-06-44 Unexpected exception caught by bookie client >>>> channel handler >>>> 18-03-07-10-06-44 io.netty.handler.codec.DecoderException: >>>> java.lang.IllegalStateException: Received unknown response : op code = >>>> 6 >>>> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: >>>> Received unknown response : op code = 6 >>>> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >>>> essageToMessageDecoder.java:98) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>>> ByteToMessageDecoder.java:310) >>>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>>> ByteToMessageDecoder.java:297) >>>> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >>>> oMessageDecoder.java:413) >>>> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >>>> ToMessageDecoder.java:265) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR >>>> ead(DefaultChannelPipeline.java:1414) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >>>> ultChannelPipeline.java:945) >>>> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >>>> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >>>> tLoop.java:404) >>>> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav >>>> a:304) >>>> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin >>>> gleThreadEventExecutor.java:886) >>>> at java.base/java.lang.Thread.run(Thread.java:844) >>>> >>> >>> >>> >>> >>> >>>> Caused by: java.lang.IllegalStateException: Received unknown response >>>> : op code = 6 >>>> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe >>>> CoderPreV3.decode(BookieProtoEncoding.java:294) >>>> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco >>>> der.decode(BookieProtoEncoding.java:478) >>>> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >>>> essageToMessageDecoder.java:88) >>>> >>>> >>>> ... 19 more >>>> >>> >>> >>> >>> >>> >>> >>> >>>> >>>> >>>> Errors on Bookie side, with log level = INFO) >>>> >>>> >>>> io.netty.channel.DefaultChannelPipeline: >>>> 8-03-07-09-51-38 An exceptionCaught() event was fired, and it >>>> reached at the tail of the pipeline. It usually means the last handler in >>>> the pipeline did not handle the exception. >>>> 18-03-07-09-51-38 io.netty.util.IllegalReferenceCountException: >>>> refCnt: 0, increment: 1 >>>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1 >>>> at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs >>>> tractReferenceCountedByteBuf.java:100) >>>> at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst >>>> ractReferenceCountedByteBuf.java:84) >>>> at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil. >>>> java:88) >>>> at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe >>>> ssage(DefaultChannelPipeline.java:1202) >>>> at io.netty.channel.DefaultChannelPipeline$TailContext.channelR >>>> ead(DefaultChannelPipeline.java:1299) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >>>> annelInboundHandlerAdapter.java:86) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead >>>> (BookieRequestHandler.java:78) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >>>> annelInboundHandlerAdapter.java:86) >>>> at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch >>>> annelRead(AuthHandler.java:88) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >>>> essageToMessageDecoder.java:102) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>>> ByteToMessageDecoder.java:310) >>>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>>> ByteToMessageDecoder.java:297) >>>> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >>>> oMessageDecoder.java:413) >>>> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >>>> ToMessageDecoder.java:265) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>>> ad(AbstractChannelHandlerContext.java:340) >>>> at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR >>>> ead(DefaultChannelPipeline.java:1414) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:362) >>>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>>> Read(AbstractChannelHandlerContext.java:348) >>>> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >>>> ultChannelPipeline.java:945) >>>> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >>>> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >>>> tLoop.java:404) >>>> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav >>>> a:304) >>>> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin >>>> gleThreadEventExecutor.java:886) >>>> at java.base/java.lang.Thread.run(Thread.java:844) >>>> >>>> >>>> >>> >> >