2018-03-09 8:59 GMT+01:00 Sijie Guo <guosi...@gmail.com>: > Sent out a PR for the issues that I observed: > > https://github.com/apache/bookkeeper/pull/1240 >
Other findings: - my problem is not related to jdk9, it happens with jdk8 too - the "tailing reader" is able to make progress and follow the WAL, so not all the reads fail - the "writer" is able to make progress and write to the WAL, so not all the write fail I have run BK 4.6.1 tests with Netty 4.1.21Final but there is no issue (quite the OS as the machines with the error, I am on Fedora, machines are on CentOS) Enrico > > > On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosi...@gmail.com> wrote: > >> So the problem here is: >> >> - a corrupted request failed the V3 request decoder, so bookie switched >> to use v2 request decoder. Once the switch happen, the bookie will always >> use v2 request decoder decoding v3 request. then all your v3 requests will >> be failing with unknown op and trigger the bytebuf double releasing issue. >> >> - a corrupted response failed the V3 response decoder, so client switched >> to use v3 response decoder. Once the switch happen, the client will always >> use v2 request decoder decoding v3 response. so all the responses will be >> failing with " Received unknown response : op code" >> >> Although I don't know how the first request/response happened (it can be >> any issue, even network corruption), the problem is once this happen, >> either client/bookie will be forcing to use v2 request/response decoder and >> never change. so the problem will remain until you restarted. And this is >> the behavior that Enrico is seeing. >> >> There are a couple of issues to address here: >> >> 1) we need to add a flag to disable falling back to use v2 >> request/response coder and make sure it always use v3 protocol. In this >> way, we will guarantee the problem not persist in the channel level. >> 2) we need to throw exception on unknown op code at request decode : >> https://github.com/apache/bookkeeper/blob/master/bookkeepe >> r-server/src/main/java/org/apache/bookkeeper/proto/ >> BookieProtoEncoding.java#L195 . As what we did at response decoder : >> https://github.com/apache/bookkeeper/blob/master/bookkeepe >> r-server/src/main/java/org/apache/bookkeeper/proto/ >> BookieProtoEncoding.java#L304 in https://github.com/apache/b >> ookkeeper/issues/198 >> >> >> Details are listed as below: >> >> -- >> >> Details: >> >> - The client side stacktrace clearly showed that it is using v2 decoder >> on decoding responses. That means client failed to parse v3 response and it >> falls back to use v2 decoder on decoding responses. Because it is a >> "corrupted" v3 response, so v2 decoder can't >> find a right op code. Then it throws illegal state exception. >> >> >> *Caused by: java.lang.IllegalStateException: Received unknown response : >> op code = 6 at >> org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)* >> >> - For the stacktrace at bookie side: >> >> 1. It is clear that BookieRequestHandler#L77 is called. That means the >> message is neither v2 request nor v3 request. It is a ByteBuf. >> >> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >> server/src/main/java/org/apache/bookkeeper/proto/ >> BookieRequestHandler.java#L77 >> >> 2. V3 request decoder is using protobuf to decode bytes array. If it is >> not a valid v3 request, it will always throw exceptions. so the code >> mentioned above will never be triggered >> >> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >> server/src/main/java/org/apache/bookkeeper/proto/ >> BookieProtoEncoding.java#L344 >> >> 3. The only case that BookieRequestHandler#L77 can happen is v2 request >> decoder doesn't parse the buffer. This leads to the bug in >> >> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >> server/src/main/java/org/apache/bookkeeper/proto/ >> BookieProtoEncoding.java#L194 >> >> >> - How this can happen? >> >> If the client is using v3 protocol, but both client and bookie side >> stacktraces show it is using v2 protocol, that means there was a corruption >> causing client and bookie switching from v3 protocol to v2 protocol. >> >> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >> server/src/main/java/org/apache/bookkeeper/proto/ >> BookieProtoEncoding.java#L502 >> https://github.com/apache/bookkeeper/blob/master/bookkeeper- >> server/src/main/java/org/apache/bookkeeper/proto/ >> BookieProtoEncoding.java#L432 >> >> >> >> >> >> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolive...@gmail.com> >> wrote: >> >>> Hi all, >>> I am seeing this bad errors in some test environments with 4.6.1 >>> The errors appear during rolling restart of the application, the test >>> env is made of 6 machines: >>> - 3 bookies >>> - 3 client machines (with several BK clients, of different sub-systems) >>> - running with 4.6.1 both client and servers >>> >>> *I do not have* reports of this errors from production, already running >>> 4.6.1 for the last month >>> >>> But the problem is quite scary >>> >>> This is a sample of relevant errors on clients (in this case Majordodo >>> brokers, with log level = INFO) >>> >>> logs/org.apache.bookkeeper.client.PendingAddOp: >>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >>> E2557 >>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >>> E2558 >>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >>> E2559 >>> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >>> operation timeout >>> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >>> operation timeout >>> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >>> operation timeout >>> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >>> operation timeout >>> 18-03-07-16-00-05 Failed to write entry (366680, 1865): Bookie >>> operation timeout >>> >>> >>> org.apache.bookkeeper.proto.PerChannelBookieClient: >>> >>> 18-03-07-10-06-44 Unexpected exception caught by bookie client >>> channel handler >>> 18-03-07-10-06-44 io.netty.handler.codec.DecoderException: >>> java.lang.IllegalStateException: Received unknown response : op code = 6 >>> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: >>> Received unknown response : op code = 6 >>> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >>> essageToMessageDecoder.java:98) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>> ByteToMessageDecoder.java:310) >>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>> ByteToMessageDecoder.java:297) >>> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >>> oMessageDecoder.java:413) >>> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >>> ToMessageDecoder.java:265) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR >>> ead(DefaultChannelPipeline.java:1414) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >>> ultChannelPipeline.java:945) >>> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >>> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >>> tLoop.java:404) >>> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav >>> a:304) >>> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin >>> gleThreadEventExecutor.java:886) >>> at java.base/java.lang.Thread.run(Thread.java:844) >>> >> >> >> >> >> >>> Caused by: java.lang.IllegalStateException: Received unknown response : >>> op code = 6 >>> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe >>> CoderPreV3.decode(BookieProtoEncoding.java:294) >>> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco >>> der.decode(BookieProtoEncoding.java:478) >>> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >>> essageToMessageDecoder.java:88) >>> >>> >>> ... 19 more >>> >> >> >> >> >> >> >> >>> >>> >>> Errors on Bookie side, with log level = INFO) >>> >>> >>> io.netty.channel.DefaultChannelPipeline: >>> 8-03-07-09-51-38 An exceptionCaught() event was fired, and it >>> reached at the tail of the pipeline. It usually means the last handler in >>> the pipeline did not handle the exception. >>> 18-03-07-09-51-38 io.netty.util.IllegalReferenceCountException: >>> refCnt: 0, increment: 1 >>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1 >>> at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs >>> tractReferenceCountedByteBuf.java:100) >>> at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst >>> ractReferenceCountedByteBuf.java:84) >>> at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil. >>> java:88) >>> at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe >>> ssage(DefaultChannelPipeline.java:1202) >>> at io.netty.channel.DefaultChannelPipeline$TailContext.channelR >>> ead(DefaultChannelPipeline.java:1299) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >>> annelInboundHandlerAdapter.java:86) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead >>> (BookieRequestHandler.java:78) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >>> annelInboundHandlerAdapter.java:86) >>> at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch >>> annelRead(AuthHandler.java:88) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >>> essageToMessageDecoder.java:102) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>> ByteToMessageDecoder.java:310) >>> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >>> ByteToMessageDecoder.java:297) >>> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >>> oMessageDecoder.java:413) >>> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >>> ToMessageDecoder.java:265) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >>> ad(AbstractChannelHandlerContext.java:340) >>> at io.netty.channel.DefaultChannelPipeline$HeadContext.channelR >>> ead(DefaultChannelPipeline.java:1414) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:362) >>> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >>> Read(AbstractChannelHandlerContext.java:348) >>> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >>> ultChannelPipeline.java:945) >>> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >>> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >>> tLoop.java:404) >>> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.jav >>> a:304) >>> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(Sin >>> gleThreadEventExecutor.java:886) >>> at java.base/java.lang.Thread.run(Thread.java:844) >>> >>> >>> >> >