(switch to dev@) @Sijie very good explanation.
I am back to work I we have found errors even on a client reader which is performing tailing reads -03-09-08-34-19 io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: Received unknown response : op code = 9 io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: Received unknown response : op code = 9 at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:98) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1414) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:945) at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:404) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:304) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:886) at java.base/java.lang.Thread.run(Thread.java:844) Caused by: java.lang.IllegalStateException: Received unknown response : op code = 9 at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294) at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDecoder.decode(BookieProtoEncoding.java:474) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88) ... 19 more I am now running BK tests using that version of Netty, in parallel we are reverting Netty to 4.1.12 on QA env The previous version of the application was running with Netty 4.1.12 Stay tuned Enrico 2018-03-09 7:47 GMT+01:00 Sijie Guo <guosi...@gmail.com>: > So the problem here is: > > - a corrupted request failed the V3 request decoder, so bookie switched to > use v2 request decoder. Once the switch happen, the bookie will always use > v2 request decoder decoding v3 request. then all your v3 requests will be > failing with unknown op and trigger the bytebuf double releasing issue. > > - a corrupted response failed the V3 response decoder, so client switched > to use v3 response decoder. Once the switch happen, the client will always > use v2 request decoder decoding v3 response. so all the responses will be > failing with " Received unknown response : op code" > > Although I don't know how the first request/response happened (it can be > any issue, even network corruption), the problem is once this happen, > either client/bookie will be forcing to use v2 request/response decoder and > never change. so the problem will remain until you restarted. And this is > the behavior that Enrico is seeing. > > There are a couple of issues to address here: > > 1) we need to add a flag to disable falling back to use v2 > request/response coder and make sure it always use v3 protocol. In this > way, we will guarantee the problem not persist in the channel level. > 2) we need to throw exception on unknown op code at request decode : > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L195 . As what we did at response decoder > : https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L304 in https://github.com/apache/ > bookkeeper/issues/198 > > > Details are listed as below: > > -- > > Details: > > - The client side stacktrace clearly showed that it is using v2 decoder on > decoding responses. That means client failed to parse v3 response and it > falls back to use v2 decoder on decoding responses. Because it is a > "corrupted" v3 response, so v2 decoder can't > find a right op code. Then it throws illegal state exception. > > > *Caused by: java.lang.IllegalStateException: Received unknown response : > op code = 6 at > org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)* > > - For the stacktrace at bookie side: > > 1. It is clear that BookieRequestHandler#L77 is called. That means the > message is neither v2 request nor v3 request. It is a ByteBuf. > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieRequestHandler.java#L77 > > 2. V3 request decoder is using protobuf to decode bytes array. If it is > not a valid v3 request, it will always throw exceptions. so the code > mentioned above will never be triggered > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L344 > > 3. The only case that BookieRequestHandler#L77 can happen is v2 request > decoder doesn't parse the buffer. This leads to the bug in > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L194 > > > - How this can happen? > > If the client is using v3 protocol, but both client and bookie side > stacktraces show it is using v2 protocol, that means there was a corruption > causing client and bookie switching from v3 protocol to v2 protocol. > > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L502 > https://github.com/apache/bookkeeper/blob/master/ > bookkeeper-server/src/main/java/org/apache/bookkeeper/ > proto/BookieProtoEncoding.java#L432 > > > > > > On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolive...@gmail.com> > wrote: > >> Hi all, >> I am seeing this bad errors in some test environments with 4.6.1 >> The errors appear during rolling restart of the application, the test env >> is made of 6 machines: >> - 3 bookies >> - 3 client machines (with several BK clients, of different sub-systems) >> - running with 4.6.1 both client and servers >> >> *I do not have* reports of this errors from production, already running >> 4.6.1 for the last month >> >> But the problem is quite scary >> >> This is a sample of relevant errors on clients (in this case Majordodo >> brokers, with log level = INFO) >> >> logs/org.apache.bookkeeper.client.PendingAddOp: >> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >> E2557 >> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >> E2558 >> 18-03-07-09-51-43 Write of ledger entry to quorum failed: L366634 >> E2559 >> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie >> operation timeout >> 18-03-07-16-00-05 Failed to write entry (366680, 1865): Bookie >> operation timeout >> >> >> org.apache.bookkeeper.proto.PerChannelBookieClient: >> >> 18-03-07-10-06-44 Unexpected exception caught by bookie client >> channel handler >> 18-03-07-10-06-44 io.netty.handler.codec.DecoderException: >> java.lang.IllegalStateException: Received unknown response : op code = 6 >> io.netty.handler.codec.DecoderException: java.lang.IllegalStateException: >> Received unknown response : op code = 6 >> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >> essageToMessageDecoder.java:98) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:310) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:297) >> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >> oMessageDecoder.java:413) >> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >> ToMessageDecoder.java:265) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.DefaultChannelPipeline$HeadContext. >> channelRead(DefaultChannelPipeline.java:1414) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >> ultChannelPipeline.java:945) >> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >> tLoop.java:404) >> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop. >> java:304) >> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run( >> SingleThreadEventExecutor.java:886) >> at java.base/java.lang.Thread.run(Thread.java:844) >> > > > > > >> Caused by: java.lang.IllegalStateException: Received unknown response : >> op code = 6 >> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDe >> CoderPreV3.decode(BookieProtoEncoding.java:294) >> at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDeco >> der.decode(BookieProtoEncoding.java:478) >> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >> essageToMessageDecoder.java:88) >> >> >> ... 19 more >> > > > > > > > >> >> >> Errors on Bookie side, with log level = INFO) >> >> >> io.netty.channel.DefaultChannelPipeline: >> 8-03-07-09-51-38 An exceptionCaught() event was fired, and it >> reached at the tail of the pipeline. It usually means the last handler in >> the pipeline did not handle the exception. >> 18-03-07-09-51-38 io.netty.util.IllegalReferenceCountException: >> refCnt: 0, increment: 1 >> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1 >> at io.netty.buffer.AbstractReferenceCountedByteBuf.release0(Abs >> tractReferenceCountedByteBuf.java:100) >> at io.netty.buffer.AbstractReferenceCountedByteBuf.release(Abst >> ractReferenceCountedByteBuf.java:84) >> at io.netty.util.ReferenceCountUtil.release(ReferenceCountUtil. >> java:88) >> at io.netty.channel.DefaultChannelPipeline.onUnhandledInboundMe >> ssage(DefaultChannelPipeline.java:1202) >> at io.netty.channel.DefaultChannelPipeline$TailContext. >> channelRead(DefaultChannelPipeline.java:1299) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >> annelInboundHandlerAdapter.java:86) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at org.apache.bookkeeper.proto.BookieRequestHandler.channelRead >> (BookieRequestHandler.java:78) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(Ch >> annelInboundHandlerAdapter.java:86) >> at org.apache.bookkeeper.proto.AuthHandler$ServerSideHandler.ch >> annelRead(AuthHandler.java:88) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.handler.codec.MessageToMessageDecoder.channelRead(M >> essageToMessageDecoder.java:102) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:310) >> at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead( >> ByteToMessageDecoder.java:297) >> at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteT >> oMessageDecoder.java:413) >> at io.netty.handler.codec.ByteToMessageDecoder.channelRead(Byte >> ToMessageDecoder.java:265) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRe >> ad(AbstractChannelHandlerContext.java:340) >> at io.netty.channel.DefaultChannelPipeline$HeadContext. >> channelRead(DefaultChannelPipeline.java:1414) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:362) >> at io.netty.channel.AbstractChannelHandlerContext.invokeChannel >> Read(AbstractChannelHandlerContext.java:348) >> at io.netty.channel.DefaultChannelPipeline.fireChannelRead(Defa >> ultChannelPipeline.java:945) >> at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStrea >> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) >> at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEven >> tLoop.java:404) >> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop. >> java:304) >> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run( >> SingleThreadEventExecutor.java:886) >> at java.base/java.lang.Thread.run(Thread.java:844) >> >> >> >