2018-03-09 13:48 GMT+01:00 Ivan Kelly <iv...@apache.org>: > Great analysis Sijie. > > Enrico, are these high traffic machines? Would it be feasible to put > tcpdump running? You could even truncate each message to 100 bytes or > so, to avoid storing payloads. It'd be very useful to see what the > corrupt traffic actually looks like. >
It is a really small cluster for manual QA tests (3 application machines + 3 bookie machines), very low traffic. I have restricted the error isolating a single sub-system (the Majordodo broker) which is actually a replicated-state-machine with only one leader and two followers in this case. Any suggestion on the tcpdump config ? (command line example) After reverting to Netty 4.1.12 the error did not disappear but it is rarely reproducible. I will revert to Netty 4.1.21 to run the debug as you are requiring. We are checking for network issues but all the machines are inside the same VMWare system and sharing most of the hardward (but my collegues did not make a report to me yet) Enrico > > -Ivan > > On Fri, Mar 9, 2018 at 10:43 AM, Enrico Olivelli <eolive...@gmail.com> > wrote: > > Reverted to Netty 4.1.12. System is "more" stable but after "some" > restart > > we still have errors on client side on tailing readers, rebooting the JMV > > "resolved" temporary the problem. > > > > I have no more errors on the Bookie side > > > > My idea: > > - client is reading from 2 bookies, there is some bug in this area > > - once the reader client messes up Netty stops working properly and > > corrupts ByteBufs > > - once the reader JMV is promoted to writer (becomes leader and starts > > writing) its internal BookKeeper client is still messed up and writes are > > not able to complete > > > > > > Enrico > > > > > > > > > > 2018-03-09 9:55 GMT+01:00 Enrico Olivelli <eolive...@gmail.com>: > > > >> > >> > >> 2018-03-09 8:59 GMT+01:00 Sijie Guo <guosi...@gmail.com>: > >> > >>> Sent out a PR for the issues that I observed: > >>> > >>> https://github.com/apache/bookkeeper/pull/1240 > >>> > >> > >> > >> Other findings: > >> - my problem is not related to jdk9, it happens with jdk8 too > >> - the "tailing reader" is able to make progress and follow the WAL, so > not > >> all the reads fail > >> - the "writer" is able to make progress and write to the WAL, so not all > >> the write fail > >> > >> I have run BK 4.6.1 tests with Netty 4.1.21Final but there is no issue > >> (quite the OS as the machines with the error, I am on Fedora, machines > are > >> on CentOS) > >> > >> Enrico > >> > >> > >>> > >>> > >>> On Thu, Mar 8, 2018 at 10:47 PM, Sijie Guo <guosi...@gmail.com> wrote: > >>> > >>>> So the problem here is: > >>>> > >>>> - a corrupted request failed the V3 request decoder, so bookie > switched > >>>> to use v2 request decoder. Once the switch happen, the bookie will > always > >>>> use v2 request decoder decoding v3 request. then all your v3 requests > will > >>>> be failing with unknown op and trigger the bytebuf double releasing > issue. > >>>> > >>>> - a corrupted response failed the V3 response decoder, so client > >>>> switched to use v3 response decoder. Once the switch happen, the > client > >>>> will always use v2 request decoder decoding v3 response. so all the > >>>> responses will be failing with " Received unknown response : op code" > >>>> > >>>> Although I don't know how the first request/response happened (it can > be > >>>> any issue, even network corruption), the problem is once this happen, > >>>> either client/bookie will be forcing to use v2 request/response > decoder and > >>>> never change. so the problem will remain until you restarted. And > this is > >>>> the behavior that Enrico is seeing. > >>>> > >>>> There are a couple of issues to address here: > >>>> > >>>> 1) we need to add a flag to disable falling back to use v2 > >>>> request/response coder and make sure it always use v3 protocol. In > this > >>>> way, we will guarantee the problem not persist in the channel level. > >>>> 2) we need to throw exception on unknown op code at request decode : > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeepe > >>>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro > >>>> toEncoding.java#L195 . As what we did at response decoder : > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeepe > >>>> r-server/src/main/java/org/apache/bookkeeper/proto/BookiePro > >>>> toEncoding.java#L304 in https://github.com/apache/ > bookkeeper/issues/198 > >>>> > >>>> > >>>> Details are listed as below: > >>>> > >>>> -- > >>>> > >>>> Details: > >>>> > >>>> - The client side stacktrace clearly showed that it is using v2 > decoder > >>>> on decoding responses. That means client failed to parse v3 response > and it > >>>> falls back to use v2 decoder on decoding responses. Because it is a > >>>> "corrupted" v3 response, so v2 decoder can't > >>>> find a right op code. Then it throws illegal state exception. > >>>> > >>>> > >>>> *Caused by: java.lang.IllegalStateException: Received unknown > response : > >>>> op code = 6 at > >>>> org.apache.bookkeeper.proto.BookieProtoEncoding$ > ResponseEnDeCoderPreV3.decode(BookieProtoEncoding.java:294)* > >>>> > >>>> - For the stacktrace at bookie side: > >>>> > >>>> 1. It is clear that BookieRequestHandler#L77 is called. That means the > >>>> message is neither v2 request nor v3 request. It is a ByteBuf. > >>>> > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- > >>>> server/src/main/java/org/apache/bookkeeper/proto/BookieReque > >>>> stHandler.java#L77 > >>>> > >>>> 2. V3 request decoder is using protobuf to decode bytes array. If it > is > >>>> not a valid v3 request, it will always throw exceptions. so the code > >>>> mentioned above will never be triggered > >>>> > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- > >>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto > >>>> Encoding.java#L344 > >>>> > >>>> 3. The only case that BookieRequestHandler#L77 can happen is v2 > request > >>>> decoder doesn't parse the buffer. This leads to the bug in > >>>> > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- > >>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto > >>>> Encoding.java#L194 > >>>> > >>>> > >>>> - How this can happen? > >>>> > >>>> If the client is using v3 protocol, but both client and bookie side > >>>> stacktraces show it is using v2 protocol, that means there was a > corruption > >>>> causing client and bookie switching from v3 protocol to v2 protocol. > >>>> > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- > >>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto > >>>> Encoding.java#L502 > >>>> https://github.com/apache/bookkeeper/blob/master/bookkeeper- > >>>> server/src/main/java/org/apache/bookkeeper/proto/BookieProto > >>>> Encoding.java#L432 > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> On Thu, Mar 8, 2018 at 12:18 AM, Enrico Olivelli <eolive...@gmail.com > > > >>>> wrote: > >>>> > >>>>> Hi all, > >>>>> I am seeing this bad errors in some test environments with 4.6.1 > >>>>> The errors appear during rolling restart of the application, the test > >>>>> env is made of 6 machines: > >>>>> - 3 bookies > >>>>> - 3 client machines (with several BK clients, of different > sub-systems) > >>>>> - running with 4.6.1 both client and servers > >>>>> > >>>>> *I do not have* reports of this errors from production, already > running > >>>>> 4.6.1 for the last month > >>>>> > >>>>> But the problem is quite scary > >>>>> > >>>>> This is a sample of relevant errors on clients (in this case > Majordodo > >>>>> brokers, with log level = INFO) > >>>>> > >>>>> logs/org.apache.bookkeeper.client.PendingAddOp: > >>>>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: > L366634 > >>>>> E2557 > >>>>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: > L366634 > >>>>> E2558 > >>>>> 18-03-07-09-51-43 Write of ledger entry to quorum failed: > L366634 > >>>>> E2559 > >>>>> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie > >>>>> operation timeout > >>>>> 18-03-07-15-59-55 Failed to write entry (366680, 1865): Bookie > >>>>> operation timeout > >>>>> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie > >>>>> operation timeout > >>>>> 18-03-07-16-00-00 Failed to write entry (366680, 1865): Bookie > >>>>> operation timeout > >>>>> 18-03-07-16-00-05 Failed to write entry (366680, 1865): Bookie > >>>>> operation timeout > >>>>> > >>>>> > >>>>> org.apache.bookkeeper.proto.PerChannelBookieClient: > >>>>> > >>>>> 18-03-07-10-06-44 Unexpected exception caught by bookie client > >>>>> channel handler > >>>>> 18-03-07-10-06-44 io.netty.handler.codec.DecoderException: > >>>>> java.lang.IllegalStateException: Received unknown response : op > code = > >>>>> 6 > >>>>> io.netty.handler.codec.DecoderException: java.lang. > IllegalStateException: > >>>>> Received unknown response : op code = 6 > >>>>> at io.netty.handler.codec.MessageToMessageDecoder. > channelRead(M > >>>>> essageToMessageDecoder.java:98) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > fireChannelRead( > >>>>> ByteToMessageDecoder.java:310) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > fireChannelRead( > >>>>> ByteToMessageDecoder.java:297) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > callDecode(ByteT > >>>>> oMessageDecoder.java:413) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > channelRead(Byte > >>>>> ToMessageDecoder.java:265) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.channel.DefaultChannelPipeline$ > HeadContext.channelR > >>>>> ead(DefaultChannelPipeline.java:1414) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.DefaultChannelPipeline. > fireChannelRead(Defa > >>>>> ultChannelPipeline.java:945) > >>>>> at io.netty.channel.epoll.AbstractEpollStreamChannel$ > EpollStrea > >>>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) > >>>>> at io.netty.channel.epoll.EpollEventLoop.processReady( > EpollEven > >>>>> tLoop.java:404) > >>>>> at io.netty.channel.epoll.EpollEventLoop.run( > EpollEventLoop.jav > >>>>> a:304) > >>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$5. > run(Sin > >>>>> gleThreadEventExecutor.java:886) > >>>>> at java.base/java.lang.Thread.run(Thread.java:844) > >>>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>>> Caused by: java.lang.IllegalStateException: Received unknown > response > >>>>> : op code = 6 > >>>>> at org.apache.bookkeeper.proto.BookieProtoEncoding$ > ResponseEnDe > >>>>> CoderPreV3.decode(BookieProtoEncoding.java:294) > >>>>> at org.apache.bookkeeper.proto.BookieProtoEncoding$ > ResponseDeco > >>>>> der.decode(BookieProtoEncoding.java:478) > >>>>> at io.netty.handler.codec.MessageToMessageDecoder. > channelRead(M > >>>>> essageToMessageDecoder.java:88) > >>>>> > >>>>> > >>>>> ... 19 more > >>>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>>> > >>>>> > >>>>> Errors on Bookie side, with log level = INFO) > >>>>> > >>>>> > >>>>> io.netty.channel.DefaultChannelPipeline: > >>>>> 8-03-07-09-51-38 An exceptionCaught() event was fired, and it > >>>>> reached at the tail of the pipeline. It usually means the last > handler in > >>>>> the pipeline did not handle the exception. > >>>>> 18-03-07-09-51-38 io.netty.util.IllegalReferenceCountException > : > >>>>> refCnt: 0, increment: 1 > >>>>> io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: > 1 > >>>>> at io.netty.buffer.AbstractReferenceCountedByteBu > f.release0(Abs > >>>>> tractReferenceCountedByteBuf.java:100) > >>>>> at io.netty.buffer.AbstractReferenceCountedByteBu > f.release(Abst > >>>>> ractReferenceCountedByteBuf.java:84) > >>>>> at io.netty.util.ReferenceCountUtil.release( > ReferenceCountUtil. > >>>>> java:88) > >>>>> at io.netty.channel.DefaultChannelPipeline. > onUnhandledInboundMe > >>>>> ssage(DefaultChannelPipeline.java:1202) > >>>>> at io.netty.channel.DefaultChannelPipeline$ > TailContext.channelR > >>>>> ead(DefaultChannelPipeline.java:1299) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.channel.ChannelInboundHandlerAdapter. > channelRead(Ch > >>>>> annelInboundHandlerAdapter.java:86) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at org.apache.bookkeeper.proto.BookieRequestHandler. > channelRead > >>>>> (BookieRequestHandler.java:78) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.channel.ChannelInboundHandlerAdapter. > channelRead(Ch > >>>>> annelInboundHandlerAdapter.java:86) > >>>>> at org.apache.bookkeeper.proto. > AuthHandler$ServerSideHandler.ch > >>>>> annelRead(AuthHandler.java:88) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.handler.codec.MessageToMessageDecoder. > channelRead(M > >>>>> essageToMessageDecoder.java:102) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > fireChannelRead( > >>>>> ByteToMessageDecoder.java:310) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > fireChannelRead( > >>>>> ByteToMessageDecoder.java:297) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > callDecode(ByteT > >>>>> oMessageDecoder.java:413) > >>>>> at io.netty.handler.codec.ByteToMessageDecoder. > channelRead(Byte > >>>>> ToMessageDecoder.java:265) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > fireChannelRe > >>>>> ad(AbstractChannelHandlerContext.java:340) > >>>>> at io.netty.channel.DefaultChannelPipeline$ > HeadContext.channelR > >>>>> ead(DefaultChannelPipeline.java:1414) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:362) > >>>>> at io.netty.channel.AbstractChannelHandlerContext. > invokeChannel > >>>>> Read(AbstractChannelHandlerContext.java:348) > >>>>> at io.netty.channel.DefaultChannelPipeline. > fireChannelRead(Defa > >>>>> ultChannelPipeline.java:945) > >>>>> at io.netty.channel.epoll.AbstractEpollStreamChannel$ > EpollStrea > >>>>> mUnsafe.epollInReady(AbstractEpollStreamChannel.java:800) > >>>>> at io.netty.channel.epoll.EpollEventLoop.processReady( > EpollEven > >>>>> tLoop.java:404) > >>>>> at io.netty.channel.epoll.EpollEventLoop.run( > EpollEventLoop.jav > >>>>> a:304) > >>>>> at io.netty.util.concurrent.SingleThreadEventExecutor$5. > run(Sin > >>>>> gleThreadEventExecutor.java:886) > >>>>> at java.base/java.lang.Thread.run(Thread.java:844) > >>>>> > >>>>> > >>>>> > >>>> > >>> > >> >