Applyed Sijie's fixes and added some debug: Problem is triggered when you restart a bookie (I have a cluster of 3 bookies, WQ = 2 and AQ = 2)
Below a new error on client side ("tailing" reader) Enrico this is a new error on client side: 18-03-12-09-11-45 Unexpected exception caught by bookie client channel handler 18-03-12-09-11-45 io.netty.handler.codec.DecoderException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). io.netty.handler.codec.DecoderException: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:98) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:979) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:401) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:306) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at java.base/java.lang.Thread.run(Thread.java:844) Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero). at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:105) at com.google.protobuf.CodedInputStream$StreamDecoder.readTag(CodedInputStream.java:2060) at org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(BookkeeperProtocol.java:10009) at org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(BookkeeperProtocol.java:9975) at org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.parsePartialFrom(BookkeeperProtocol.java:10869) at org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.parsePartialFrom(BookkeeperProtocol.java:10864) at com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2362) at org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<init>(BookkeeperProtocol.java:7852) at org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<init>(BookkeeperProtocol.java:7782) at org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.parsePartialFrom(BookkeeperProtocol.java:9887) at org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.parsePartialFrom(BookkeeperProtocol.java:9882) at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:221) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:239) at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49) at com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:320) at org.apache.bookkeeper.proto.BookkeeperProtocol$Response.parseFrom(BookkeeperProtocol.java:8454) at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDecoderV3.decode(BookieProtoEncoding.java:329) at org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDecoder.decode(BookieProtoEncoding.java:470) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88) ... 19 more 2018-03-09 22:37 GMT+01:00 Sijie Guo <guosi...@gmail.com>: > Enrico, > > I would suggest you applied my fixes and then debug from there. In this > way, you will have a better sense where the first corruption is from. > > Sijie > > On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolive...@gmail.com> > wrote: > > > Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolive...@gmail.com> ha > scritto: > > > > > Thank you Ivan! > > > I hope I did not mess up the dump and added ZK ports. We are not using > > > standard ports and in that 3 machines there is also the 3 nodes zk > > > ensemble which is supporting BK and all the other parts of the > > application > > > > > > So one explanation would be that something is connecting to the bookie > > and > > > this makes the bookie switch in a corrupted state by double releasing a > > > bytebuf? > > > > > > > I did some experiments and it is easy to reproduce the bookie side error > > and the double release with a forged sequence of bytes (just using nc > from > > the shell) > > But this seems not enough to break the bookie. > > I guess there is some corruption on client side and the error on the > bookie > > is only and effect, as Ivan is saying. > > My colleagues left the system running with a deep level of debug during > > next weekend, hopefully we will get some other stacktrace > > > > Enrico > > > > > > > Enrico > > > > > > > > > Il ven 9 mar 2018, 18:23 Ivan Kelly <iv...@apache.org> ha scritto: > > > > > >> I need to sign off for the day. I've done some analysis of a tcpdump > > >> enrico sent to me out of band (may contain sensitive info so best not > > >> to post on public forum). > > >> > > >> I've attached a dump of just first bit of the header. Format is > > >> <sequence in dump> <whether a request or response>(<remote port>) > > >> <hexdump of payload> > > >> > > >> There are definitely corrupt packets coming from somewhere. Search for > > >> lines with CORRUPT. > > >> > > >> <snip> > > >> 0247 - req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b CORRUPT > > >> </snip> > > >> > > >> It's not clear whether these are originating at a valid client or not. > > >> These trigger corrupt responses from the server, which I guess is the > > >> double free manifesting itself. Strangely the > > >> corrupt message seems to have a lot of data in common with what seems > > >> like an ok message (it's clearer on fixed width font). > > >> > > >> <snip> > > >> 0248 - resp(049720) - > > >> > > >> > > 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00: > 00:00:02:86:00:05:e9:76:00:00 > > >> 0249 - resp(049546) - > > >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00 > CORRUPT > > >> </snip> > > >> > > >> There's also some other weird traffic. Correct BK protobuf traffic > > >> should be <4 bytes len>:00:03:.... > > >> There seems to be other traffic which is being accepted at the same > > >> port, but looks like ZK traffic. > > >> > > >> Anyhow, I'll dig more on monday. > > >> > > >> -Ivan > > >> > > >> > > >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <iv...@apache.org> wrote: > > >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli < > eolive...@gmail.com> > > >> wrote: > > >> >> Bookies > > >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21 > > >> >> 10.168.10.116:1822 -> bookie with 4.1.12 > > >> >> 10.168.10.118:1281 -> bookie with 4.1.12 > > >> >> > > >> >> 10.168.10.117 client machine on which I have 4.1.21 client > (different > > >> >> process than the bookie one) > > >> > Oh. This dump won't have the stream we need then, as that will be on > > >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't > realize > > >> > your clients and servers are colocated. > > >> > > > >> > -Ivan > > >> > > > -- > > > > > > > > > -- Enrico Olivelli > > > > > -- > > > > > > -- Enrico Olivelli > > >