Applyed Sijie's fixes and added some debug:

Problem is triggered when you restart a bookie (I have a cluster of 3
bookies, WQ = 2 and AQ = 2)

Below a new error on client side ("tailing" reader)

Enrico

this is a new error on client side:
18-03-12-09-11-45    Unexpected exception caught by bookie client channel
handler
18-03-12-09-11-45    io.netty.handler.codec.DecoderException:
com.google.protobuf.InvalidProtocolBufferException: Protocol message
contained an invalid tag (zero).
io.netty.handler.codec.DecoderException:
com.google.protobuf.InvalidProtocolBufferException: Protocol message
contained an invalid tag (zero).
    at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:98)
    at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
    at
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
    at
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
    at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
    at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
    at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
    at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
    at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
    at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
    at
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:979)
    at
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:401)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:306)
    at
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
message contained an invalid tag (zero).
    at
com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:105)
    at
com.google.protobuf.CodedInputStream$StreamDecoder.readTag(CodedInputStream.java:2060)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(BookkeeperProtocol.java:10009)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse.<init>(BookkeeperProtocol.java:9975)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.parsePartialFrom(BookkeeperProtocol.java:10869)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$ReadResponse$1.parsePartialFrom(BookkeeperProtocol.java:10864)
    at
com.google.protobuf.CodedInputStream$StreamDecoder.readMessage(CodedInputStream.java:2362)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<init>(BookkeeperProtocol.java:7852)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$Response.<init>(BookkeeperProtocol.java:7782)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.parsePartialFrom(BookkeeperProtocol.java:9887)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$Response$1.parsePartialFrom(BookkeeperProtocol.java:9882)
    at
com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:221)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:239)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at
com.google.protobuf.GeneratedMessageV3.parseWithIOException(GeneratedMessageV3.java:320)
    at
org.apache.bookkeeper.proto.BookkeeperProtocol$Response.parseFrom(BookkeeperProtocol.java:8454)
    at
org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseEnDecoderV3.decode(BookieProtoEncoding.java:329)
    at
org.apache.bookkeeper.proto.BookieProtoEncoding$ResponseDecoder.decode(BookieProtoEncoding.java:470)
    at
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
    ... 19 more


2018-03-09 22:37 GMT+01:00 Sijie Guo <guosi...@gmail.com>:

> Enrico,
>
> I would suggest you applied my fixes and then debug from there. In this
> way, you will have a better sense where the first corruption is from.
>
> Sijie
>
> On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolive...@gmail.com>
> wrote:
>
> > Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolive...@gmail.com> ha
> scritto:
> >
> > > Thank you Ivan!
> > > I hope I did not mess up the dump and added ZK ports.  We are not using
> > > standard ports and in that 3  machines there is also the 3 nodes zk
> > > ensemble which is supporting BK and all the other parts of the
> > application
> > >
> > > So one explanation would be that something is connecting to the bookie
> > and
> > > this makes the bookie switch in a corrupted state by double releasing a
> > > bytebuf?
> > >
> >
> > I did some experiments and it is easy to reproduce the bookie side error
> > and the double release with a forged sequence of bytes (just using nc
> from
> > the shell)
> > But this seems not enough to break the bookie.
> > I guess there is some corruption on client side and the error on the
> bookie
> > is only and effect, as Ivan is saying.
> > My colleagues left the system running  with a deep level of debug during
> > next weekend, hopefully we will get some other stacktrace
> >
> > Enrico
> >
> >
> > > Enrico
> > >
> > >
> > > Il ven 9 mar 2018, 18:23 Ivan Kelly <iv...@apache.org> ha scritto:
> > >
> > >> I need to sign off for the day. I've done some analysis of a tcpdump
> > >> enrico sent to me out of band (may contain sensitive info so best not
> > >> to post on public forum).
> > >>
> > >> I've attached a dump of just first bit of the header. Format is
> > >> <sequence in dump> <whether a request or response>(<remote port>)
> > >> <hexdump of payload>
> > >>
> > >> There are definitely corrupt packets coming from somewhere. Search for
> > >> lines with CORRUPT.
> > >>
> > >> <snip>
> > >> 0247 -  req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b    CORRUPT
> > >> </snip>
> > >>
> > >> It's not clear whether these are originating at a valid client or not.
> > >> These trigger corrupt responses from the server, which I guess is the
> > >> double free manifesting itself. Strangely the
> > >> corrupt message seems to have a lot of data in common with what seems
> > >> like an ok message (it's clearer on fixed width font).
> > >>
> > >> <snip>
> > >> 0248 -  resp(049720) -
> > >>
> > >>
> > 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00:
> 00:00:02:86:00:05:e9:76:00:00
> > >> 0249 -  resp(049546) -
> > >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00
> CORRUPT
> > >> </snip>
> > >>
> > >> There's also some other weird traffic. Correct BK protobuf traffic
> > >> should be <4 bytes len>:00:03:....
> > >> There seems to be other traffic which is being accepted at the same
> > >> port, but looks like ZK traffic.
> > >>
> > >> Anyhow, I'll dig more on monday.
> > >>
> > >> -Ivan
> > >>
> > >>
> > >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <iv...@apache.org> wrote:
> > >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli <
> eolive...@gmail.com>
> > >> wrote:
> > >> >> Bookies
> > >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21
> > >> >> 10.168.10.116:1822 -> bookie with 4.1.12
> > >> >> 10.168.10.118:1281 -> bookie with 4.1.12
> > >> >>
> > >> >> 10.168.10.117 client machine on which I have 4.1.21 client
> (different
> > >> >> process than the bookie one)
> > >> > Oh. This dump won't have the stream we need then, as that will be on
> > >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't
> realize
> > >> > your clients and servers are colocated.
> > >> >
> > >> > -Ivan
> > >>
> > > --
> > >
> > >
> > > -- Enrico Olivelli
> > >
> > --
> >
> >
> > -- Enrico Olivelli
> >
>

Reply via email to