Enrico,

I would suggest that you apply my fixes and then debug from there. That
way, you will have a better sense of where the first corruption comes from.

Sijie

On Fri, Mar 9, 2018 at 11:48 AM Enrico Olivelli <eolive...@gmail.com> wrote:

> On Fri, Mar 9, 2018, 19:30 Enrico Olivelli <eolive...@gmail.com> wrote:
>
> > Thank you Ivan!
> > I hope I did not mess up the dump by also adding the ZK ports. We are not
> > using standard ports, and those 3 machines also host the 3-node ZK
> > ensemble that supports BK and all the other parts of the application.
> >
> > So one explanation would be that something is connecting to the bookie,
> > and this makes the bookie switch to a corrupted state by double-releasing
> > a ByteBuf?
> >
>
> I did some experiments, and it is easy to reproduce the bookie-side error
> and the double release with a forged sequence of bytes (just using nc from
> the shell).
> But this does not seem to be enough to break the bookie.
> I guess there is some corruption on the client side, and the error on the
> bookie is only an effect, as Ivan is saying.
> My colleagues left the system running with a deep level of debug logging
> over the coming weekend; hopefully we will get some other stack traces.
>
> Enrico
>
>
> > Enrico
> >
> >
> > On Fri, Mar 9, 2018, 18:23 Ivan Kelly <iv...@apache.org> wrote:
> >
> >> I need to sign off for the day. I've done some analysis of a tcpdump
> >> Enrico sent to me out of band (it may contain sensitive info, so best
> >> not to post it on a public forum).
> >>
> >> I've attached a dump of just the first bit of the header. Format is
> >> <sequence in dump> <whether a request or response>(<remote port>)
> >> <hexdump of payload>
> >>
> >> There are definitely corrupt packets coming from somewhere. Search for
> >> lines with CORRUPT.
> >>
> >> <snip>
> >> 0247 -  req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b    CORRUPT
> >> </snip>
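
For anyone who wants to filter the attached dump programmatically, here is a minimal Python sketch that parses one line in the format described above. The names `DumpLine` and `parse_dump_line` are made up for illustration, and the regular expression is only inferred from the sample lines quoted in this thread, so it may need adjusting for the full dump.

```python
import re
from typing import NamedTuple


class DumpLine(NamedTuple):
    """One line of the header dump (hypothetical structure, not a BK type)."""
    seq: int               # <sequence in dump>
    direction: str         # "req" or "resp"
    remote_port: int       # <remote port>
    payload: bytes         # <hexdump of payload>, decoded
    flagged_corrupt: bool  # True when the line carries the CORRUPT marker


# Pattern inferred from the quoted samples, e.g.
# "0247 -  req (049546) - 00:00:00:08:...    CORRUPT"
_LINE_RE = re.compile(
    r"^\s*(\d+)\s*-\s*(req|resp)\s*\(\s*(\d+)\s*\)\s*-\s*"
    r"((?:[0-9a-fA-F]{2}:)*[0-9a-fA-F]{2})\s*(CORRUPT)?\s*$"
)


def parse_dump_line(line: str) -> DumpLine:
    """Parse a single dump line; raise ValueError if it does not match."""
    m = _LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognized dump line: %r" % line)
    seq, direction, port, hexdump, flag = m.groups()
    payload = bytes.fromhex(hexdump.replace(":", ""))
    return DumpLine(int(seq), direction, int(port), payload, flag is not None)


if __name__ == "__main__":
    sample = "0247 -  req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b    CORRUPT"
    print(parse_dump_line(sample))
```

Collecting every parsed line with `flagged_corrupt` set, grouped by `remote_port`, would be one way to see which connections the corrupt packets arrive on.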
> >>
> >> It's not clear whether these are originating at a valid client or not.
> >> These trigger corrupt responses from the server, which I guess is the
> >> double free manifesting itself. Strangely, the corrupt message seems to
> >> have a lot of data in common with what seems like an OK message (it's
> >> clearer in a fixed-width font).
> >>
> >> <snip>
> >> 0248 -  resp(049720) - 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00:00:00:02:86:00:05:e9:76:00:00
> >> 0249 -  resp(049546) - 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00    CORRUPT
> >> </snip>
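
To make the overlap between responses 0248 and 0249 visible without a fixed-width font, the two payloads can be compared byte by byte. This is a small Python sketch; `parse_hex` and `differing_offsets` are hypothetical helper names, and the two hex strings are copied from the snippet above.

```python
from typing import List


def parse_hex(dump: str) -> bytes:
    """Turn a colon-separated hexdump such as '00:03:ff' into bytes."""
    return bytes.fromhex(dump.replace(":", ""))


def differing_offsets(a: bytes, b: bytes) -> List[int]:
    """Return the offsets, within the common prefix length, where a and b differ."""
    n = min(len(a), len(b))
    return [i for i in range(n) if a[i] != b[i]]


# Payloads copied from dump lines 0248 (looks OK) and 0249 (CORRUPT).
OK_RESP = parse_hex(
    "00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:"
    "00:00:00:00:00:00:02:86:00:05:e9:76:00:00"
)
CORRUPT_RESP = parse_hex(
    "00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00"
)

if __name__ == "__main__":
    print(differing_offsets(OK_RESP, CORRUPT_RESP))  # prints [3, 4, 5, 6, 7]
```

Over the 20 bytes the two captures have in common, only offsets 3 to 7 differ (the last length byte and the four bytes after the length), while offsets 8 to 19 are identical.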
> >>
> >> There's also some other weird traffic. Correct BK protobuf traffic
> >> should be <4 bytes len>:00:03:....
> >> There seems to be other traffic which is being accepted at the same
> >> port, but looks like ZK traffic.
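
The rule of thumb above (4 bytes of length followed by 00:03) can be turned into a quick filter over captured payloads. This is a rough Python sketch and not BookKeeper code: the function names are invented, the length is assumed to be big-endian, and the check looks only at the two bytes after the length prefix, so it says nothing about whether the rest of the frame is valid.

```python
def declared_length(frame: bytes) -> int:
    """Read the 4-byte length prefix (assumed big-endian) of a captured frame."""
    if len(frame) < 4:
        raise ValueError("frame shorter than the 4-byte length prefix")
    return int.from_bytes(frame[:4], "big")


def looks_like_bk_protobuf(frame: bytes) -> bool:
    """Heuristic from this thread: <4 bytes len> followed by 00:03."""
    return len(frame) >= 6 and frame[4:6] == b"\x00\x03"


if __name__ == "__main__":
    # First bytes of dump lines 0247 (CORRUPT) and 0248 (looks OK).
    samples = {
        "0247": bytes.fromhex("00000008fffffffe0000000b"),
        "0248": bytes.fromhex("0000005400030089"),
    }
    for seq, frame in samples.items():
        verdict = "ok" if looks_like_bk_protobuf(frame) else "suspect"
        print(seq, declared_length(frame), verdict)
```

Anything reported as suspect is a candidate for either the corrupt packets or the ZK-looking traffic; telling those two apart would need a further check that this sketch does not attempt.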
> >>
> >> Anyhow, I'll dig more on Monday.
> >>
> >> -Ivan
> >>
> >>
> >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <iv...@apache.org> wrote:
> >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli <eolive...@gmail.com> wrote:
> >> >> Bookies
> >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21
> >> >> 10.168.10.116:1822 -> bookie with 4.1.12
> >> >> 10.168.10.118:1281 -> bookie with 4.1.12
> >> >>
> >> >> 10.168.10.117 client machine on which I have 4.1.21 client (different
> >> >> process than the bookie one)
> >> > Oh. This dump won't have the stream we need then, as that will be on
> >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't realize
> >> > your clients and servers are colocated.
> >> >
> >> > -Ivan
> >>
> > --
> >
> >
> > -- Enrico Olivelli
> >
