Il ven 9 mar 2018, 19:30 Enrico Olivelli <eolive...@gmail.com> ha scritto:
> Thank you Ivan! > I hope I did not mess up the dump and added ZK ports. We are not using > standard ports and in that 3 machines there is also the 3 nodes zk > ensemble which is supporting BK and all the other parts of the application > > So one explanation would be that something is connecting to the bookie and > this makes the bookie switch in a corrupted state by double releasing a > bytebuf? > I did some experiments and it is easy to reproduce the bookie side error and the double release with a forged sequence of bytes (just using nc from the shell) But this seems not enough to break the bookie. I guess there is some corruption on client side and the error on the bookie is only and effect, as Ivan is saying. My colleagues left the system running with a deep level of debug during next weekend, hopefully we will get some other stacktrace Enrico > Enrico > > > Il ven 9 mar 2018, 18:23 Ivan Kelly <iv...@apache.org> ha scritto: > >> I need to sign off for the day. I've done some analysis of a tcpdump >> enrico sent to me out of band (may contain sensitive info so best not >> to post on public forum). >> >> I've attached a dump of just first bit of the header. Format is >> <sequence in dump> <whether a request or response>(<remote port>) >> <hexdump of payload> >> >> There are definitely corrupt packets coming from somewhere. Search for >> lines with CORRUPT. >> >> <snip> >> 0247 - req (049546) - 00:00:00:08:ff:ff:ff:fe:00:00:00:0b CORRUPT >> </snip> >> >> It's not clear whether these are originating at a valid client or not. >> These trigger corrupt responses from the server, which I guess is the >> double free manifesting itself. Strangely the >> corrupt message seems to have a lot of data in common with what seems >> like an ok message (it's clearer on fixed width font). >> >> <snip> >> 0248 - resp(049720) - >> >> 00:00:00:54:00:03:00:89:00:00:02:86:00:07:e2:b1:00:00:00:00:00:00:02:86:00:05:e9:76:00:00 >> 0249 - resp(049546) - >> 00:00:00:10:ff:ff:ff:fe:00:00:02:86:00:07:e2:b1:00:00:00:00 CORRUPT >> </snip> >> >> There's also some other weird traffic. Correct BK protobuf traffic >> should be <4 bytes len>:00:03:.... >> There seems to be other traffic which is being accepted at the same >> port, but looks like ZK traffic. >> >> Anyhow, I'll dig more on monday. >> >> -Ivan >> >> >> On Fri, Mar 9, 2018 at 3:27 PM, Ivan Kelly <iv...@apache.org> wrote: >> > On Fri, Mar 9, 2018 at 3:20 PM, Enrico Olivelli <eolive...@gmail.com> >> wrote: >> >> Bookies >> >> 10.168.10.117:1822 -> bad bookie with 4.1.21 >> >> 10.168.10.116:1822 -> bookie with 4.1.12 >> >> 10.168.10.118:1281 -> bookie with 4.1.12 >> >> >> >> 10.168.10.117 client machine on which I have 4.1.21 client (different >> >> process than the bookie one) >> > Oh. This dump won't have the stream we need then, as that will be on >> > loopback. Try adding "-i any" to the tcpdump. Sorry, I didn't realize >> > your clients and servers are colocated. >> > >> > -Ivan >> > -- > > > -- Enrico Olivelli > -- -- Enrico Olivelli